mrcbax / frwl

From Russia with love, let's traceroute the coming shutdown.
GNU General Public License v3.0

servers.txt server "Claiming" #4

Closed Colseph closed 5 years ago

Colseph commented 5 years ago

saw this mentioned on reddit by /u/turn-down-for-what. what if we had the script iterate through the servers? the structure would look something like

.
├── from_russia_with_love_comp
│   ├── ip1
│   │   └── 0.2342342.ip1.tar.xz
│   ├── ip2
│   │   └── 0.2342342.ip2.tar.xz
│   └── ip3
│       └── 0.2342342.ip3.tar.xz
├── frwl.2019-02-12.log
├── hashes.txt
├── LICENSE
├── ping_russia.sh
├── README.md
├── servers.txt
└── working_dir
    ├── ip1
    │   ├── 0.1231243.ip1.new
    │   └── 0.1235345.ip1.old
    ├── ip2
    │   ├── 0.1231243.ip2.new
    │   └── 0.1235345.ip2.old
    └── ip3
        ├── 0.1231243.ip3.new
        └── 0.1235345.ip3.old

and for servers.txt - it'll only use lines that don't have a '#' in them, so you can add comments (i.e. the pools and explanation; the '#' can be anywhere in the line). i get that it wouldn't be as many traces per minute per server, but it might give a better overall image? and if we had enough people running it, i feel like it'd have pretty good coverage.
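rough sketch of the filtering i mean (the file name and the loop body are just placeholders for whatever the script actually does):

grep -v '#' servers.txt | grep -v '^$' | while read -r SERVER; do
    echo "would trace ${SERVER}"    # placeholder for the actual traceroute call
done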

pros:

cons:

i've got it implemented and it seems to be working fine. i'll get it up on my fork and you can see if you like it; if so i can submit a pull request.

might be worth only having some people run it this way? idk

mrcbax commented 5 years ago

good idea. i'll look at it when i have some time

Colseph commented 5 years ago

got the changes up here

mrcbax commented 5 years ago

I like it, but we probably need a way to choose just a few servers instead of spanning the whole list.

Colseph commented 5 years ago

yeah, that'd probably be a good plan. might try something like having it pick random ones, then add them to a 'selected.txt' if one hasn't already been generated. the number of randomly selected servers can be changed in the config section, so users can choose if they want just one, or even all. (what's a good default number?) then, if they want to get new servers, they just need to delete 'selected.txt' and it'll grab new ones. i've got work, but i'll test it tomorrow when i have time if someone doesn't beat me to it.
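something like this is what i have in mind (NUM_SERVERS and the exact file names are placeholders, not final):

NUM_SERVERS=5    # configurable in the config section; what's a good default?
if [ ! -f selected.txt ]; then
    # pick N random servers once and reuse them on every run
    grep -v '#' servers.txt | grep -v '^$' | shuf -n ${NUM_SERVERS} > selected.txt
fi
# delete selected.txt to get a fresh random set next run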

TamirAl commented 5 years ago

I think the best case would be to pick different IPs based on the city... it wouldn't make sense to just scan Moscow IPs. here are the top 5 cities the IPs i loaded are from... i can add more detail to another file where we can iterate through random IPs from the 5 cities:

Moscow, Saint Petersburg, Krasnoyarsk, Yekaterinburg, Krasnodar

gidoBOSSftw5731 commented 5 years ago

I have a rudimentary script (legit just a while loop) that i've been using, which just uses head -n 500 servers.txt > serverstoping.txt to select servers. I presume this will NOT fit your quality standards and am open to improving it.

danukefl commented 5 years ago

I wonder if it is possible to have a public document that the script could reach out to and grab an IP that has the least allocations. Load balancing, in principle.

gidoBOSSftw5731 commented 5 years ago

it's possible, assuming we won't ever use all of the IPs. just pull/push the IPs in use from something in a cloud, or add a check-in/out system, and you'd need to prove that you're actually using it...

Colseph commented 5 years ago

could use curl w/ php & a sql db on a vps or something. initially the script would curl & request IPs and the server would return the ones w/ the least amount of users. then every loop (could use timestamps to make sure it doesn't run too often and overload the server, i.e. only run if time is greater than...) it would send the list of IPs it's using along w/ an id. if the server doesn't get a POST w/ that id containing a certain IP for x time, it sets the user counter back by 1? so we'd need each script to have a unique id like a hash or something... could generate a hash from /dev/urandom input, then save it to id.txt or something so it keeps the same id?
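the id bit would be something like this (file name and sizes are just placeholders):

if [ ! -f id.txt ]; then
    # generate a random client id once, then reuse it on every run
    dd if=/dev/urandom bs=64 count=1 2>/dev/null | md5sum | awk '{print $1}' > id.txt
fi
ID=$(cat id.txt)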

but we could have the problem of the server getting DDoS'd, and someone could prevent an NTP server from being traced by spoofing POSTs containing its IP and a random id

idk, either way it would be hard to actually prove the IPs are being used

danukefl commented 5 years ago

Yeah, security is always going to be an issue. If we restrict within a login, then those have to be managed, and down the rabbit hole we go.

Maybe we can just have another thread where we respond with the IPs we are working on and ask everyone to grab ranges at random. For example, I just went and grabbed 2500-2600 on the list.
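(For reference, grabbing a block like that is just something like the following; the output file name is whatever your loop reads.)

sed -n '2500,2600p' servers.txt > my_servers.txt    # lines 2500-2600 of the list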

gidoBOSSftw5731 commented 5 years ago

I hate to say it, but decentralization is key, and the only place I know of that does this is blockchain...

mrcbax commented 5 years ago

Wait, can't we just use one of those real-time document editing apps that run over IPFS? That way a bunch of people can all claim IPs at the same time, and only those with the IPNS or hash will have access.

mrcbax commented 5 years ago

Also, we could make a link-editable Google spreadsheet. Gross.

gidoBOSSftw5731 commented 5 years ago

That would be too close to being centralized, I thought.

mrcbax commented 5 years ago

The IPFS idea isn't, the spreadsheet is. But honestly, it's just a list of IPs; it's kinda hard for anyone to submit a takedown reason to Google. That, and Google isn't on too great terms with Russia.

Colseph commented 5 years ago

added an option for the user to choose how many random servers to put in selected_servers.txt; here's the merge-ready branch. i also added a tmux wrapper that i can remove if you want me to do a pull request (it's more of an experiment really)

Colseph commented 5 years ago

just saw that morrowc's pull request uses sort -R instead of shuf, which i feel is more widely installed by default across distros, and probably a better choice.

i did come across a somewhat minor issue with the ITER variable that is present in both @morrowc 's code and mine

morrowc commented 5 years ago

for your filesystem problem(s) I think you might consider something like: in the working dir, create 2 or 3 levels of directory: a-z0-9 | a-z0-9 | a-z0-9

this gives quite a spread of directories to fill with files, is easy to create on-demand (loop of loop of loop) and spreads the files you create over 36*36*36 possible end directories. you can mechanically create the path in the working while-loop easily as well, using something like:

$ echo $(dd if=/dev/urandom bs=512 count=1 2>&1) | md5sum | tail -1 | awk '{print $1}'
3f4a2b80252e27c333f1983557a6bc59

to get 'random' enough data to build the directory / path. expanding a bit:

$ echo $(dd if=/dev/urandom bs=512 count=1 2>&1) | md5sum | tail -1 | awk '{print $1}' | cut -b1,2,3 --output-delimiter=/
0/4/5

you get a new directory to put the file in on each run...
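putting it together, roughly (WORKING_DIR and DATA_FILE are placeholders for whatever the script already uses):

DIR=$(echo $(dd if=/dev/urandom bs=512 count=1 2>&1) | md5sum | tail -1 | awk '{print $1}' | cut -b1,2,3 --output-delimiter=/)
mkdir -p "${WORKING_DIR}/${DIR}"            # e.g. working_dir/0/4/5
mv "${DATA_FILE}" "${WORKING_DIR}/${DIR}/"  # drop the data file into the hashed path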

gidoBOSSftw5731 commented 5 years ago

@morrowc I think it would be better to have it be time-based (unix, or a more standard year-month-day-hour) so you know what you're looking at and can debug issues etc. (remember the human!)

morrowc commented 5 years ago

humans are fallible; depend on machines for all of this. really, at the scale of the file count being created, a person is never going to look at this. you'll always have a machine do the initial 'find me the file' (with find), then look: "Oh, this is why fileX fails loading into the next stage."

morrowc commented 5 years ago

_checkPath() {
  # Creates all paths required in the working directory.
  for one in {a..z} $(seq 0 9); do
    for two in {a..z} $(seq 0 9); do
      for three in {a..z} $(seq 0 9); do
        mkdir -p ${WORKING_DIR}/${one}/${two}/${three}
      done
    done
  done
  _log date "[_checkPath]checking directory $1"
}

makes the directories properly. (I'll send a PR with this bit.)

morrowc commented 5 years ago

PR #17 has the above change and the random directory bits.

Colseph commented 5 years ago

i think i got a fix for the ITER & COMP_ITER variables! i still need to actually test it IN the script, but if it works, we'll get:

only con (i can think of) is that if people want a fresh start they need to delete the save files

@LogoiLab if you don't want the entire iteration mumbo jumbo, i can make a pull request w/ just this, so we have persisting ITER and COMP_ITER variables across script crashes or stop/starts.
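the gist of the persistence bit (the save-file name here is just a placeholder):

# load the saved counters if they exist, otherwise start from zero
if [ -f iter.save ]; then
    source iter.save
else
    ITER=0
    COMP_ITER=0
fi
# ... do the work, bump the counters ...
ITER=$((ITER + 1))
# write them back out so a crash or restart picks up where we left off
printf 'ITER=%s\nCOMP_ITER=%s\n' "${ITER}" "${COMP_ITER}" > iter.save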

mrcbax commented 5 years ago

So this is solved by PR #15 and PR #16? Can we close this since we merged?

Colseph commented 5 years ago

i guess

claiming servers

the problem: we wanted to make sure every server gets traced, or as close to it as possible.

the current solution:

grep -v '^#' ${SERVERS} | grep -v '^$' | /usr/bin/sort -R | head -${PROBES}

now to run the script, you need to supply the number of servers you want:

bash ./ping_russia.sh 10   # if i wanted 10 random servers from servers.txt
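in the script that boils down to roughly this (exact variable and file names may differ):

PROBES=${1}    # number of random servers to claim, straight from the command line
grep -v '^#' ${SERVERS} | grep -v '^$' | /usr/bin/sort -R | head -${PROBES} > selected_servers.txt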

claiming servers problem solved

NOTE: the selection is random every time, so if you stop the script and then start it again, you might get completely different servers.

ITER/COMP_ITER variables

the original problem: the counter variables are global (not only literally, but also in the sense that they are shared by all the servers we iterate over). so i run the script, it scans server1 and you get 0.1234567.old and 0.1234579.new - ok, no problem... yet. now it scans server2 and you get 1.1234597.old and 1.1234607.new. wait... server2 has an ITER value of 1 when this is the first time it's been scanned - there's the problem. also, when you tar, it resets ITER back to zero, and since it's shared, they all go back to zero.

the current fix: make a very large number of directories to spread the data file pairs across

1. make directories

for one in {a..z} $(seq 0 9); do
    for two in {a..z} $(seq 0 9); do
      for three in {a..z} $(seq 0 9); do
        mkdir -p $1/${one}/${two}/${three}
...

2. then we randomly pick a folder to put the data in

DIR=$(echo $(dd if=/dev/urandom bs=512 count=1 2>&1) | md5sum | tail -1 | awk '{print $1}' | cut -b1,2,3 --output-delimiter=/)
echo ${DIR}

i'll be honest, the whole DIR=$(echo $(dd if=/dev/urandom bs=512 count=1 2>&1) | md5sum | tail -1 | awk '{print $1}' | cut -b1,2,3 --output-delimiter=/) line is pretty darn cool

as for the reasoning: @morrowc will have to shed some light on this. if i understand correctly, this was to stop clobbering, except i don't think clobbering was really an issue to begin with; the $TIME variable generated by date +%s stopped that. the problem was the per-server $ITER variables interfering with each other (because they are the same variable). but to be honest, that was more my OCD than an actual problem, since you're going to be using $TIME along with the timestamps for parsing etc.
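for reference, the data file names are basically built like this (going off the examples above; variable names approximate):

TIME=$(date +%s)
FILE="${ITER}.${TIME}.${SERVER}.new"    # e.g. 0.1231243.ip1.new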

ITER/COMP_ITER variables: problem solved (as it wasn't really a big problem to begin with)

i'll go ahead and close this issue since "server claiming" has been implemented, and that's why i opened it.

however, there are some things you might want to look at.

organization: _checkPath makes subdirs a-z 0-9, 3 levels deep, in every path that is checked. that's like 47990 folders (according to find . -type d | wc -l), and that's only counting one of the two directories that are checked. that's ALSO ~188M, which means, as the script is now, you'll just be making a bunch of tarballs full of empty folders plus 1 .new and 1 .old data file.

also, the tar archive is created with the name of the current server in the loop, which is fine (that's how mine works). but since the data isn't organized by server, you'll get a tarball called something like 2.2342343.123.456.789.tar.xz when it also has files from 987.654.321 and 111.111.111 etc. this means when you go to parse, you'll need to extract the tarball (which you'd do anyway), but then to get the server, the parsing script needs to read the contents of each data file to know which server it belongs to (which might actually be a good plan anyway, since scripts might mess up etc.), so it might not be a huge issue.

anyways, good luck.

morrowc commented 5 years ago

> ITER/COMP_ITER variables
>
> the original problem: the counter variables are global (not only literally, but also in the sense that they are shared by all the servers we iterate over). so i run the script, it scans server1 and you get 0.1234567.old and 0.1234579.new - ok, no problem... yet. now it scans server2 and you get 1.1234597.old and 1.1234607.new. wait... server2 has an ITER value of 1 when this is the first time it's been scanned - there's the problem. also, when you tar, it resets ITER back to zero, and since it's shared, they all go back to zero.

Perhaps the question to ask here is: "What is ITER/COMP_ITER supposed to provide you?" Don't add things to your data that you don't know what to do with, or how to manage.

If you want to make sure the traceroute data has a known time sequence, then add the time (a unix timestamp, for instance) to the filename. The actual number of times you've been over any particular IP (the iteration number) is not important in the filename when you can just:

find . -name '*ip*' | wc -l

or similar... Or, really: "Parse the files into a database, deal with the stats from there"

> the current fix: make a very large number of directories to spread the data file pairs across
>
> as for the reasoning: @morrowc will have to shed some light on this. if i understand correctly, this was to stop clobbering, except i don't think clobbering was really an issue to begin with; the $TIME variable generated by date +%s stopped that.

The problem is not clobbering; it's that very few filesystems behave well when you put large numbers of files into a single directory. Flat file layouts are never a good idea if you need to scan them later (tar them, list them, stat all the files, etc.). All systems that generate lots (or could generate lots) of files split the files out over many, many directories (this exact same hashed mechanism), because filesystem performance degrades significantly as the number of files in a directory grows large.

> organization: _checkPath makes subdirs a-z 0-9, 3 levels deep, in every path that is checked. that's like 47990 folders (according to find . -type d | wc -l), and that's only counting one of the two directories that are checked. that's ALSO ~188M, which means, as the script is now, you'll just be making a bunch of tarballs full of empty folders plus 1 .new and 1 .old data file.

the empty directories aren't important; you care about the files. you'll iterate over the files in the filesystem and pull data from them as required. If you want less hash/splay, then just make 2 levels, not 3, but really that isn't important here. What is important is not killing your system when you fill a directory with 40k files/etc.

> also, the tar archive is created with the name of the current server in the loop, which is fine (that's how mine works). but since the data isn't organized by server, you'll get a tarball called something like 2.2342343.123.456.789.tar.xz when it also has files from 987.654.321 and 111.111.111 etc. the parsing script needs to read the contents of each data file to know which server it belongs to.

yes, filenames are immaterial, save (perhaps) the timestamp. everything else you need is, in fact, in the file itself. People will (and should) never have to see the files; stop making names that are significant for anything except (perhaps) the timestamp of creation.


Colseph commented 5 years ago

i had a whole write-up, nicely formatted... but honestly idc, so sure. gg.

i would recommend testing and debugging before you PR though. as for me, i think imma play some nekopara :P

gidoBOSSftw5731 commented 5 years ago

I keep saying that the file structure should be human-readable. Even if no one should ever read it, there's no point in randomizing everything, because at some point SOMEONE will want to read it for whatever reason, and it's not like it hurts us to do this.

morrowc commented 5 years ago

At Fri, 15 Feb 2019 15:36:00 +0000 (UTC), gidoBOSSftw5731 notifications@github.com wrote:

> I keep saying that the file structure should be human-readable. Even if no one should ever read it, there's no point in randomizing everything, because at some point SOMEONE will want to read it for whatever reason, and it's not like it hurts us to do this.

that is wrong-headed. There are machines for this, and tooling to find files.