Closed: Colseph closed this issue 5 years ago
good idea. i'll look at it when i have some time
I like it, but we probably need a way to choose just a few servers instead of spanning the whole list.
yeah, that'd probably be a good plan. might try something like having it pick random ones, then add them to a 'selected.txt' if one hasn't already been generated. the number of randomly selected servers can be changed in the config section, so users can choose just one, or even all (what's a good default number?). then, if they want to get new servers, they just delete 'selected.txt' and it'll grab new ones. i've got work, but i'll test it tomorrow when i have time, if someone doesn't beat me to it.
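a minimal sketch of the idea, assuming the list lives in servers.txt and a NUM_SERVERS config variable (both names hypothetical):

```bash
NUM_SERVERS=5   # default count; would live in the config section

# only generate selected.txt if one hasn't already been made
if [ ! -f selected.txt ]; then
  grep -v '^#' servers.txt | grep -v '^$' | shuf -n "${NUM_SERVERS}" > selected.txt
fi
# to get new servers, just delete selected.txt and it grabs fresh ones
```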
I think the best case would be to pick different IPs based on the city... it wouldn't make sense to just scan Moscow IPs? ... here are the top 5 cities the IPs I loaded are from .... I can add more detail to another file so we can iterate through random IPs from the 5 cities...
Moscow, Saint Petersburg, Krasnoyarsk, Yekaterinburg, Krasnodar
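a sketch of how the per-city pick might work, assuming a hypothetical two-column `servers_by_city.txt` of `<ip> <city>` lines (multi-word cities hyphenated):

```bash
# grab one random IP from each of the top 5 cities
for city in Moscow Saint-Petersburg Krasnoyarsk Yekaterinburg Krasnodar; do
  awk -v c="$city" '$2 == c {print $1}' servers_by_city.txt | shuf -n 1
done
```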
I have a rudimentary script (legit just a while loop) that I've been using, with `head -n 500 servers.txt > serverstoping.txt` to select servers (not actually at random; head just takes the first 500). I presume this will NOT fit your quality standards and am open to improving it.
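a truly random pick is a one-word swap, if it helps:

```bash
# head takes the first 500 lines; shuf takes 500 at random
shuf -n 500 servers.txt > serverstoping.txt
# or, where shuf isn't installed:
sort -R servers.txt | head -n 500 > serverstoping.txt
```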
I wonder if it's possible to have a public document that the script could reach out to and grab an IP that has the least allocations. Load balancing, in principle.
it's possible, assuming we won't ever use all of the IPs. just pull/push the IPs in use from something in a cloud, or add a check-in/check-out system, and you'd need to prove that you're actually using it...
could use curl w/ PHP & a SQL db on a VPS or something. initially the script would curl & request IPs, and the server would return the ones w/ the least amount of users. then every loop (could use timestamps to make sure it doesn't run too often and overload the server, i.e. only run if time is greater than..) it would send the list of IPs it's using along w/ an id. if the server doesn't get a POST w/ that id containing a certain IP for x time, it sets the user counter back by 1? so we'd need each script to have a unique id, like a hash or something.. could generate a hash from /dev/urandom input, then save it to id.txt or something so it keeps the same id?
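a rough sketch of the client side of that (the endpoint URL and field names here are completely made up):

```bash
# generate the unique id once and reuse it across runs
if [ ! -f id.txt ]; then
  head -c 512 /dev/urandom | md5sum | awk '{print $1}' > id.txt
fi
ID=$(cat id.txt)

# ask the (hypothetical) server for the 10 least-used IPs
curl -s "https://example.com/checkout.php?id=${ID}&count=10" > selected.txt

# inside the main loop: check in, but at most every 5 minutes
LAST_POST=${LAST_POST:-0}
if [ "$(date +%s)" -gt "$(( LAST_POST + 300 ))" ]; then
  curl -s -d "id=${ID}" --data-urlencode "ips@selected.txt" \
    https://example.com/checkin.php
  LAST_POST=$(date +%s)
fi
```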
but we could have the problem of the server getting DDoS'd, and someone could prevent an NTP server from being traced by spoofing POSTs containing its IP and a random id.
idk, either way it would be hard to actually prove IPs are being used
Yeah, security is always going to be an issue. If we restrict within a login, then those have to be managed, and down the rabbit hole we go.
Maybe we can just have another thread where we respond with the IPs we are working on and ask everyone to just grab randomly. For example, I just went and grabbed ~2500-2600 on the list.
I hate to say it, but decentralization is key, and the only place I know this exists is blockchain.....
Wait, can't we just use one of those real-time document editing apps that runs over IPFS? that way a bunch of people can all claim IPs at the same time, and only those with the IPNS name or hash will have access
Also, could make a link-editable Google spreadsheet. gross.
That would be too close to being centralized, I thought
The IPFS idea isn't; the spreadsheet is. But honestly, it's just a list of IPs, it's kinda hard for anyone to submit a takedown reason to Google. That, and Google isn't on too great terms with Russia.
added an option for the user to choose how many random servers to put in selected_servers.txt, here's the merge-ready branch. i also added a tmux wrapper that i can remove if you want me to do a pull request (it's more of an experiment, really)
just saw morrowc's pull request uses `sort -R` instead of `shuf`, which i feel is more widely installed by default across distros, and probably a better choice.
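both give a random ordering, so they're interchangeable here; one subtle difference worth knowing is that `sort -R` orders lines by a random hash of their contents (so duplicate lines end up adjacent), while `shuf` is a true shuffle:

```bash
# equivalent random picks of ${PROBES} servers
sort -R servers.txt | head -n "${PROBES}"
shuf -n "${PROBES}" servers.txt
```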
i did come across a somewhat minor issue with the ITER variable that is present in both @morrowc's and my code
for your filesystem problem(s) I think you may consider something like: in the workingdir create 2 or 3 levels of directory: a-z0-9 / a-z0-9 / a-z0-9.
this gives quite a spread of directories to fill with files, is easy to create on-demand (loop of loop of loop), and spreads the files created over 36×36×36 = 46,656 possible end directories. you can mechanically create the path in the working while-loop easily as well, using something like:

```
$ echo $(dd if=/dev/urandom bs=512 count=1 2>&1) | md5sum | tail -1 | awk '{print $1}'
3f4a2b80252e27c333f1983557a6bc59
```

to get 'random' enough data to build the directory/path. expanding a bit:

```
$ echo $(dd if=/dev/urandom bs=512 count=1 2>&1) | md5sum | tail -1 | awk '{print $1}' | cut -b1,2,3 --output-delimiter=/
0/4/5
```

you get a new directory to put on the file each run...
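putting the two pieces together, the loop body might look something like this (a sketch; WORKING_DIR, TIME, and SERVER stand in for the script's own variables):

```bash
# hash some urandom, keep 3 hex chars, turn them into a 3-level path
DIR=$(echo $(dd if=/dev/urandom bs=512 count=1 2>&1) | md5sum | awk '{print $1}' | cut -b1,2,3 --output-delimiter=/)
# note: md5 hex only ever hits the 0-9a-f subset of the a-z0-9 directories
mkdir -p "${WORKING_DIR}/${DIR}"
# drop this pass's data file into the freshly picked directory
echo "trace output would go here" > "${WORKING_DIR}/${DIR}/${TIME}.${SERVER}.new"
```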
@morrowc I think it would be better to have it be time-based (unix or a more standard year-month-day-hour) so you know what you're looking at and you can debug issues etc. (remember the human!)
humans are fallible, depend on machines for all of this. really, at the scale of the file count being created, a 'person' is never going to look at this. you'll always have a machine do the initial 'find me the file (with find)', then look: "Oh, this is why fileX fails loading into the next stage"
```bash
_checkPath() {
  for one in {a..z} $(seq 0 9); do
    for two in {a..z} $(seq 0 9); do
      for three in {a..z} $(seq 0 9); do
        mkdir -p ${WORKING_DIR}/${one}/${two}/${three}
      done
    done
  done
  _log date "[_checkPath] checking directory $1"
}
```
makes the directories properly. (I'll send a PR with this bit)
PR #17 has the above change and the random directory bits.
i think i got a fix for the ITER & COMP_ITER variables! **i still need to actually test it IN the script** but if it works, we'll get persistent ITER and COMP_ITER values. only con (i can think of) is if people want a fresh start, they need to delete the save files.
@LogoiLab if you don't want the entire iteration mumbo jumbo, i can make a pull request w/ just this, so we have persisting ITER and COMP_ITER variables across script crashes or stop/starts
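the gist of the persistence idea, for anyone following along (a sketch; the .sav filenames are hypothetical):

```bash
# restore the counters from the save files if they exist, else start at zero
ITER=0; COMP_ITER=0
[ -f iter.sav ] && ITER=$(cat iter.sav)
[ -f comp_iter.sav ] && COMP_ITER=$(cat comp_iter.sav)

# ... main loop bumps the counters, then writes them back every pass ...
echo "${ITER}" > iter.sav
echo "${COMP_ITER}" > comp_iter.sav
# deleting the .sav files gives you the fresh start mentioned above
```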
So this is solved by PR #15 and PR #16? Can we close this, since we merged?
i guess
**claiming servers**
the problem: we wanted to make sure every server gets traced, or as close to it as possible.
the current solution:
```bash
grep -v '^#' ${SERVERS} | grep -v '^$' | /usr/bin/sort -R | head -${PROBES}
```
now to run the script, you need to supply the number of servers you want:
```bash
bash ./ping_russia.sh 10 # if i wanted 10 random servers from servers.txt
```
_claiming servers, problem solved_
NOTE: (it's random every time, so if you stop the script then start it again, you might have completely different servers)
**ITER/COMP_ITER variables**
the original problem: the counter variables are global (not only literally, but also in that they are shared by all the servers we iterate over).
so i run the script, it scans server1 and you get `0.1234567.old` and `0.1234579.new`. ok, no problem.. yet.
now it scans server2 and you get `1.1234597.old` and `1.1234607.new`.
wait.. server2 has an ITER val of 1 when this is the first time it's been scanned.. there's the problem. also, when you tar, it resets the ITER back to zero. and since it's shared, they all go back to zero.
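to make the shared-counter issue concrete, here's one hypothetical way to give each server its own counter (just a sketch, not what's in any PR; needs bash 4+ for associative arrays):

```bash
declare -A ITER   # one counter per server instead of one global

for SERVER in $(grep -v '^#' selected_servers.txt); do
  ITER[$SERVER]=$(( ${ITER[$SERVER]:--1} + 1 ))   # each server starts at 0
  TIME=$(date +%s)
  echo "would write ${ITER[$SERVER]}.${TIME}.new for ${SERVER}"
done
```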
the current fix: make a very large number of directories for the data file pairs.
1. make the directories:
```bash
for one in {a..z} $(seq 0 9); do
  for two in {a..z} $(seq 0 9); do
    for three in {a..z} $(seq 0 9); do
      mkdir -p $1/${one}/${two}/${three}
...
```
2. then we randomly pick a folder to put the data in:
```bash
DIR=$(echo $(dd if=/dev/urandom bs=512 count=1 2>&1) | md5sum | tail -1 | awk '{print $1}' | cut -b1,2,3 --output-delimiter=/)
echo ${DIR}
```
i'll be honest, the whole `DIR=$(echo $(dd if=/dev/urandom bs=512 count=1 2>&1) | md5sum | tail -1 | awk '{print $1}' | cut -b1,2,3 --output-delimiter=/)` line is pretty darn cool
as for the reasoning: @morrowc will have to shed some light on this. if i understand correctly, this was to stop clobbering, except i don't think clobbering was really an issue to begin with. the `$TIME` variable generated by `date +%s` stopped that. the problem was the `$ITER` server variables interfering with each other (because they are the same variable). but to be honest, that was more my OCD than an actual problem, since you're going to be using `$TIME` along with the timestamps for parsing etc..
_ITER/COMP_ITER variables, problem solved (as it wasn't really a big problem to begin with)_
i'll go ahead and close this issue, since "server claiming" has been implemented, and that's why i opened it.
however there are some things you might want to look at.
**organization**
_checkPath makes subdirs `a-z 0-9` in every path that is checked, *3 levels deep*. _that's like 47,990 folders_ (according to `find . -type d | wc -l`), _and that's only counting from one of the two directories that are checked_. that's ALSO ~188M, which means **as the script is now**, you'll just be making a bunch of tarballs full of empty folders and one `.new` and one `.old` data file.
also, the tar archive is created with the name of the current server in the loop, which is fine (that's how mine works). but since the data isn't organized by server, you'll get a tarball called something like `2.2342343.123.456.789.tar.xz` when it also has files from `987.654.321` and `111.111.111` etc...
this means when you go to parse, you'll need to extract the tarball (which you'd do anyways), but then to get the server, the parsing script needs to read the contents of each data file to know which server it belongs to (which might actually be a good plan anyways, since scripts might mess up etc..). so it might not be a huge issue.
anyways, good luck.
> **ITER/COMP_ITER variables** the original problem: the counter variables are global (not only literally but also in that they are shared by all the servers we iterate over) … when you tar, it resets the ITER back to zero. and since it's shared, they all go back to zero.
Perhaps the question to ask here is: "What is the ITER/COMP_ITER supposed to provide you?" Don't add things to your data you don't know what to do with, or manage.
If you want to make sure the traceroute data has a known time sequence, then add the time (a unix timestamp, for instance) to the filename. The actual number of times you've been over any particular IP (the iteration number, or number of iterations) is not important in the filename, when you can `find . -name '*ip*' | wc -l` or similar... Or, really: "Parse the files into a database, deal with the stats from there."
> the current fix: make a very large number of directories for the data file pairs … if i understand correctly, this was to stop clobbering. except i don't think clobbering was really an issue to begin with …
The problem is not clobbering, it's that very few filesystems behave well when you put large numbers of files into a directory. Flat filesystems are never a good idea if you need to scan them later (tar them, list them, stat all the files, etc). All systems that generate lots (or could generate lots) of files split the files out over many, many directories (this exact same hashed mechanism), because filesystem performance degrades significantly as the number of files in a directory grows large.
> **organization** _checkPath makes subdirs a-z 0-9 in every path that is checked, 3 levels deep … you'll just be making a bunch of tarballs full of empty folders and one .new and one .old data file.
the empty directories aren't important, you care about the files; you'll iterate over the files in the filesystem and pull data from them as required. If you want less hash/splay, then just make 2 levels, not 3, but really that isn't important here. What is important is not killing your system when you fill a directory with 40k files/etc.
> the tar archive is created with the name of the current server in the loop … the parsing script needs to read the contents of each data file to know which server it belongs to …
yes, filenames are immaterial, save (perhaps) the timestamp. everything else you need is, in fact, in the file itself. People will (and should) never have to see the files; stop making names that are significant for anything except (perhaps) the timestamp of creation.
i had a whole write-up, nicely formatted.. but honestly idc, so sure. gg.
i would recommend testing and debugging before you PR tho. as for me, i think imma play some nekopara :P
I keep on saying that the file structure should be human-readable. Even if no one should ever read it, there's no point in needing to randomize everything, because at some point SOMEONE will want to read it for whatever reason, and it's not like it hurts us to do this
At Fri, 15 Feb 2019 15:36:00 +0000 (UTC), gidoBOSSftw5731 notifications@github.com wrote:
> I keep on saying that the filestructure should be human-readable. …
that is wrong-headed. There are machines for this, and tooling to find files.
saw this mentioned on reddit by /u/turn-down-for-what: what if we had the script iterate through the servers? the structure would look something like
and for servers.txt, it'll only use lines that don't have a '#' in them, so you can add comments (i.e. the pools and explanation, and the # can be anywhere in the line). i get that it wouldn't be as many traces per minute per server, but it might get a better overall image? and if we had enough people running it, i feel like it'd have pretty good coverage.
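the comment rule would presumably be a one-liner; a sketch, assuming the '#-anywhere-in-the-line' behavior described above:

```bash
# keep only lines with no '#' anywhere, and skip blanks too
grep -v '#' servers.txt | grep -v '^$' | while read -r SERVER; do
  echo "would trace ${SERVER}"   # stand-in for the real trace call
done
```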
pros:
cons:
i've got it implemented and it seems to be working fine. i'll get it up on my fork and you can see if you like it; if so, i can submit a pull request
might be worth only having some people run it this way? idk