@bzz opened this issue 6 years ago
@bzz It must be 3TB, not 2.4. Either something went wrong during the download or the index is missing some repos. We measured 3TB from our local HDFS copy. This is critical to figure out, with only a few weeks left before the paper presentation...
We could add a pga status subcommand, which shows the current version of the index and how many files have been successfully downloaded and match the md5 hash.
Would that help, @bzz?
@vmarkovtsev Or there might just have been network issues on my side while downloading.
Sounds like a great idea, @campoy!
So, the approach would be to add pga status <path>, which would verify that all the siva files in path are present, have the right size, and maybe eventually the right md5? We could also publish an md5 of the list of all md5s to speed up a high-level integrity check.
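A minimal sketch of what such a check could look like, assuming an md5list.txt in md5sum format (<md5>  <relative path>) is published for the dataset (the file name is hypothetical):

cd /data/pga/siva/latest
# 1. are all expected files present?
awk '{print $2}' md5list.txt | while read -r f; do [ -f "$f" ] || echo "missing: $f"; done
# 2. do the files that are present match their md5? (slow: reads every file)
md5sum -c --quiet md5list.txt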
If we agree that it's the best approach, I'll be happy to look into submitting a patch for this early next week (after getting back to :flag-es: from a short vacation).
It seems that #58 should make pga get safe to run multiple times. If that is the case, please let me know, and I would be happy to try it and report back the results on the full PGA.
Ok, so I've been thinking about this and I found a problem.
We do not keep track of how the list of downloaded .siva files was obtained, and it is probably impossible to do so, since on top of pga get you could also (by @vmarkovtsev's request) do something like pga list ... | grep ... | awk ... | pga get -i.
So all we can tell is whether the files we have in a directory are valid or not. We can't tell whether they're up to date, since a new version of a file will be completely independent.
Say I download all repositories under my user name with pga get -u /campoy/.
Then after a week or so one of my repositories becomes popular and goes from 0 to 100 stars, which means it will eventually be added to Public Git Archive.
When I run pga status on the directory containing the .siva files I downloaded with pga get -u /campoy/, it should say that they're all OK ... but that's pretty much all it can do.
Here are a couple of possible improvements to Public Git Archive and pga itself that could help with the limitation explained above.
An alternative to this is to simply read all the .siva files, find the repositories they correspond to, and obtain the list of .siva files for those repos in the newer version.
Another option is to stop referring to .siva files by name, and instead only allow downloading files given the corresponding repository. Unfortunately, this doesn't solve the main problem of a new repository being added to Public Git Archive, but it is a prerequisite for the following improvement.
The main idea is to keep the pga command that gave us the current dataset. Let's say I want to download all of the repositories containing some Go under the google org: pga get -l go -u /google/.
We would somehow (format TBD) store those filters in some file in the destination directory.
This would then allow us to provide a pga update or pga status that shows whether repositories have been added to or removed from the result of our query, and whether new siva files should be downloaded.
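Just to illustrate the idea (the stored-query file and its name are hypothetical, not existing pga features, and I'm assuming pga list output can be filtered with grep as in the pipeline mentioned above):

# store the query that produced this directory next to the data
echo 'pga list | grep "/google/"' > /data/pga/.pga-query
pga list | grep "/google/" > /data/pga/files-v1.txt      # at download time
# ... after the index has been updated ...
pga list | grep "/google/" > /data/pga/files-v2.txt
diff /data/pga/files-v1.txt /data/pga/files-v2.txt        # repositories added/removed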
While I think providing the feature described in point 3 is pretty cool, I'm not sure there's an actual need for it.
What do you all think?
@campoy I have the impression that we are reinventing a huge wheel here, but I cannot point to any particular prior art.
I note that the Torrent protocol could be handy here: same speed as HTTPS, optional offloading to other mirrors, checksum checks, partial downloads, tracking of the origin. It does not solve the problem of saving the selector query, though.
This is getting really serious; I agree that it seems a bit too much at this point.
I think that, for now, a simpler solution can still be useful: a status command that would only verify the consistency between the current index and the location given by -o.
That means that pga status has to behave the same way as pga get does, accepting the -i flag, to verify consistency on a subset of repositories.
On the case of "Newer versions of Public Git Archive ...": I would assume that a different version of PGA would have a different index file, so status could accept a particular "version", or however generations of PGA updates are called.
I have put together a very simple status based on the recent md5 work by @campoy - I will try it on the full PGA tomorrow and see how useful it is.
Agreed, any solution more complex than that sounds like overengineering at this point of the project.
"This allows us to then provide a pga update" - sounds interesting, but I would suggest keeping the discussion of pga update in a separate issue.
HDFS, as part of its protocol, only provides an md5 of rolling 512-byte crc32s :/ so there is no way to get the md5 sum of a whole file without streaming it to the client.
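For example, the whole-file md5 can only be obtained by streaming the file out of HDFS and hashing it client-side (the path below is just an example):

hdfs dfs -cat hdfs://hdfs-namenode/pga/siva/latest/d6/d6ded0e91bcdd2a8f7a221f6a5552a33fe545359.siva | md5sum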
Right now pga status -j64 shows an ETA of 9h, which looks accurate based on the observed network throughput of ~70MiB/s.
Yeah, I saw that HDFS didn't provide md5, so I decided to implement it by reading the whole file. Far from perfect, but better than having corrupt files, and I'm going with the assumption that the connection to HDFS is in general much faster than the one to our servers.
The whole status command would make more sense if we didn't download a new version of the index by default. But still, once you've upgraded to the new index (let's say via pga upgrade), how do you handle corrupted files? Maybe the best solution is not to have this tool, but rather to allow people to download the files directly (over HTTPS, FTP, or Torrent - good point, @vmarkovtsev).
If the download fails for any reason they might need to re-download the whole dataset, though. But I'm thinking it might be a problem worth having in exchange for not getting into the business of building downloaders.
There seem to be problems with identifying the version of the downloaded dataset. The dataset should probably be versioned as a whole, and it would be a good idea for pga get to store the version (e.g. /data/pga/VERSION or /data/pga/MANIFEST).
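A sketch of what pga get could write there; the fields and format are hypothetical, just to make the idea concrete:

cat > /data/pga/MANIFEST <<'EOF'
pga-version: latest
index: latest.csv.gz
index-md5: <md5 of the index used for this download>
command: pga get -j 96 -o hdfs://hdfs-namenode:8020/pga
downloaded-at: 2018-06-13T00:00:00Z
EOF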
For local downloads, maybe we want to provide rsync access? rsync comes with built-in integrity checks, safe cancel and resume, incremental updates, and include/exclude by pattern or by specific file lists (--files-from).
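A sketch of what that could look like from the user's side; the server address is made up, only the rsync options are real, and wanted.txt would hold .siva paths relative to the source directory:

rsync -av --partial --files-from=wanted.txt \
      rsync://pga.example.org/pga/siva/latest/ /data/pga/siva/latest/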
For any of the methods, I think we should not download files in-place (https://github.com/src-d/datasets/issues/39), but use a temporary file and move it later. This ensures (or at least makes it very unlikely) that no corrupt files are left behind: files are either present and correct, or absent from the final dataset. By the way, this is what rsync and HDFS do.
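A rough sketch of the pattern (the helper name and URL are illustrative, not pga's actual implementation):

get_siva() {
  # download to a temporary name and rename only on success, so the final
  # path never holds a truncated file; rm cleans up after a failed attempt
  local url="$1" out="$2"
  curl -fsS -o "${out}.tmp" "$url" && mv "${out}.tmp" "$out" || rm -f "${out}.tmp"
}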
I agree with using rsync, but I have never worked with it beyond the user level.
Do you think the engineering effort to set it up is affordable, @smola?
@campoy I think @rporres uses it for other internal stuff, so there should be no problem on the server side. We'd probably need the pga tool to be able to generate the right rsync calls depending on the index. I think it is feasible, but we should talk with @mcuadros to get a priority for this and then plan for it.
You can use rsync over ssh to get files from the pga server. No problem at all.
In some previous discussions I've said that BitTorrent might not be able to handle this number of files, but after a quick test, I think it actually can: ctorrent can create a torrent file for 400k files in less than one hour, and transmission-gtk can open the torrent just fine.
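For reference, creating such a torrent is a one-liner; the announce URL is a placeholder, and flags may vary slightly between ctorrent builds:

ctorrent -t -u "http://tracker.example.org/announce" -s pga.torrent /data/pga/siva/latest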
Ok, the second pass of pga get has finished, but the size in HDFS did not change much:
2.4 T hdfs://hdfs-namenode/pga/siva/latest
I made the mistake of logging STDERR and STDOUT at the same time, which makes the logs unreadable, as it all looks like this:
-----] 0.00% 27s^M 9 / 257401 [>----------------------------------------------------------------------------------------------------------------------------------------------------] 0.00% 26s^M 10 / 257401 [>---------------------------------------------------------------------------------------------------------------------------------------------------] 0.00% 24s^M 11 / 257401 [>---------------------------------------------------------------------------------------------------------------------------------------------------] 0.00% 23s^M 12 / 257401 [>---------------------------------------------------------------------------------------------------------------------------------------------------] 0.00% 22s^M 13 / 257401 [>---------------------------------------------------------------------------------------------------------------------------------------------------] 0.01% 22s^M 14 / 257401 [>---------------------------------------------------------------------------------------------------------------------------------------------------] 0.01% 21s^M 15 / 257401 [>---------------------------------------------------------------------------------------------------------------------------------------------------] 0.01% 21s^M 16 / 257401 [>---------------------------------------------------------------------------------------------------------------------------------------------------] 0.01% 21s^M 17 / 257401 [>---------------------------------------------------------------------------------------------------------------------------------------------------] 0.01% 20s^M 18 / 257401 [>---------------------------------------------------------------------------------------------------------------------------------------------------] 0.01% 20s^M 19 / 257401 [>---------------------------------------------------------------------------------------------------------------------------------------------------] 0.01% 20s^M 20 / 257401 [>---------------------------------------------------------------------------------------------------------------------------------------------------] 0.01% 19s^M 21 / 257401 [>---------------------------------------------------------------------------------------------------------------------------------------------------] 0.01% 19s^M 22 / 257401 [>------------------------------------------------------------------------------------------------------------------#
The command I used was:
pga get -j 96 -o hdfs://hdfs-namenode:8020/pga 2>&1 | tee -a pga-get-2.log
but
grep -c "could not get" pga-get-2.log
1579
which usually means a file had already been downloaded before, e.g.:
could not get siva/latest/d6/d6ded0e91bcdd2a8f7a221f6a5552a33fe545359.siva: could not create /pga/siva/latest/d6/d6ded0e91bcdd2a8f7a221f6a5552a33fe545359.siva: create /pga/siva/latest/d6/d6ded0e91bcdd2a8f7a221f6a5552a33fe545359.siva: file already exists
but it seems there should be many more :/
$ zgrep "\.siva" ~/.pga/latest.csv.gz | sort | uniq | wc -l
181481
$ hdfs dfs -ls -R hdfs://hdfs-namenode/pga/siva/latest | grep "\.siva$" | sort | uniq | wc -l
239807
Will run it again to capture only STDERR
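Something like this should do it, keeping the progress bar on the terminal and only the log lines in the file (assuming pga writes its log messages to STDERR):

pga get -j 96 -o hdfs://hdfs-namenode:8020/pga 2> pga-get-3.log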
@bzz I would collect the list of file names with sizes and compare it to the list retrieved from the server (you can ask Rafa to run any listing command on the server).
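One way to do that comparison, as a sketch; the server-side path is a placeholder, the listing there would have to be run by someone with shell access, and the column positions assume the standard hdfs dfs -ls output format:

# HDFS side: basename and size of every siva file
hdfs dfs -ls -R hdfs://hdfs-namenode/pga/siva/latest | grep "\.siva$" \
  | awk '{n=split($NF,p,"/"); print p[n], $5}' | sort > hdfs-files.txt
# server side (run there): find /path/to/siva/latest -name '*.siva' -printf '%f %s\n' | sort > server-files.txt
diff hdfs-files.txt server-files.txt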
@vmarkovtsev will do that.
Meanwhile, I have updated the .siva file count above, and it is very strange 😕. Is that something expected, or am I doing something stupid? What do you think?
The number of lines in the index matches; the number of siva files should be around 270k. This means 30k were not indexed, which is very, very bad.
So before moving forward, we need to index the siva files which were discarded.
Ok, the .siva count for the index above was wrong; here are the updated numbers:
$ zgrep -o "[0-9a-z]*\.siva" ~/.pga/latest.csv.gz | sort | uniq | wc -l
239807
$ hdfs dfs -ls -R hdfs://hdfs-namenode/pga/siva/latest | grep -c "\.siva$"
239807
~that means that after 2 runs of pga get there are still 17594 .siva files missing~
@vmarkovtsev One more try 🥇 and now it seems that all the files from the index were actually downloaded on the second pga get round!
Should we document this result as the current "best practice" for a first approximation of download integrity verification?
As a user, though, I would very much prefer something more automated that also includes some archive integrity verification, etc.
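For instance (just a sketch, assuming a <name>.siva.md5 file containing the bare hex digest is published next to each siva file; the base URL is a placeholder):

base="https://pga.example.org/siva/latest"
f="d6/d6ded0e91bcdd2a8f7a221f6a5552a33fe545359.siva"
# build an md5sum check line from the published digest and verify the local copy
echo "$(curl -fsS "$base/$f.md5")  /data/pga/siva/latest/$f" | md5sum -c -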
@bzz This is great news! I am so happy that you simply failed to download them twice and that this is not an indexing issue!
I keep on thinking about having this dataset in a git repository. That would allow us to use git and all the tools around it rather than reimplementing them from scratch.
If anyone in engineering wants to play with the idea, that'd be awesome!
"I keep on thinking about having this dataset in a git repository." - @mcuadros also proposed something similar in the past.
From the logs of the 3rd pga get attempt:
time="2018-06-13T00:19:34Z" level=warning msg="could not check md5 hashes for siva/latest/2f/2f8ec84fb873345973c7671f0bf455687b72b982.siva, comparing timestamps instead: could not fetch hash at siva/latest/2f/2f8ec84fb873345973c7671f0bf455687b72b982.siva.md5: 404 Not Found"
time="2018-06-13T00:19:38Z" level=warning msg="could not check md5 hashes for siva/latest/3b/3be46690f46f27f3e671de9a615d24d1554b9991.siva, comparing timestamps instead: could not fetch hash at siva/latest/3b/3be46690f46f27f3e671de9a615d24d1554b9991.siva.md5: 404 Not Found"
time="2018-06-13T00:19:39Z" level=warning msg="could not check md5 hashes for siva/latest/77/77138002cad60ffde2de45c01a7d72275cdc7e9a.siva, comparing timestamps instead: could not fetch hash at siva/latest/77/77138002cad60ffde2de45c01a7d72275cdc7e9a.siva.md5: 404 Not Found"
time="2018-06-13T00:19:40Z" level=warning msg="could not check md5 hashes for siva/latest/4c/4cbfd46e3d1020bfe7a87bc6dc8b2952100a7c16.siva, comparing timestamps instead: could not fetch hash at siva/latest/4c/4cbfd46e3d1020bfe7a87bc6dc8b2952100a7c16.siva.md5: 404 Not Found"
time="2018-06-13T00:19:43Z" level=warning msg="could not check md5 hashes for siva/latest/d6/d6c7789c349d3f6fe5eb5ce9ff0a9ee1901238cd.siva, comparing timestamps instead: could not fetch hash at siva/latest/d6/d6c7789c349d3f6fe5eb5ce9ff0a9ee1901238cd.siva.md5: 404 Not Found"
time="2018-06-13T00:19:44Z" level=warning msg="could not check md5 hashes for siva/latest/78/78f5c3123dde27170c47ad9e5c95f9f507d550cb.siva, comparing timestamps instead: could not fetch hash at siva/latest/78/78f5c3123dde27170c47ad9e5c95f9f507d550cb.siva.md5: 404 Not Found"
After another run of the full PGA download:
More details in https://github.com/src-d/datasets/pull/69#issuecomment-398053657
Early next week I will try 🔥 https://github.com/smola/checksum-spark by @smola and report which of the 2 runs is more complete.
@bzz I ran it yesterday; the generated checksums for the /pga2 directory are in the directory itself. I did not have time to check the results though. Note that with just 3 workers it takes a few hours to run.
I've verified that the /pga2 directory matches the reference md5s provided by @rporres, but there are 42662 missing files. Those that are present are correct.
It seems part of the missing files are actually not in the index, so it is expected that they are not in the downloaded copy.
"Those that are present are correct." - 🎉 🙇 to #69
"42662 missing files. It seems part of the missing files are actually not in the index, so it is expected they are not in the downloaded copy." - This was something @vmarkovtsev was worried about. Are all 42662 missing from the index?
@bzz Not all of them are missing from the index, my guess is that part of them were missing just because of a crash in the download process?
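A quick way to check which of the missing files are in the index at all, assuming the reference md5 list is in md5sum format (reference-md5s.txt and the scratch file names are placeholders):

awk '{print $2}' reference-md5s.txt | grep -o "[0-9a-z]*\.siva$" | sort -u > reference-sivas.txt
zgrep -o "[0-9a-z]*\.siva" ~/.pga/latest.csv.gz | sort -u > index-sivas.txt
comm -23 reference-sivas.txt index-sivas.txt    # on the server but not in the index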
I have downloaded the full PGA to HDFS using
pga get -v ...
and have full logs of the process. It took a day, eventually finished, and I can see that it's 2.4T.
Question: how do I make sure that nothing is missing?
What would be the simplest way to verify the consistency and completeness of the results of pga get? Options from the top of my head include:
hdfs dfs -ls -R hdfs://hdfs-namenode/pga/ | grep "\.siva$" | wc -l = 239807 and
pga list | wc -l = 181481
but that compares rooted repositories vs actual repositories :/ Would appreciate any recommendations.
@campoy I guess this might be something worth documenting eventually, as other users might have the same question. Would be happy to submit a PR.