src-d / datasets

source{d} datasets ("big code") for source code analysis and machine learning on source code

Question on full PGA integrity verification #53

Open bzz opened 6 years ago

bzz commented 6 years ago

I have downloaded full PGA to HDFS using pga get -v ... and have full logs of the process.

It took a day and eventually finished, and I can see that it's

$ hdfs dfs -du -s -h hdfs://hdfs-namenode/pga
2.4 T  hdfs://hdfs-namenode/pga

Question: how do I make sure that nothing is missing?

What would be the simplest way to verify the consistency and completeness of the results of pga get? Options off the top of my head include

Would appreciate any recommendations.

@campoy I guess this might be something worth documenting eventually, as other users might have the same question. I would be happy to submit a PR.

vmarkovtsev commented 6 years ago

@bzz It must be 3TB, not 2.4. Either something went wrong during the download or the index is missing some repos. We measured 3TB on our local HDFS copy. This is critical to find out a few weeks before the paper presentation...

campoy commented 6 years ago

We could add a pga status subcommand that shows the current version of the index and how many files have been successfully downloaded and match their md5 hashes.

Would that help, @bzz ?

bzz commented 6 years ago

@vmarkovtsev or there were just network issues while downloading on my side.

Sounds like a great idea, @campoy!

So, the approach would be to add pga status <path>, which would verify that all the .siva files in path are present and have the right size, and maybe eventually the right md5? We could also compute an md5 of the list of all md5s to speed up a high-level integrity check.
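
A minimal sketch of that "md5 of all md5s" idea (the paths are illustrative; it assumes the dataset lives under /data/pga/siva/latest and that GNU md5sum is available):

# one top-level fingerprint for a quick high-level comparison
cd /data/pga/siva/latest
find . -name '*.siva' -print0 | sort -z | xargs -0 md5sum > all.md5
md5sum all.md5   # compare this single hash against one published alongside the index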

If we agree that it's the best approach, I'll be happy to look into submitting a patch for this early next week (after getting back to :flag-es: from a short vacation).

bzz commented 6 years ago

It seems that #58 should make pga get safe to run multiple times. If that is the case, please let me know and I would be happy to try it and report back the results on the full PGA.

campoy commented 6 years ago

Ok, so I've been thinking about this and I found a problem.

Current limitation

We do not keep track of how the list of downloaded .siva files was obtained, and it is probably impossible to do so, since on top of pga get you could also (by @vmarkovtsev's request) do something like pga list ... | grep ... | awk ... | pga get -i.

Side effect of the limitation

So all we can tell is whether the files we have in a directory are valid or not. We can't tell whether they're up to date, since the new version of a file will be completely independent.

Say I download all repositories under my user name with pga get -u /campoy/. Then after a week or so one of my repositories becomes popular and goes from 0 to 100 stars, which means it will eventually be added to Public Git Archive.

When I run pga status on the directory containing the .siva files I downloaded with pga get -u /campoy/, it should say that they're all OK ... but that's pretty much all it can do.

Possible solutions

These are a couple of possible improvements to Public Git Archive and pga itself that could help with the limitation explained above.

  1. Newer versions of Public Git Archive provide a way to find the corresponding file for a filename in a previous version.

An alternative to this is to simply read all the .siva files, find the repositories they correspond to, and obtain the list of .siva files for those repos in the newer version.

  2. Remove the ability to download specific .siva files by name, and instead only allow downloading files given the corresponding repository.

Unfortunately, this doesn't solve the main problem of a new repository being added to Public Git Archive, but it is a prerequisite for the following improvement.

  3. Materialize the query that created the downloaded dataset.

The main idea is to keep the pga command that gave us the current dataset. Let's say I want to download all of the repositories that contain some Go under the google org: pga get -l go -u /google/.

We would somehow (format TBD) store those filters in a file in the destination directory. This allows us to then provide a pga update or pga status that shows whether repositories have been added to or removed from the result of our query, and whether new siva files should be downloaded.
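
Purely as an illustration of idea 3 above (the format is explicitly TBD), the materialized query could be as small as a text file that pga get drops into the destination directory; the file name and contents here are made up:

# hypothetical manifest written by pga get next to the downloaded .siva files
$ cat /data/pga/.pga-manifest
index: latest
query: pga get -l go -u /google/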

Personal opinion

While I think providing the feature described in 3 is pretty cool, I'm not sure there's an actual need for it.

What do you all think?

vmarkovtsev commented 6 years ago

@campoy I have the impression that we are reinventing a huge wheel here, but I cannot name any particular prior art.

I note that the BitTorrent protocol could be handy here: same speed as HTTPS, optional offloading to other mirrors, checksum checks, partial downloads, tracking of the origin. It does not solve the problem of saving the selector query, though.

This is getting really serious; I agree that it seems a bit too much at this point.

bzz commented 6 years ago

I think that for now a simpler solution can still be useful: a status that only verifies the consistency between the current index and the location given by -o.

That means that pga status has to behave the same way as pga get does, accepting the -i flag, to verify consistency on a subset of repositories.
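
For example, verifying only a subset could mirror the pga list | ... | pga get -i idiom quoted above (pga status is the proposed, not yet existing, subcommand, and the grep pattern and -o target are only illustrative):

# hypothetical invocation of the proposed subcommand
pga list | grep '/google/' | pga status -i -o hdfs://hdfs-namenode:8020/pga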

On the case of

Newer versions of Public Git Archive ..

I would assume that a different version of PGA would have a different index file, so status could accept a particular "version", or whatever the generations of PGA updates are called.

I have put together a very simple status based on the recent md5 work by @campoy - I will try it on the full PGA tomorrow and see how useful it is.

Agreed that any solution more complex than that sounds like over-engineering at this point of the project.

This allows us to then provide a pga update

Sounds interesting, but I would suggest keeping the discussion of pga update in a separate issue.

bzz commented 6 years ago

HDFS, as part of its protocol, only provides an md5 of rolling 512-byte crc32s :/ so there is no way to get the md5 sum of the whole file without streaming it to the client.
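
Two ways to see this from the command line (the path is taken from the logs above purely as an example): the first returns HDFS's composite checksum, which is not comparable to a plain md5sum of the content, while the second streams the whole file to the client to compute a real MD5:

# HDFS built-in checksum: an MD5 of per-block MD5s of 512-byte CRC32s
hdfs dfs -checksum hdfs://hdfs-namenode/pga/siva/latest/d6/d6ded0e91bcdd2a8f7a221f6a5552a33fe545359.siva
# a real MD5 requires streaming the full file
hdfs dfs -cat hdfs://hdfs-namenode/pga/siva/latest/d6/d6ded0e91bcdd2a8f7a221f6a5552a33fe545359.siva | md5sum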

Right now pga status -j64 shows a very accurate ETA of 9h, based on the observed network throughput of ~70 MiB/s.

campoy commented 6 years ago

Yeah, I saw that HDFS didn't provide md5, so I decided to implement it by reading the whole file. Far from perfect, but better than having corrupt files, and I'm going with the assumption that the connection to HDFS is in general much faster than the one to our servers.

The whole status command would make more sense if we don't download a new version of the index by default. But still, once you've upgraded to the new index (let's say it's pga upgrade), how do you handle corrupted files? Maybe the best solution is not to have this tool, but rather to allow people to download a file (over HTTPS, FTP, or Torrent - good point @vmarkovtsev).

If the download fails for any reason they might need to re-download the whole dataset, though. But I'm thinking it might be a problem worth having in exchange for not getting into the business of building downloaders.

smola commented 6 years ago

There seem to be problems with identifying the version of the downloaded dataset. The dataset should probably be versioned as a whole, and it would be a good idea for pga get to store that version (e.g. in /data/pga/VERSION or /data/pga/MANIFEST).

For local downloads, maybe we want to provide rsync access? rsync comes with built-in integrity checks, safe-cancel+resume, incremental updates and include/exclude by pattern and specific file lists (--files-from).
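
A rough sketch of what that could look like from the user's side, assuming rsync over ssh as @rporres suggests below; the host, user and remote path are made up, and wanted-siva.txt would be the file subset produced by pga list:

# resumes partial transfers and verifies each file with rsync's rolling checksums
rsync -avz --partial --files-from=wanted-siva.txt \
    user@pga-server:/pga/siva/latest/ /data/pga/siva/latest/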

For any of the methods, I think we should not download files in-place (https://github.com/src-d/datasets/issues/39), but use a temporary file and move it later. This ensures (or at least makes it very unlikely) that no corrupt files end up in the final dataset: files should be either present and correct, or absent. By the way, this is what rsync and HDFS do.
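
A minimal sketch of that temp-file-then-rename pattern for a plain HTTPS download (the URL and destination variables are placeholders):

# write to a temporary name first and rename only on success,
# so a partially downloaded or corrupt file never appears under its final name
curl -fsSL "$SIVA_URL" -o "$DEST.part" && mv "$DEST.part" "$DEST"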

campoy commented 6 years ago

I agree with using rsync, but I have never worked with it beyond the user level. Do you think the engineering effort to set it up is affordable, @smola?

smola commented 6 years ago

@campoy I think @rporres uses it for other internal stuff, so there should be no problem on the server side. We'd probably need to get the pga tool to generate the right rsync calls depending on the index. I think it is feasible, but we should talk with @mcuadros to get a priority for this and then plan for it.

rporres commented 6 years ago

You can use rsync over ssh to get files from pga server. No problem at all.

smola commented 6 years ago

In some previous discussions I've said that BitTorrent might not be able to handle this number of files, but after a quick test, I think it actually can. ctorrent can create a torrent file for 400k files in less than one hour, and transmission-gtk can open the torrent just fine.
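
For reference, creating such a torrent is a one-liner, if I remember ctorrent's flags correctly (the tracker URL and paths are placeholders):

# -t creates a torrent, -u sets the announce URL, -s names the output file
ctorrent -t -u "http://tracker.example.com/announce" -s pga.torrent /data/pga/siva/latest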

bzz commented 6 years ago

Ok, the second pass of pga get has finished, but the size in HDFS did not change much:

2.4 T hdfs://hdfs-namenode/pga/siva/latest

I made the mistake of logging STDERR and STDOUT to the same file, which makes the logs unreadable, as it all looks like this:

-----]   0.00% 27s^M 9 / 257401 [>--------------------------------------------]   0.00% 26s^M 10 / 257401 [>--------------------------------------------]   0.00% 24s^M 11 / 257401 [>--------------------------------------------]   0.00% 23s^M ... (one carriage-return-separated progress-bar frame per downloaded file, all concatenated onto a single line)

The command I used was: pga get -j 96 -o hdfs://hdfs-namenode:8020/pga 2>&1 | tee -a pga-get-2.log

but

grep -c "could not get" pga-get-2.log
1579

which usually means a file had already been downloaded before:

could not get siva/latest/d6/d6ded0e91bcdd2a8f7a221f6a5552a33fe545359.siva: could not create /pga/siva/latest/d6/d6ded0e91bcdd2a8f7a221f6a5552a33fe545359.siva: create /pga/siva/latest/d6/d6ded0e91bcdd2a8f7a221f6a5552a33fe545359.siva: file already exists

but it seems there should be many more :/

$ zgrep "\.siva" ~/.pga/latest.csv.gz | sort | uniq | wc -l
181481

$ hdfs dfs -ls -R hdfs://hdfs-namenode/pga/siva/latest | grep "\.siva$" | sort | uniq | wc -l
239807

Will run it again to capture only STDERR
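
Something like this should keep the log readable next time, assuming the warnings go to STDERR and the progress bar to STDOUT:

# log only STDERR; the progress bar stays on the terminal
pga get -j 96 -o hdfs://hdfs-namenode:8020/pga 2> pga-get-3.log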

vmarkovtsev commented 6 years ago

@bzz I would collect the list of file names with sizes and compare it to the list retrieved from the server (you can ask Rafa to run any listing command on the server).
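
One way to do that comparison, as a sketch (the server-side command would be run on the PGA server, and both listings have to be massaged into the same "name size" shape):

# local HDFS side: name and size of every downloaded .siva file
hdfs dfs -ls -R hdfs://hdfs-namenode/pga/siva/latest | grep "\.siva$" \
    | awk '{n=split($NF,p,"/"); print p[n], $5}' | sort > local-files.txt
# server side, e.g.: find /pga/siva/latest -name '*.siva' -printf '%f %s\n' | sort > server-files.txt
diff server-files.txt local-files.txt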

bzz commented 6 years ago

@vmarkovtsev will do that.

Meanwhile, I have updated the .siva file count above, and it is very strange.

😕 Is that something expected, or am I doing something stupid? What do you guys think?

vmarkovtsev commented 6 years ago

The number of lines in the index matches; the number of siva files should be around 270k. This means 30k were not indexed, and that is very, very bad.

vmarkovtsev commented 6 years ago

So before moving forward, we need to index the siva files which were discarded.

bzz commented 6 years ago

Ok, the .siva count of the index above was wrong, here are the updated numbers:

$ zgrep -o "[0-9a-z]*\.siva" ~/.pga/latest.csv.gz | sort | uniq | wc -l
239807

$ hdfs dfs -ls -R hdfs://hdfs-namenode/pga/siva/latest | grep -c "\.siva$"
239807

~that means that after 2 runs of pga get there are still 17594 .siva files missing~

@vmarkovtsev One more try 🥇 and now it seems that actually, all the files from the index were downloaded on the second pga get round!

Should we document this result as the current "best practice" for a first approximation of download integrity verification?

As a user, I would very much prefer something more automated that also includes some archive integrity verification, etc.

vmarkovtsev commented 6 years ago

@bzz This is great news! I am so happy that you simply failed to download them two times and that this is not an indexing issue!

campoy commented 6 years ago

I keep on thinking about having this dataset in a git repository. That would allow us to use git and all the tools around it rather than reimplementing them from scratch.

If anyone in engineering wants to play with the idea, that'd be awesome!

rporres commented 6 years ago

I keep on thinking about having this dataset in a git repository.

@mcuadros also proposed something similar in the past

bzz commented 6 years ago

From the logs of the 3rd pga get attempt:

time="2018-06-13T00:19:34Z" level=warning msg="could not check md5 hashes for siva/latest/2f/2f8ec84fb873345973c7671f0bf455687b72b982.siva, comparing timestamps instead: could not fetch hash at siva/latest/2f/2f8ec84fb873345973c7671f0bf455687b72b982.siva.md5: 404 Not Found"
time="2018-06-13T00:19:38Z" level=warning msg="could not check md5 hashes for siva/latest/3b/3be46690f46f27f3e671de9a615d24d1554b9991.siva, comparing timestamps instead: could not fetch hash at siva/latest/3b/3be46690f46f27f3e671de9a615d24d1554b9991.siva.md5: 404 Not Found"
time="2018-06-13T00:19:39Z" level=warning msg="could not check md5 hashes for siva/latest/77/77138002cad60ffde2de45c01a7d72275cdc7e9a.siva, comparing timestamps instead: could not fetch hash at siva/latest/77/77138002cad60ffde2de45c01a7d72275cdc7e9a.siva.md5: 404 Not Found"
time="2018-06-13T00:19:40Z" level=warning msg="could not check md5 hashes for siva/latest/4c/4cbfd46e3d1020bfe7a87bc6dc8b2952100a7c16.siva, comparing timestamps instead: could not fetch hash at siva/latest/4c/4cbfd46e3d1020bfe7a87bc6dc8b2952100a7c16.siva.md5: 404 Not Found"
time="2018-06-13T00:19:43Z" level=warning msg="could not check md5 hashes for siva/latest/d6/d6c7789c349d3f6fe5eb5ce9ff0a9ee1901238cd.siva, comparing timestamps instead: could not fetch hash at siva/latest/d6/d6c7789c349d3f6fe5eb5ce9ff0a9ee1901238cd.siva.md5: 404 Not Found"
time="2018-06-13T00:19:44Z" level=warning msg="could not check md5 hashes for siva/latest/78/78f5c3123dde27170c47ad9e5c95f9f507d550cb.siva, comparing timestamps instead: could not fetch hash at siva/latest/78/78f5c3123dde27170c47ad9e5c95f9f507d550cb.siva.md5: 404 Not Found"

bzz commented 6 years ago

After another run of full PGA download:

More details in https://github.com/src-d/datasets/pull/69#issuecomment-398053657

bzz commented 6 years ago

Early next week I will try to 🔥 https://github.com/smola/checksum-spark by @smola and report which of the 2 runs is more complete.

smola commented 6 years ago

@bzz I ran it yesterday; there are generated checksums for the /pga2 directory in the directory itself. I did not have time to check the results though. Note that with just 3 workers it takes a few hours to run.

smola commented 6 years ago

I've verified that the /pga2 directory matches the reference md5s provided by @rporres, but there are 42662 missing files. Those that are present are correct.

It seems part of the missing files are actually not in the index, so it is expected they are not in the downloaded copy.

bzz commented 6 years ago

Those that are present are correct.

🎉 🙇 to #69

42662 missing files. It seems part of the missing files are actually not in the index, so it is expected they are not in the downloaded copy.

This was something @vmarkovtsev was worried about. Are all 42662 missing from the index?

smola commented 6 years ago

@bzz Not all of them are missing from the index; my guess is that part of them are missing just because of a crash in the download process?