src-d / ml-backlog

Issues belonging to source{d}'s Machine Learning team which cannot be related to a specific repository.

Extract UAST from the HEAD of PGAv2 #74

Closed r0mainK closed 5 years ago

r0mainK commented 5 years ago

Context

As soon as the Infra team has copied PGAv2 to the ML cluster, we will start using it often. Most of the time, we will be extracting UASTs from the HEAD and then processing them further. There is no reason to repeatedly query Gitbase for this, so we should do the extraction only once.

Task

Use Gitbase to extract and store, for every parsable file at the HEAD, the UAST, repository name, file name and language. The storage format should be compatible with Spark so we can easily reuse it, hence it should probably be Parquet. Since the UASTs are relatively heavy, we should check whether we can compress them further beforehand; the LA team may have some insight on this.
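For reference, a minimal sketch of what such an extraction could look like, assuming Gitbase is reachable from Spark over its MySQL-compatible protocol via JDBC. The host/port, output path and the exact gitbase table/function names (`refs`, `commit_files`, `files`, `LANGUAGE`, `UAST`) are taken from memory of the gitbase docs and should be double-checked:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pga2-head-uasts").getOrCreate()

# UAST, repository name, file name and language for every parsable file at HEAD
query = """
SELECT repository_id AS repository_name,
       file_path     AS file_name,
       LANGUAGE(file_path, blob_content) AS lang,
       UAST(blob_content, LANGUAGE(file_path, blob_content)) AS uast
FROM refs
NATURAL JOIN commit_files
NATURAL JOIN files
WHERE ref_name = 'HEAD'
"""

df = (spark.read.format("jdbc")
      .option("url", "jdbc:mysql://gitbase:3306/gitbase")   # assumed address
      .option("driver", "com.mysql.cj.jdbc.Driver")
      .option("query", query)
      .load())

# Parquet so the result can be reused directly from Spark later on
df.write.parquet("/spark_storage/uasts_pga.v2")  # illustrative output path
```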

Checklist

r0mainK commented 5 years ago

So, after asking LA, it seems that they have discussed it and developed some prototypes, but they have nothing operational. Hence, we will not be compressing the UASTs, and will rely purely on Parquet for compression. They might work on it this quarter, but we shouldn't count on it in the near future.

Once the Spark cluster is usable, PGAv2 is ready, and I have created the features for the imports task, I will take care of this.

r0mainK commented 5 years ago

PGAv2 has been copied to the cluster, and Spark and Gitbase are usable. Currently I'm cleaning up the /user/repositories directory to only have PGA in it; at the current speed it should take until tomorrow. I will then try to do this after the imports task, so I can assess the problems of dealing with so much data.

r0mainK commented 5 years ago

After talking to Vadim about progress on this, I decided to start doing it now. I'm going to test out different compression schemes via Parquet, as well as see whether I'm able to scale things with Spark and gitbase-spark-connector-e.

EDIT: OK, so the best scheme is no compression when writing the Parquet, then tar/gzip of the resulting Parquet. It achieves a compression rate almost ten times better than the per-block gzip that Parquet uses.
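A minimal sketch of that scheme, with a toy DataFrame and illustrative paths (the real job writes the actual UAST DataFrame, of course):

```python
import subprocess
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-compression-test").getOrCreate()
df = spark.createDataFrame([("repo", "file.py", "uast-bytes")],
                           ["repository_name", "file_name", "uast"])

out_dir = "/tmp/uast_sample.parquet"  # illustrative path
# write the parquet with compression disabled
df.write.mode("overwrite").option("compression", "none").parquet(out_dir)

# then gzip the whole parquet directory as a single tar.gz archive
subprocess.run(["tar", "-czf", out_dir + ".tar.gz", "-C", "/tmp",
                "uast_sample.parquet"], check=True)
```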

r0mainK commented 5 years ago

Okay, so this answer from Miguel and the following discussions with Maartje made it quite clear that:

This means that:

So in order to do this task, here is the best plan I can come up with:

  1. Ask the Infra team for pipeline access
  2. Use the pipeline cluster to do this task - there may be the same issue with volume mounting as on the ML cluster (see link below)
  3. Move the Parquet files from the pipeline cluster to the ML cluster, the location depending on this issue
  4. Use the index to add a repository_name column to the Parquet files, as currently this information is not included and some repositories are split across multiple siva files (see the sketch after this list).
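A hedged sketch of step 4, assuming the UAST parquet files carry a siva_file column and that the PGA index is available as a CSV mapping siva files to repository names; all paths and column names below are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("add-repo-name").getOrCreate()

uasts = spark.read.parquet("/spark_storage/uasts_pga.v2/*")        # assumed layout
index = (spark.read.option("header", True)
         .csv("/spark_storage/pga.v2.index.csv")                   # illustrative path
         .select("siva_file", "repository_name"))

# One repository can span several siva files, so join on the siva file name
# rather than assuming a one-to-one mapping.
with_names = uasts.join(index, on="siva_file", how="left")
with_names.write.parquet("/spark_storage/uasts_pga.v2_named")      # illustrative output
```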
r0mainK commented 5 years ago

Had a meeting with Alex, updated the checklist accordingly.

vmarkovtsev commented 5 years ago

Ask Vadim which kind of UAST should be retrieved (native, annotated, semantic)

Semantic.

vmarkovtsev commented 5 years ago

We are meeting with Máximo tomorrow to discuss the problems; he has suggested abandoning gitbase in favor of a custom Go solution. Let's see.

vmarkovtsev commented 5 years ago

I coded https://github.com/src-d/datasets/tree/master/PublicGitArchive/pga2uast to do the task. There are 3 stages:

vmarkovtsev commented 5 years ago

We used the 7 nodes given by the infra, mining-5 to mining-11.

Size of the result from the first stage: 4.5TB. Number of OOMed sivas: 211.

Current progress of the second stage: 17/211 in 3 days. This means ETA is 34 days. However, I am using a single node. Once the DockerHub mining is over I will be able to spread the load over all the 11 nodes.

r0mainK commented 5 years ago

Just finished the sanity check on the Siva files that were parsed via aggressive parallelization. My workflow was the following:

The job failed due to the presence of unreadable parquet files. The error was triggered if I read a specific parquet, or if I tried to count the rows of the DataFrame when loading from a subdir. So I went through each subdir, loading then counting the DataFrame, and if an error was caught I loaded each file in the subdir to find the corrupt ones. Once I had identified all the files, I moved them to /spark_storage/pga.v2.corrupted and saved their names in /user/r0maink/sanity/corrupt_pq_1.txt.

Once that was done, I repeated the previous step, with the same result but due to a different error. This one was also caused by corrupt parquet files, but simply loading/counting did not trigger the error; I had to actually use the contents, for instance by collecting the rows. So I repeated the error-finding process, then moved the files to /spark_storage/pga.v2.corrupted_2 and saved their names in /user/r0maink/sanity/corrupt_pq_2.txt.

Once that was done, I repeated the previous step, this time hitting a JavaHeapSpace error at about 20-25% of the progress. As I had not optimized the query, knew Infra was gonna work on the cluster today, and did not need to work on more than one subdir at a time to compute the CSV, I ran the job on each subdir independently. It just finished after ~12h (this does not really reflect the performance to be expected, as I did not try to optimize, group multiple subdirs, etc.).
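For reference, a minimal sketch of that load/count/collect check, with illustrative paths and assuming /spark_storage is mounted locally:

```python
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pga2-sanity-check").getOrCreate()
root = "/spark_storage/uasts_pga.v2"  # assumed location of the parquet subdirs
corrupt = []

for subdir in sorted(os.listdir(root)):
    path = os.path.join(root, subdir)
    try:
        # counting catches the first kind of error; collecting a few rows
        # catches the kind that only shows up when the contents are used
        df = spark.read.parquet(path)
        df.count()
        df.limit(10).collect()
    except Exception:
        # the subdir contains at least one bad parquet: check them one by one
        for fname in os.listdir(path):
            fpath = os.path.join(path, fname)
            try:
                spark.read.parquet(fpath).limit(10).collect()
            except Exception:
                corrupt.append(fpath)

print("\n".join(corrupt))
```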

Anyway, the CSV weighs 2.02 GB and has 216,752 lines, i.e. the non-corrupt parquet files contain that many (siva file, repo uuid, file_list) triplets. The total number of files is 32,851,320. By the way, I removed files that had a null UAST, so there might be more files in the parquet files, just with empty UASTs.

Here are some stats about the corrupt files (as you can see, they were all concentrated in the same 7 subdirectories). Given their number, I think it's worth trying to parse the siva files once more to see whether the error was due to the process or something else (you can take the listings on the ML cluster from the locations given above).

| | # files | # non-corrupt | # corrupt 1 | # corrupt 2 |
|---|---|---|---|---|
| all subdirs | 203,870 | 203,736 (99.93 %) | 30 (0.01 %) | 104 (0.05 %) |
| subdir 28 | 825 | 804 (97.45 %) | 4 (0.48 %) | 17 (2.06 %) |
| subdir 2a | 806 | 781 (96.90 %) | 5 (0.62 %) | 20 (2.48 %) |
| subdir 2c | 828 | 804 (97.10 %) | 3 (0.36 %) | 21 (2.54 %) |
| subdir 2d | 847 | 818 (96.58 %) | 6 (0.71 %) | 23 (2.72 %) |
| subdir 2e | 810 | 777 (95.93 %) | 10 (1.23 %) | 23 (2.84 %) |
| subdir 2f | 850 | 849 (99.88 %) | 1 (0.12 %) | 0 (0.00 %) |
| subdir 65 | 783 | 782 (99.87 %) | 1 (0.13 %) | 0 (0.00 %) |
| | size | non-corrupt size | corrupt 1 size | corrupt 2 size |
|---|---|---|---|---|
| all subdirs | 4.861 TB | 4.811 TB (98.97 %) | 6.18 GB (0.13 %) | 43.98 GB (0.9 %) |
| subdir 28 | 16.16 GB | 11.19 GB (69.25 %) | 1.51 GB (9.37 %) | 3.45 GB (21.38 %) |
| subdir 2a | 15.76 GB | 10.87 GB (68.97 %) | 1.92 GB (12.17 %) | 2.97 GB (18.86 %) |
| subdir 2c | 24.62 GB | 11.30 GB (45.91 %) | 85 MB (0.35 %) | 13.23 GB (53.74 %) |
| subdir 2d | 23.51 GB | 10.36 GB (44.06 %) | 392.95 MB (1.67 %) | 12.76 GB (54.27 %) |
| subdir 2e | 19.66 GB | 6.54 GB (33.29 %) | 1.54 GB (7.86 %) | 11.57 GB (58.85 %) |
| subdir 2f | 17.93 GB | 17.20 GB (95.97 %) | 721 MB (4.03 %) | 0 B (0.00 %) |
| subdir 65 | 15.91 GB | 15.91 GB (99.99 %) | ~0 B (0.01 %) | 0 B (0.00 %) |
vmarkovtsev commented 5 years ago

Great report @r0mainK

This means that I need to re-process a small fraction of files which are corrupted.

r0mainK commented 5 years ago

Thanks, yep, the listings are in /user/r0maink/sanity/corrupt_pq_1.txt and /user/r0maink/sanity/corrupt_pq_2.txt. If you can put them in a separate directory under /spark_storage/pg1.v2.v2 or something so I can process them directly, it would be great. As it's only ~50GB it should not take too long - and hopefully the error will not repeat itself.

vmarkovtsev commented 5 years ago

The new files were generated and overwritten over the corrupted ones. @r0mainK Please test once again, there shall be no corruptions this time.

I had to write them directly, unfortunately.

r0mainK commented 5 years ago

@vmarkovtsev no problem. Anyway, I did not know this, but it turns out that when you call a repartition on a DataFrame you can't use the built-in input_file_name, so the 2 columns for the subdir and siva file path were empty in each row -_-"
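For reference, the usual workaround is to capture input_file_name() before the repartition, roughly like this (path and partition count are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("input-file-name-demo").getOrCreate()

df = (spark.read.parquet("/spark_storage/uasts_pga.v2/00")  # assumed path
      .withColumn("parquet_path", F.input_file_name()))     # capture first

# Repartitioning afterwards keeps the already-materialized column intact.
df = df.repartition(200)
df.select("parquet_path").show(5, truncate=False)
```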

So I launched the test once again, will post results once I have them.

r0mainK commented 5 years ago

Okay, the job finished in 5h30 (the repartition really was a dumb idea, removing it halved the processing time). I checked the CSV file and this time it's good. It is slightly bigger - especially in terms of # files, as could be expected from the size of the old corrupted files:

vmarkovtsev commented 5 years ago

I have created https://github.com/src-d/datasets/pull/158 to list the siva HEADs.

I launched the listing with 32 goroutines on the ML cluster, it digested 17% in 18 hours. ETA 4 days. I will have to interrupt it on Wednesday though.

vmarkovtsev commented 5 years ago

I parsed the OOMed sivas. I was able to process 204/211 files. The results are merged with the main dataset.

@r0mainK it is time to run the check again!

Regarding the listing, it is 80% complete. ETA Friday.

r0mainK commented 5 years ago

Awesome, I've relaunched the process with the same config, let's see how it goes - I expect it to be done by tomorrow, unless something goes horribly wrong :crossed_fingers:

r0mainK commented 5 years ago

@vmarkovtsev extraction completed! It ran in 6 hours 8 mins, so a bit more than last time. Unfortunately, it did not cover 100% of the files: there are currently 835 siva/parquet files from the old index that are missing. I cross-checked, and it seems all the missing files were from the 00 subdirectory, which surprisingly contains a bit more than that number of files.

So I tried to read and count it, and it indeed caused an error. I inspected the directory, there are 3 new files:

-rw-rw-r--. 1 1004 1004  1.8G Aug  3 09:05 00824011103c689db12451a6f73f84b57a6d05e0.parquet
-rw-rw-r--. 1 1004 1004  3.0G Aug  3 08:22 0079cc5fa5b7d13fd201fbae276b01f7f27f8dc9.parquet
-rw-rw-r--. 1 1004 1004   17G Aug  3 02:31 0067e598fa2532b9a914984456d6bff752a0cfd3.parquet

I loaded each one individually and tried to collect them, and you guessed it: the first two did not cause any error, it was the third, 17GB one that did, and it caused the whole subdir to crash. So I moved that single file to /spark_storage/the_bad_siva/ and afterwards it worked. Anyway, for the sake of comparing true run times (and since we won't have the listing until Friday in any case) I'm gonna relaunch the whole process; it should be over by this evening - and I'll add final metrics here.

vmarkovtsev commented 5 years ago

It ran in 6 hours 8 mins

I am still listing files in PGA, so the FS performance was degraded.

I have renamed /spark_storage/the_bad_siva/ to /spark_storage/bad_uasts/.

I didn't fully get 835. You are saying that there are 835 siva files under 00 which are in the index but are not extracted, right?

r0mainK commented 5 years ago

I didn't fully get 835. You are saying that there are 835 siva files under 00 which are in the index but are not extracted, right?

@vmarkovtsev no, what I meant was that there were 835 files missing from the new index that were already present in the previous index. This was due to the fact that one corrupt file added to the 00 subdir made the job on that subdir fail completely, so the 835 files, plus the new ones, did not appear in the new index. But all of those files were extracted successfully; Spark just failed to process them due to that one bad siva.

vmarkovtsev commented 5 years ago

The file listing is at:

/spark_storage/files_pga.v2

However, the listing has 148230 files compared to 204069 uasts. Weird. I have to re-launch the listing on the missing files.

I renamed /spark_storage/uast_pga.v2 to /spark_storage/uasts_pga.v2

vmarkovtsev commented 5 years ago

@r0mainK The listing is finally over! 205546 files.

/spark_storage/files_pga.v2

The structure is flat, there are no subdirs.

vmarkovtsev commented 5 years ago

I set the access for all dirs and subdirs in /spark_storage to 555. We've got

r0mainK commented 5 years ago

I have made a check on the processed index: the 14 subdir was not processed - the Spark jobs on it failed with a Java OOM error. I traced the origin of the problem back to a single file (14/147288108757caed09e0c65d9ec098b821129eba.parquet), which I added to the bad_uasts directory. Relaunching processing. Once it is done, I will finish up this task.

r0mainK commented 5 years ago

Okay, so I finished the extraction without any errors :100: I did find an issue with parquet file 61/614fa43723122e2a8318d65104991163b9915d72.parquet (it was empty), so I moved it to the bad_uasts folder.

As expected, the CSV file is a bit bigger (2.24 GB): it now contains 218,081 lines (UUID-siva/parquet file pairs) and a total of 36,109,756 files across all repos. This means that the added stragglers increased the file count by about 8.3 %. Also, it seems there is some duplication across PGA (most probably some files were processed twice under different UUIDs), as I found 35,991,897 distinct files over 218,023 distinct UUIDs.

I then extracted the list of files per UUID from the theoretical listings that Vadim provided for each repo, and then did the sanity check. Although Vadim had warned that the theoretical listing was incomplete, I still found more distinct UUIDs (219,610) and a total of 40,285,913 distinct files in it.
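For reference, a rough sketch of that comparison, treating both listings as (repo_uuid, file_path) DataFrames; the CSV paths and column names below are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("listing-sanity-check").getOrCreate()

processed = (spark.read.option("header", True)
             .csv("/user/r0maink/sanity/processed_listing.csv")   # illustrative
             .select("repo_uuid", "file_path").distinct())
theoretical = (spark.read.option("header", True)
               .csv("/spark_storage/files_pga.v2/*")              # illustrative
               .select("repo_uuid", "file_path").distinct())

# set arithmetic on the two listings
union = processed.union(theoretical).distinct().count()
inter = processed.intersect(theoretical).count()
only_processed = processed.subtract(theoretical).count()
only_theoretical = theoretical.subtract(processed).count()
print(union, inter, only_processed, only_theoretical)
```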

Anyways here are the results of the sanity check:

| | file count | % of union | uuid count | % of union |
|---|---|---|---|---|
| union of both listings | 40,603,063 | 100 % | 219,610 | 100 % |
| intersection of both listings | 35,674,747 | 87.86 % | 218,023 | 99.28 % |
| only in parquet listing | 317,150 | 0.78 % | 0 | 0 % |
| only in theoretical listing | 4,611,166 | 11.36 % | 1,587 | 0.72 % |

I also looked into more granular results:

| | 1st quartile | Mean | Median | 3rd quartile |
|---|---|---|---|---|
| % of extracted files per UUID | 80 % | 86 % | 92 % | 99 % |

As can be seen, although overall we extracted ~88% of files, the extraction rate varies a lot depending on the repo. As you can see on the scatter plot below, there seems to be a positive correlation between the number of files in a repo and the fraction that is extracted, but not much more. I suspect that if we looked at these rates per language we would find that most errors come from specific drivers.

(scatter plot: fraction of extracted files vs. number of files per repo)

vmarkovtsev commented 5 years ago

Awesome!

Is it possible to study per language, also to gather repo/paths of files which could not be extracted?

r0mainK commented 5 years ago

@vmarkovtsev I was about to edit my message above :p So yeah, that's on me actually: I think I had mentioned in meetings that it would be useful to have the language of each file in the parquet files, but I forgot to write it down in this issue. So currently, I could only do this using regexps on the filenames. I already had some experience doing it that way back for the apollo blogpost, when we had a similar albeit much smaller dataset, and it was pretty bad.

I think we should just rerun the listing and add this information, if possible? Getting the byte size of each file would be interesting as well, I think. If the processing is as efficient as the first time, we will miss less than 1% of files, and we can get that number further down with regexps. What do you think?

Also yes, I can create a CSV with the following schema if you want, using the CSVs I've created and the index: subdir,siva_hash,repo_uuid,repo_name,file_name

vmarkovtsev commented 5 years ago

OK, I will edit the code and re-launch the listing tomorrow.

vmarkovtsev commented 5 years ago

I launched the new listing.

vmarkovtsev commented 5 years ago

It is funny that we've got 317150 files only in the UASTs. I hope that this time a clean run will be flawless.

r0mainK commented 5 years ago

Yeah, it's surprising, especially as all of those files are in repos that were listed at least in part.

vmarkovtsev commented 5 years ago

Writing this while I remember. An important detail about how we should calculate the success rate: it must be calculated on the intersection of \<sivas listed> x \<uasts extracted>. In other words, we should ignore listed siva files which were not at least partially extracted.

The reason is that a failure to extract a siva is not necessarily due to Babelfish: the file can just be too big for the given hardware and algorithm.
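A tiny sketch of the two rates under that definition, with made-up sets of (siva, file) pairs:

```python
def rates(extracted: set, listed: set) -> tuple:
    """Extraction rate over everything vs. success rate over reachable sivas."""
    union = extracted | listed
    extraction_rate = len(extracted) / len(union)
    # only keep listed files whose siva was at least partially extracted
    touched_sivas = {siva for siva, _ in extracted}
    reachable = {(s, f) for s, f in union if s in touched_sivas}
    success_rate = len(extracted & reachable) / len(reachable)
    return extraction_rate, success_rate

ext = {("s1", "a.py"), ("s1", "b.py")}
lst = {("s1", "a.py"), ("s1", "b.py"), ("s1", "c.py"), ("s2", "d.py")}
print(rates(ext, lst))  # (0.5, 0.666...) -> s2 is ignored for the success rate
```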

r0mainK commented 5 years ago

Ah yeah, nice point I had not considered it :100:

I just recomputed the values, defining the real union with your definition, i.e. files in sivas where at least one file was extracted. I found 38,500,270 files, which is 94.82 % of the previous union. This means that:

vmarkovtsev commented 5 years ago

@r0mainK The new listing has finished, it is in the same place. 205283 files. As you see, this number is less than the previous one and I am going crazy to find out why. I need to find the missing names and re-launch on them.

vmarkovtsev commented 5 years ago

Update: I found the missing 801 files, listed them and put them in the output directory. Now the overall count is 206084, and we should not have weird files that are present only in the uasts.

Please run the stats per language!

r0mainK commented 5 years ago

@vmarkovtsev running the stats now, however 97,049 files are still missing from the listing. They come from 402 siva files, all of which were in the listing (save for those repos). No idea why they were not listed. Anyway, the exact list is here: missing.txt (format is subdir,siva_hash,repo_uuid,file_name)

I will update today with language stats

EDIT: okay, so this might actually be a bug in my reading of the CSV files (newlines in some of the filenames, just loving it). EDIT2: yep, it's the newlines. If there are still any missing I will tell you. EDIT3: OK, this is gonna take me a bit more time, will update after the retreat. Some people name their files with commas, newlines and quotation marks, which broke my CSV. I'm gonna recreate the listings taking that into account.
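For reference, a minimal sketch of writing the listing with proper quoting so that commas, quotes and newlines inside file paths survive a round trip (the example row and output path are made up):

```python
import csv

rows = [
    ("00", "somesivahash", "some-repo-uuid", 'weird,"name"\nwith newline.py'),
]

# QUOTE_ALL quotes every field, so embedded commas, quotes and newlines are safe
with open("listing_fixed.csv", "w", newline="") as fh:
    writer = csv.writer(fh, quoting=csv.QUOTE_ALL)
    writer.writerow(["subdir", "siva_hash", "repo_uuid", "file_name"])
    writer.writerows(rows)

# reading back with csv.reader (and newline="") round-trips such paths correctly
with open("listing_fixed.csv", newline="") as fh:
    assert list(csv.reader(fh))[1][3] == rows[0][3]
```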

r0mainK commented 5 years ago

OK, so this is the final report (hopefully). In the following, I'll be calling extraction rate the ratio of files extracted over all files, and success rate the ratio of files extracted over files in repositories that were at least partially extracted (i.e. where Babelfish errored, not something else). I'll also be calling theoretical the listing Vadim created by crawling PGA, and processed the listing I extracted from the Parquet files.

preliminary

First off, here are various counts of interest for both listings. As planned, I did not find files in the processed listing that were not in the theoretical listing; however, I did find that a small number of files were duplicated in both listings (i.e. had the same UUID and file path). I do not know why that was the case; I'm guessing it is something in PGA, or the way we crawled it.

| | # of sivas | # of repos | # of files | # of distinct files | % of duplicates |
|---|---|---|---|---|---|
| processed | 204,067 | 218,023 | 36,162,330 | 35,991,340 | 0.5 % |
| theoretical | 206,084 | 220,174 | 40,971,787 | 40,829,244 | 0.3 % |

In the following I'll be computing stats over the distinct files.

extraction and success rate

So overall, a few things to note:

| | extraction rate | success rate |
|---|---|---|
| sivas | 99.02 % | 100 % |
| repos | 99.02 % | 100 % |
| files | 88.15 % | 94.26 % |
| bytes | 65.37 % | 82.12 % |

language specific analysis

I ran the same analysis per language as you asked. As you can see, results are clearly unequal. Looking first at files, we can see 3 groups appear:

| | file count | file extraction rate | file success rate |
|---|---|---|---|
| Go | 4,126,578 | 99.88 % | 99.88 % |
| Python | 2,994,169 | 89.70 % | 91.93 % |
| C++ | 8,726,368 | 80.41 % | 86.66 % |
| C# | 2,379,754 | 98.99 % | 99.13 % |
| Java | 6,985,742 | 96.85 % | 98.69 % |
| JavaScript | 10,466,131 | 80.54 % | 97.14 % |
| Ruby | 1,143,654 | 96.70 % | 96.76 % |
| PHP | 2,888,395 | 87.64 % | 87.94 % |
| Shell | 1,118,453 | 87.54 % | 88.42 % |

Looking now at bytes, we see the same trend as before, i.e. both rates are lower since the larger files are the ones causing problems. However, there are still some things to note:

| | byte size | byte extraction rate | byte success rate |
|---|---|---|---|
| Go | 56.48 GB | 96.12 % | 96.13 % |
| Python | 22.84 GB | 84.36 % | 86.28 % |
| C++ | 22.84 GB | 63.69 % | 67.19 % |
| C# | 15.43 GB | 93.12 % | 93.32 % |
| Java | 42.19 GB | 95.26 % | 98.94 % |
| JavaScript | 227.68 GB | 50.09 % | 83.06 % |
| Ruby | 3.42 GB | 91.56 % | 91.72 % |
| PHP | 15.55 GB | 71.92 % | 72.10 % |
| Shell | 8.26 GB | 25.97 % | 26.26 % |

repo specific analysis

I plotted the same heatmap as in one of the posts above; it seems the strange distribution was due to errors in my code. This time, I found no correlation between the number of files in a given repository and the fraction of extracted files in that repository. As you can see from the histograms below, most repos were fully extracted (61 % of them), and the ratio of extracted files per repo actually follows an exponential law, further indicating that we're looking at sporadic events. I did not look per language; I'm guessing we'd find the same distribution but more or less pronounced depending on the driver.

(histograms: ratio of extracted files per repository)

conclusion

Anyway, I think I've covered more or less everything. If you want me to go into more detail, no problem. By the way, here is a zipped CSV with the subdir, siva, repo_uuid, file_path of the 2,191,222 files that apparently caused a Babelfish error (I removed newlines from the file paths).

vmarkovtsev commented 5 years ago

@r0mainK Is it possible to add the language to that CSV file with bblfsh errors?

repo_uuid is the repo name, correct?

r0mainK commented 5 years ago

@vmarkovtsev Added language and repository name. Left the PGA metadata in just in case; the schema is now subdir, siva, repo_uuid, repo_name, lang, file_path.

The zipped CSV can be found here