src-d / ml-backlog

Issues belonging to source{d}'s Machine Learning team which cannot be related to a specific repository.

Extract UAST from the HEAD of PGAv2 #74

Closed r0mainK closed 5 years ago

r0mainK commented 5 years ago

Context

As soon as the Infra team has copied PGAv2 to the ML cluster, we will start using it often. Most of the time, we will be extracting UASTs from the HEAD and then processing them further. There is no reason to repeatedly query Gitbase for this, so we should do the extraction only once.

Task

Use Gitbase to extract and store, for every parsable file at the HEAD, the UAST, repository name, file name and language. The storage format should be compatible with Spark so we can easily reuse it, hence it should probably be Parquet. Since the UASTs are relatively heavy, we should check whether we can compress them further beforehand; the LA team may have some insight on this.
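For reference, a minimal sketch of what such an extraction could look like, assuming Gitbase is reachable from Spark over its MySQL-compatible protocol via JDBC. The host/port, output path and the exact gitbase table/function names (`refs`, `commit_files`, `files`, `LANGUAGE`, `UAST`) are taken from memory of the gitbase docs and should be double-checked:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pga2-head-uasts").getOrCreate()

# UAST, repository name, file name and language for every parsable file at HEAD
query = """
SELECT repository_id AS repository_name,
       file_path     AS file_name,
       LANGUAGE(file_path, blob_content) AS lang,
       UAST(blob_content, LANGUAGE(file_path, blob_content)) AS uast
FROM refs
NATURAL JOIN commit_files
NATURAL JOIN files
WHERE ref_name = 'HEAD'
"""

df = (spark.read.format("jdbc")
      .option("url", "jdbc:mysql://gitbase:3306/gitbase")   # assumed address
      .option("driver", "com.mysql.cj.jdbc.Driver")
      .option("query", query)
      .load())

# Parquet so the result can be reused directly from Spark later on
df.write.parquet("/spark_storage/uasts_pga.v2")  # illustrative output path
```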

Checklist

r0mainK commented 5 years ago

So, after asking LA, it seems that they have discussed it and developed some prototypes, but they have nothing operational. Hence, we will not be compressing the UASTs, and will rely purely on Parquet for compression. They might work on it this quarter, but we shouldn't count on it in the near future.

Once the Spark cluster is usable, PGAv2 is ready, and I have created the features for the imports task, I will take care of this.

r0mainK commented 5 years ago

PGAv2 has been copied to the cluster, and Spark and Gitbase are usable. Currently I'm cleaning up the /user/repositories directory to only have PGA in it; at the current speed it should take until tomorrow. I will then try to do this after the imports task, so I can assess the problems of dealing with so much data.

r0mainK commented 5 years ago

After talking to Vadim about progress on this, I decided to start doing it now. I'm going to test out different compression schemes via Parquet, as well as see whether I'm able to scale things with Spark and gitbase-spark-connector-e.

EDIT: OK, so the best scheme is no compression when writing the Parquet, then tar/gzip of the resulting Parquet. It achieves a compression rate almost ten times better than the per-block gzip that Parquet uses.
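A minimal sketch of that scheme, with a toy DataFrame and illustrative paths (the real job writes the actual UAST DataFrame, of course):

```python
import subprocess
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-compression-test").getOrCreate()
df = spark.createDataFrame([("repo", "file.py", "uast-bytes")],
                           ["repository_name", "file_name", "uast"])

out_dir = "/tmp/uast_sample.parquet"  # illustrative path
# write the parquet with compression disabled
df.write.mode("overwrite").option("compression", "none").parquet(out_dir)

# then gzip the whole parquet directory as a single tar.gz archive
subprocess.run(["tar", "-czf", out_dir + ".tar.gz", "-C", "/tmp",
                "uast_sample.parquet"], check=True)
```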

r0mainK commented 5 years ago

Okay, so this answer from Miguel and the following discussions with Maartje made it quite clear that:

This means that:

So in order to do this task, here is the best plan I can come up with:

  1. Ask the Infra team for pipeline access
  2. Use the pipeline cluster to do this task - there may be the same issue with volume mounting as on the ML cluster (see link below)
  3. Move the Parquet files from the pipeline cluster to the ML cluster, the location depending on this issue
  4. Use the index to add a repository_name column to the Parquet files, as currently this information is not included and some repositories are split across multiple siva files (see the sketch after this list).
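A hedged sketch of step 4, assuming the UAST parquet files carry a siva_file column and that the PGA index is available as a CSV mapping siva files to repository names; all paths and column names below are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("add-repo-name").getOrCreate()

uasts = spark.read.parquet("/spark_storage/uasts_pga.v2/*")        # assumed layout
index = (spark.read.option("header", True)
         .csv("/spark_storage/pga.v2.index.csv")                   # illustrative path
         .select("siva_file", "repository_name"))

# One repository can span several siva files, so join on the siva file name
# rather than assuming a one-to-one mapping.
with_names = uasts.join(index, on="siva_file", how="left")
with_names.write.parquet("/spark_storage/uasts_pga.v2_named")      # illustrative output
```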
r0mainK commented 5 years ago

Had a meeting with Alex, updated the checklist accordingly.

vmarkovtsev commented 5 years ago

Ask Vadim which kind of UAST should be retrieved (native, annotated, semantic)

Semantic.

vmarkovtsev commented 5 years ago

We are meeting with Máximo tomorrow to discuss the problems; he has suggested abandoning gitbase in favor of a custom Go solution. Let's see.

vmarkovtsev commented 5 years ago

I coded https://github.com/src-d/datasets/tree/master/PublicGitArchive/pga2uast to do the task. There are 3 stages:

vmarkovtsev commented 5 years ago

We used the 7 nodes given by the infra, mining-5 to mining-11.

Size of the result from the first stage: 4.5TB. Number of OOMed sivas: 211.

Current progress of the second stage: 17/211 in 3 days. This means ETA is 34 days. However, I am using a single node. Once the DockerHub mining is over I will be able to spread the load over all the 11 nodes.

r0mainK commented 5 years ago

Just finished the sanity check on the Siva files that were parsed via aggressive parallelization. My workflow was the following:

The job failed due to the presence of unreadable parquet files. The error was triggered if I read a specific parquet, or if I tried to count the rows of the DataFrame when loading from a subdir. So I went through each subdir, loading then counting the DataFrame, and if an error was caught I loaded each file in the subdir to find the corrupt ones. Once I had identified all the files, I moved them to /spark_storage/pga.v2.corrupted and saved their names in /user/r0maink/sanity/corrupt_pq_1.txt.

Once that was done, I repeated the previous step, with the same result but due to a different error. This one was also caused by corrupt parquet files, but simply loading/counting did not trigger the error; I had to actually use the contents, for instance by collecting the rows. So I repeated the error-finding process, then moved the files to /spark_storage/pga.v2.corrupted_2 and saved their names in /user/r0maink/sanity/corrupt_pq_2.txt.

Once that was done, I repeated the previous step, this time hitting a JavaHeapSpace error at about 20-25% of the progress. As I had not optimized the query, knew Infra was gonna work on the cluster today, and did not need to work on more than one subdir at a time to compute the CSV, I ran the job on each subdir independently. It just finished after ~12h (this does not really reflect the performance to be expected, as I did not try to optimize, group multiple subdirs, etc.).
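For reference, a minimal sketch of that load/count/collect check, with illustrative paths and assuming /spark_storage is mounted locally:

```python
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pga2-sanity-check").getOrCreate()
root = "/spark_storage/uasts_pga.v2"  # assumed location of the parquet subdirs
corrupt = []

for subdir in sorted(os.listdir(root)):
    path = os.path.join(root, subdir)
    try:
        # counting catches the first kind of error; collecting a few rows
        # catches the kind that only shows up when the contents are used
        df = spark.read.parquet(path)
        df.count()
        df.limit(10).collect()
    except Exception:
        # the subdir contains at least one bad parquet: check them one by one
        for fname in os.listdir(path):
            fpath = os.path.join(path, fname)
            try:
                spark.read.parquet(fpath).limit(10).collect()
            except Exception:
                corrupt.append(fpath)

print("\n".join(corrupt))
```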

Anyway, the CSV weighs 2.02 GB and has 216,752 lines, i.e. the non-corrupt parquet files contain that many (siva file, repo uuid, file_list) triplets. The total number of files is 32,851,320. By the way, I removed files that had a null UAST, so there might be more files in the parquet files, just with empty UASTs.

Here are some stats about the corrupt files (as you can see, they were all concentrated in the same 7 subdirectories). Given their number, I think it's worth trying to parse the siva files once more to see whether the error was due to the process or something else (you can take the listings on the ML cluster from the locations given above).

| | # files | # non-corrupt | # corrupt 1 | # corrupt 2 |
|---|---|---|---|---|
| all subdirs | 203,870 | 203,736 (99.93 %) | 30 (0.01 %) | 104 (0.05 %) |
| subdir 28 | 825 | 804 (97.45 %) | 4 (0.48 %) | 17 (2.06 %) |
| subdir 2a | 806 | 781 (96.90 %) | 5 (0.62 %) | 20 (2.48 %) |
| subdir 2c | 828 | 804 (97.10 %) | 3 (0.36 %) | 21 (2.54 %) |
| subdir 2d | 847 | 818 (96.58 %) | 6 (0.71 %) | 23 (2.72 %) |
| subdir 2e | 810 | 777 (95.93 %) | 10 (1.23 %) | 23 (2.84 %) |
| subdir 2f | 850 | 849 (99.88 %) | 1 (0.12 %) | 0 (0.00 %) |
| subdir 65 | 783 | 782 (99.87 %) | 1 (0.13 %) | 0 (0.00 %) |
| | size | non-corrupt size | corrupt 1 size | corrupt 2 size |
|---|---|---|---|---|
| all subdirs | 4.861 TB | 4.811 TB (98.97 %) | 6.18 GB (0.13 %) | 43.98 GB (0.9 %) |
| subdir 28 | 16.16 GB | 11.19 GB (69.25 %) | 1.51 GB (9.37 %) | 3.45 GB (21.38 %) |
| subdir 2a | 15.76 GB | 10.87 GB (68.97 %) | 1.92 GB (12.17 %) | 2.97 GB (18.86 %) |
| subdir 2c | 24.62 GB | 11.30 GB (45.91 %) | 85 MB (0.35 %) | 13.23 GB (53.74 %) |
| subdir 2d | 23.51 GB | 10.36 GB (44.06 %) | 392.95 MB (1.67 %) | 12.76 GB (54.27 %) |
| subdir 2e | 19.66 GB | 6.54 GB (33.29 %) | 1.54 GB (7.86 %) | 11.57 GB (58.85 %) |
| subdir 2f | 17.93 GB | 17.20 GB (95.97 %) | 721 MB (4.03 %) | 0 B (0.00 %) |
| subdir 65 | 15.91 GB | 15.91 GB (99.99 %) | ~0 B (0.01 %) | 0 B (0.00 %) |
vmarkovtsev commented 5 years ago

Great report @r0mainK

This means that I need to re-process a small fraction of files which are corrupted.

r0mainK commented 5 years ago

Thanks, yep, the listings are in /user/r0maink/sanity/corrupt_pq_1.txt and /user/r0maink/sanity/corrupt_pq_2.txt. If you can put them in a separate directory under /spark_storage/pg1.v2.v2 or something so I can process them directly, it would be great. As it's only ~50GB it should not take too long - and hopefully the error will not repeat itself.

vmarkovtsev commented 5 years ago

The new files were generated and overwritten over the corrupted ones. @r0mainK Please test once again, there shall be no corruptions this time.

I had to write them directly, unfortunately.

r0mainK commented 5 years ago

@vmarkovtsev no problem. Anyway, I did not know this, but it turns out that when you call a repartition on a DataFrame you can't use the built-in input_file_name, so the 2 columns for the subdir and siva file path were empty in each row -_-"
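For reference, the usual workaround is to capture input_file_name() before the repartition, roughly like this (path and partition count are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("input-file-name-demo").getOrCreate()

df = (spark.read.parquet("/spark_storage/uasts_pga.v2/00")  # assumed path
      .withColumn("parquet_path", F.input_file_name()))     # capture first

# Repartitioning afterwards keeps the already-materialized column intact.
df = df.repartition(200)
df.select("parquet_path").show(5, truncate=False)
```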

So I launched the test once again, will post results once I have them.

r0mainK commented 5 years ago

Okay, the job finished in 5h30 (the repartition really was a dumb idea, removing it halved the processing time). I checked the CSV file and this time it's good. It is slightly bigger - especially in terms of # files, as could be expected from the size of the old corrupted files:

vmarkovtsev commented 5 years ago

I have created https://github.com/src-d/datasets/pull/158 to list the siva HEADs.

I launched the listing with 32 goroutines on the ML cluster, it digested 17% in 18 hours. ETA 4 days. I will have to interrupt it on Wednesday though.

vmarkovtsev commented 5 years ago

I parsed the OOMed sivas. I was able to process 204/211 files. The results are merged with the main dataset.

@r0mainK it is time to run the check again!

Regarding the listing, it is 80% complete. ETA Friday.

r0mainK commented 5 years ago

Awesome, I've relaunched the process with the same config, let's see how it goes - I expect it to be done by tomorrow, unless something goes horribly wrong :crossed_fingers:

r0mainK commented 5 years ago

@vmarkovtsev extraction completed! It ran in 6 hours 8 mins, so a bit more than last time. Unfortunately, it did not cover 100% of the files: there are currently 835 siva/parquet files from the old index that are missing. I cross-checked, and it seems all the missing files were from the 00 subdirectory, which surprisingly contains a bit more than that number of files.

So I tried to read and count it, and it indeed caused an error. I inspected the directory, there are 3 new files:

-rw-rw-r--. 1 1004 1004  1.8G Aug  3 09:05 00824011103c689db12451a6f73f84b57a6d05e0.parquet
-rw-rw-r--. 1 1004 1004  3.0G Aug  3 08:22 0079cc5fa5b7d13fd201fbae276b01f7f27f8dc9.parquet
-rw-rw-r--. 1 1004 1004   17G Aug  3 02:31 0067e598fa2532b9a914984456d6bff752a0cfd3.parquet

I loaded each one individually and tried to collect them, and you guessed it: the first two did not cause any error, it was the third, 17GB one that did, and it caused the whole subdir to crash. So I moved that single file to /spark_storage/the_bad_siva/ and afterwards it worked. Anyway, for the sake of comparing true run times (and since we won't have the listing until Friday in any case) I'm gonna relaunch the whole process; it should be over by this evening - and I'll add final metrics here.

vmarkovtsev commented 5 years ago

It ran in 6 hours 8 mins

I am still listing files in PGA, so the FS performance was degraded.

I have renamed /spark_storage/the_bad_siva/ to /spark_storage/bad_uasts/.

I didn't fully get 835. You are saying that there are 835 siva files under 00 which are in the index but are not extracted, right?

r0mainK commented 5 years ago

I didn't fully get 835. You are saying that there are 835 siva files under 00 which are in the index but are not extracted, right?

@vmarkovtsev no, what I meant was that there were 835 files missing from the new index that were already present in the previous index. This was due to the fact that one corrupt file added to the 00 subdir made the job on that subdir fail completely, so the 835 files, plus the new ones, did not appear in the new index. But all of those files were extracted successfully; Spark just failed to process them due to that one bad siva.

vmarkovtsev commented 5 years ago

The file listing is at:

/spark_storage/files_pga.v2

However, the listing has 148230 files compared to 204069 uasts. Weird. I have to re-launch the listing on the missing files.

I renamed /spark_storage/uast_pga.v2 to /spark_storage/uasts_pga.v2

vmarkovtsev commented 5 years ago

@r0mainK The listing is finally over! 205546 files.

/spark_storage/files_pga.v2

The structure is flat, there are no subdirs.

vmarkovtsev commented 5 years ago

I set the access for all dirs and subdirs in /spark_storage to 555. We've got

r0mainK commented 5 years ago

I have made a check on the processed index: the 14 subdir was not processed - the Spark jobs on it failed with a Java OOM error. I traced the origin of the problem back to a single file (14/147288108757caed09e0c65d9ec098b821129eba.parquet), which I added to the bad_uasts directory. Relaunching processing. Once it is done, I will finish up this task.

r0mainK commented 5 years ago

Okay, so I finished the extraction without any errors :100: I did find an issue with parquet file 61/614fa43723122e2a8318d65104991163b9915d72.parquet (it was empty), so I moved it to the bad_uasts folder.

As expected, the CSV file is a bit bigger (2.24 GB): it now contains 218,081 lines (UUID-siva/parquet file pairs) and a total of 36,109,756 files across all repos. This means that the added stragglers increased the file count by about 8.3 %. Also, it seems there is some duplication across PGA (most probably some files were processed twice under different UUIDs), as I found 35,991,897 distinct files over 218,023 distinct UUIDs.

I then extracted the list of files per UUID from the theoretical listings that Vadim provided for each repo, and then did the sanity check. Although Vadim had warned that the theoretical listing was incomplete, I still found more distinct UUIDs (219,610) and a total of 40,285,913 distinct files in it.
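For reference, a rough sketch of that comparison, treating both listings as (repo_uuid, file_path) DataFrames; the CSV paths and column names below are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("listing-sanity-check").getOrCreate()

processed = (spark.read.option("header", True)
             .csv("/user/r0maink/sanity/processed_listing.csv")   # illustrative
             .select("repo_uuid", "file_path").distinct())
theoretical = (spark.read.option("header", True)
               .csv("/spark_storage/files_pga.v2/*")              # illustrative
               .select("repo_uuid", "file_path").distinct())

# set arithmetic on the two listings
union = processed.union(theoretical).distinct().count()
inter = processed.intersect(theoretical).count()
only_processed = processed.subtract(theoretical).count()
only_theoretical = theoretical.subtract(processed).count()
print(union, inter, only_processed, only_theoretical)
```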

Anyways here are the results of the sanity check:

| | file count | % of union | uuid count | % of union |
|---|---|---|---|---|
| union of both listings | 40,603,063 | 100 % | 219,610 | 100 % |
| intersection of both listings | 35,674,747 | 87.86 % | 218,023 | 99.28 % |
| only in parquet listing | 317,150 | 0.78 % | 0 | 0 % |
| only in theoretical listing | 4,611,166 | 11.36 % | 1,587 | 0.72 % |

I also looked into more granular results:

| | 1st quartile | Mean | Median | 3rd quartile |
|---|---|---|---|---|
| % of extracted files per UUID | 80 % | 86 % | 92 % | 99 % |

As can be seen, although overall we extracted ~88% of files, the extraction rate varies a lot depending on the repo. As you can see on the scatter plot below, there seems to be a positive correlation between the number of files in a repo and the fraction that is extracted, but not much more. I suspect that if we looked at these rates per language we would find that most errors come from specific drivers.

(scatter plot: fraction of extracted files vs. number of files per repo)

vmarkovtsev commented 5 years ago

Awesome!

Is it possible to study per language, also to gather repo/paths of files which could not be extracted?

r0mainK commented 5 years ago

@vmarkovtsev I was about to edit my message above :p So yeah, that's on me actually: I think I had mentioned in meetings that it would be useful to have the language of each file in the parquet files, but I forgot to write it down in this issue. So currently, I could only do this using regexps on the filenames. I already had some experience doing it that way back for the apollo blogpost, when we had a similar albeit much smaller dataset, and it was pretty bad.

I think we should just rerun the listing and add this information, if possible? Getting the byte size of each file would be interesting as well, I think. If the processing is as efficient as the first time, we will miss less than 1% of files, and we can get that number further down with regexps. What do you think?

Also yes, I can create a CSV with the following schema if you want, using the CSVs I've created and the index: subdir,siva_hash,repo_uuid,repo_name,file_name

vmarkovtsev commented 5 years ago

OK, I will edit the code and re-launch the listing tomorrow.

vmarkovtsev commented 5 years ago

I launched the new listing.

vmarkovtsev commented 5 years ago

It is funny that we've got 317150 files only in the UASTs. I hope that this time a clean run will be flawless.

r0mainK commented 5 years ago

Yeah, it's surprising, especially as all of those files are in repos that were listed at least in part.

vmarkovtsev commented 5 years ago

Writing this while I remember. An important detail about how we should calculate the success rate: it must be calculated on the intersection of \<sivas listed> x \<uasts extracted>. In other words, we should ignore listed siva files which were not at least partially extracted.

The reason is that a failure to extract a siva is not necessarily due to Babelfish: the file can just be too big for the given hardware and algorithm.
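A tiny sketch of the two rates under that definition, with made-up sets of (siva, file) pairs:

```python
def rates(extracted: set, listed: set) -> tuple:
    """Extraction rate over everything vs. success rate over reachable sivas."""
    union = extracted | listed
    extraction_rate = len(extracted) / len(union)
    # only keep listed files whose siva was at least partially extracted
    touched_sivas = {siva for siva, _ in extracted}
    reachable = {(s, f) for s, f in union if s in touched_sivas}
    success_rate = len(extracted & reachable) / len(reachable)
    return extraction_rate, success_rate

ext = {("s1", "a.py"), ("s1", "b.py")}
lst = {("s1", "a.py"), ("s1", "b.py"), ("s1", "c.py"), ("s2", "d.py")}
print(rates(ext, lst))  # (0.5, 0.666...) -> s2 is ignored for the success rate
```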

r0mainK commented 5 years ago

Ah yeah, nice point I had not considered it :100:

I just recomputed the values, defining the real union with your definition, i.e. files in sivas where at least one file was extracted. I found 38,500,270 files, which is 94.82 % of the previous union. This means that:

vmarkovtsev commented 5 years ago

@r0mainK The new listing has finished, it is in the same place. 205283 files. As you see, this number is less than the previous one and I am going crazy to find out why. I need to find the missing names and re-launch on them.

vmarkovtsev commented 5 years ago

Update: I found the missing 801 files, listed them and put them in the output directory. Now the overall count is 206084, and we should not have weird files that are present only in the uasts.

Please run the stats per language!

r0mainK commented 5 years ago

@vmarkovtsev running the stats now, however 97,049 files are still missing from the listing. They come from 402 siva files, all of which were in the listing (save for those repos). No idea why they were not listed. Anyway, the exact list is here: missing.txt (format is subdir,siva_hash,repo_uuid,file_name)

I will update today with language stats

EDIT: okay, so this might actually be a bug in my reading of the CSV files (newlines in some of the filenames, just loving it). EDIT2: yep, it's the newlines. If there are still any missing I will tell you. EDIT3: OK, this is gonna take me a bit more time, will update after the retreat. Some people name their files with commas, newlines and quotation marks, which broke my CSV. I'm gonna recreate the listings taking that into account.
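For reference, a minimal sketch of writing the listing with proper quoting so that commas, quotes and newlines inside file paths survive a round trip (the example row and output path are made up):

```python
import csv

rows = [
    ("00", "somesivahash", "some-repo-uuid", 'weird,"name"\nwith newline.py'),
]

# QUOTE_ALL quotes every field, so embedded commas, quotes and newlines are safe
with open("listing_fixed.csv", "w", newline="") as fh:
    writer = csv.writer(fh, quoting=csv.QUOTE_ALL)
    writer.writerow(["subdir", "siva_hash", "repo_uuid", "file_name"])
    writer.writerows(rows)

# reading back with csv.reader (and newline="") round-trips such paths correctly
with open("listing_fixed.csv", newline="") as fh:
    assert list(csv.reader(fh))[1][3] == rows[0][3]
```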

r0mainK commented 5 years ago

OK, so this is the final report (hopefully). In the following, I'll be calling extraction rate the ratio of files extracted over all files, and success rate the ratio of files extracted over files in repositories that were at least partially extracted (i.e. where Babelfish errored, not something else). I'll also be calling theoretical the listing Vadim created by crawling PGA, and processed the listing I extracted from the Parquet files.

preliminary

First off, here are various counts of interest for both listings. As planned, I did not find files in the processed listing that were not in the theoretical listing; however, I did find that a small number of files were duplicated in both listings (i.e. had the same UUID and file path). I do not know why that was the case; I'm guessing it is something in PGA, or the way we crawled it.

| | # of sivas | # of repos | # of files | # of distinct files | % of duplicates |
|---|---|---|---|---|---|
| processed | 204,067 | 218,023 | 36,162,330 | 35,991,340 | 0.5 % |
| theoretical | 206,084 | 220,174 | 40,971,787 | 40,829,244 | 0.3 % |

In the following I'll be computing stats over the distinct files.

extraction and success rate

So overall, a few things to note:

| | extraction rate | success rate |
|---|---|---|
| sivas | 99.02 % | 100 % |
| repos | 99.02 % | 100 % |
| files | 88.15 % | 94.26 % |
| bytes | 65.37 % | 82.12 % |

language specific analysis

I ran the same analysis per language as you asked. As you can see, results are clearly unequal. Looking first at files, we can see 3 groups appear:

| | file count | file extraction rate | file success rate |
|---|---|---|---|
| Go | 4,126,578 | 99.88 % | 99.88 % |
| Python | 2,994,169 | 89.70 % | 91.93 % |
| C++ | 8,726,368 | 80.41 % | 86.66 % |
| C# | 2,379,754 | 98.99 % | 99.13 % |
| Java | 6,985,742 | 96.85 % | 98.69 % |
| JavaScript | 10,466,131 | 80.54 % | 97.14 % |
| Ruby | 1,143,654 | 96.70 % | 96.76 % |
| PHP | 2,888,395 | 87.64 % | 87.94 % |
| Shell | 1,118,453 | 87.54 % | 88.42 % |

Looking now at bytes, we see the same trend as before, i.e. both rates are lower since the larger files are the ones causing problems. However, there are still some things to note:

| | byte size | byte extraction rate | byte success rate |
|---|---|---|---|
| Go | 56.48 GB | 96.12 % | 96.13 % |
| Python | 22.84 GB | 84.36 % | 86.28 % |
| C++ | 22.84 GB | 63.69 % | 67.19 % |
| C# | 15.43 GB | 93.12 % | 93.32 % |
| Java | 42.19 GB | 95.26 % | 98.94 % |
| JavaScript | 227.68 GB | 50.09 % | 83.06 % |
| Ruby | 3.42 GB | 91.56 % | 91.72 % |
| PHP | 15.55 GB | 71.92 % | 72.10 % |
| Shell | 8.26 GB | 25.97 % | 26.26 % |

repo specific analysis

I plotted the same heatmap as in one of the posts above; it seems the strange distribution was due to errors in my code. This time, I found no correlation between the number of files in a given repository and the fraction of extracted files in that repository. As you can see from the histograms below, most repos were fully extracted (61 % of them), and the ratio of extracted files per repo actually follows an exponential law, further indicating that we're looking at sporadic events. I did not look per language; I'm guessing we'd find the same distribution but more or less pronounced depending on the driver.

(histograms: ratio of extracted files per repository)

conclusion

Anyway, I think I've covered more or less everything. If you want me to go into more detail, no problem. By the way, here is a zipped CSV with the subdir, siva, repo_uuid, file_path of the 2,191,222 files that apparently caused a Babelfish error (I removed newlines from the file paths).

vmarkovtsev commented 5 years ago

@r0mainK Is it possible to add the language to that CSV file with bblfsh errors?

repo_uuid is the repo name, correct?

r0mainK commented 5 years ago

@vmarkovtsev Added language and repository name. Left the PGA metadata in just in case; the schema is now subdir, siva, repo_uuid, repo_name, lang, file_path.

The zipped CSV can be found here