Closed r0mainK closed 5 years ago
So, after asking LA, it seems that they have discussed it and developed some prototypes, but they have nothing operational. Hence, we will not be compressing the UASTs, and will rely purely on Parquet for compression. They might work on it this quarter, but we shouldn't count on it in the near future.
Once the Spark cluster is usable, PGAv2 is ready, and I have created the features for the imports task, I will take care of this.
PGAv2 has been copied to the cluster, and Spark and Gitbase are usable. Currently I'm cleaning up the /user/repositories directory to only have PGA in it; it should take until tomorrow at the current speed. I will then tackle this after the imports task, so I can assess the problems of dealing with so much data.
After talking to Vadim about progress on this, I decided to start doing it now. Going to test out different schemes for compression via Parquet, as well as see if I'm able to scale things with Spark and gitbase-spark-connector-e.
EDIT: OK, so the best scheme is no compression when writing the Parquet, then tar/gzip the resulting Parquet. It achieves a compression rate almost ten times better than when using the per-block gzip that Parquet uses.
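A plausible intuition for why whole-file gzip can beat per-block gzip: Parquet compresses each block independently, so redundancy *across* blocks is never exploited, while a single gzip stream over the concatenated data shares one compression context. A hypothetical illustration (synthetic data, not the actual PGA pipeline):

```python
import gzip

# Synthetic "parquet-like" blocks with heavy redundancy across blocks,
# which is typical of source code and UASTs.
blocks = [b"def f_%d(): pass\n" % i + b"# boilerplate header\n" * 40
          for i in range(200)]

# Scheme A: compress each block independently (parquet-style per-block gzip).
per_block = sum(len(gzip.compress(b)) for b in blocks)

# Scheme B: concatenate the uncompressed blocks, then gzip the whole thing
# (like writing uncompressed parquet and tar/gzipping the result).
whole = len(gzip.compress(b"".join(blocks)))

print(per_block, whole)  # scheme B is much smaller
```

The exact ratio depends on the data; the observed ~10x gain on the real UASTs suggests very high cross-block redundancy.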
Okay, so this answer from Miguel and the following discussions with Maartje made it quite clear that:
This means that:
So in order to do this task, here is the best plan I can come up with:
Add a `repository_name` column to the parquet file, as currently this information is not included and some repositories are split across multiple siva files.
Had a meeting with Alex, updated checklist accordingly.
Ask Vadim which kind of UAST should be retrieved (native, annotated, semantic)
Semantic.
We are meeting with Máximo tomorrow to discuss the problems; he has suggested abandoning gitbase in favor of a custom Go solution. Let's see.
I coded https://github.com/src-d/datasets/tree/master/PublicGitArchive/pga2uast to do the task. There are 3 stages:
We used the 7 nodes given by the infra, mining-5 to mining-11.
Size of the result from the first stage: 4.5TB. Number of OOMed sivas: 211.
Current progress of the second stage: 17/211 in 3 days. This means ETA is 34 days. However, I am using a single node. Once the DockerHub mining is over I will be able to spread the load over all the 11 nodes.
Just finished the sanity check on the Siva files that were parsed via aggressive parallelization. My workflow was the following:

1. Load the parquet files from each subdir (e.g. `00`): success
2. Compute the CSV of (subrepo, siva hash, repo uuid, list of files): failure

The job failed due to the presence of unreadable parquet files. The error was triggered if I read the specific parquet, or if I tried to count the rows of the DataFrame when loading from a subdir. So I went through each subdir, loading then counting the DataFrame, and if an error was caught I loaded each file in the subdir to find the corrupt ones. Once I identified all the files, I moved them to `/spark_storage/pga.v2.corrupted` and saved their names in `/user/r0maink/sanity/corrupt_pq_1.txt`.
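The narrowing-down workflow above can be sketched generically. This is not the actual job code; the `load` callable is a hypothetical stand-in for Spark's parquet reader, assumed to raise on corrupt input:

```python
def find_corrupt(subdirs, load):
    """Return the files that fail to load, narrowing down per subdir.

    `subdirs` maps a subdir name to its list of files; `load` raises on
    corrupt input, whether given a whole subdir or a single file
    (mirroring Spark failing to count a DataFrame read from a subdir).
    """
    corrupt = []
    for files in subdirs.values():
        try:
            load(files)            # cheap first pass: whole subdir at once
        except Exception:
            for f in files:        # bisect only the subdirs that failed
                try:
                    load([f])
                except Exception:
                    corrupt.append(f)
    return corrupt

# Toy loader: "bad" files raise, everything else loads fine.
def toy_load(files):
    for f in files:
        if "bad" in f:
            raise IOError(f)

print(find_corrupt({"00": ["a.parquet", "bad1.parquet"],
                    "2a": ["c.parquet"]}, toy_load))  # → ['bad1.parquet']
```

The first pass keeps the common case cheap: only subdirs that error get the per-file scan.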
Once that was done, I repeated the previous step, with the same result but due to a different error. This one was also due to corrupt parquet files, but simply loading/counting did not trigger the error; I had to actually try to use the contents, for instance by collecting the rows. So I repeated the error-finding process, and then moved the files to `/spark_storage/pga.v2.corrupted_2` and saved their names in `/user/r0maink/sanity/corrupt_pq_2.txt`.
Once that was done, I repeated the previous step, with the same result but due to a `JavaHeapSpace` error at about 20-25% of the progress. As I had not optimized the query, knew Infra was going to work on the cluster today, and did not require working on more than one subdir at a time to compute the CSV, I ran the job on each subdir independently. It just finished after ~12h (this does not really reflect the performance to be expected, as I did not try to optimize, group multiple subdirs, which would be possible, etc).
Anyway, the CSV weighs 2.02 GB and has 216,752 lines, ie the non-corrupt parquet files contain that many (siva file, repo uuid, file_list) triplets. The total number of files is 32,851,320. By the way, I removed files that had a null UAST, so there might be more in the parquet files, but simply with empty UASTs.
Here are some stats about the corrupt files (as you can see, they were all concentrated in the same 7 subdirectories). Given their number, I think it's worth trying to parse the Siva files once more to see if the error was due to the process or something else (you can take the listings on the ML cluster from the locations given above).
| | # files | # non-corrupt files | # corrupt 1 | # corrupt 2 |
|---|---|---|---|---|
| all subdirs | 203,870 | 203,736 (99.93 %) | 30 (0.01 %) | 104 (0.05 %) |
| subdir 28 | 825 | 804 (97.45 %) | 4 (0.48 %) | 17 (2.06 %) |
| subdir 2a | 806 | 781 (96.90 %) | 5 (0.62 %) | 20 (2.48 %) |
| subdir 2c | 828 | 804 (97.10 %) | 3 (0.36 %) | 21 (2.54 %) |
| subdir 2d | 847 | 818 (96.58 %) | 6 (0.71 %) | 23 (2.72 %) |
| subdir 2e | 810 | 777 (95.93 %) | 10 (1.23 %) | 23 (2.84 %) |
| subdir 2f | 850 | 849 (99.88 %) | 1 (0.12 %) | 0 (0.00 %) |
| subdir 65 | 783 | 782 (99.87 %) | 1 (0.13 %) | 0 (0.00 %) |
| | size | non-corrupt size | corrupt 1 size | corrupt 2 size |
|---|---|---|---|---|
| all subdirs | 4.861 TB | 4.811 TB (98.97 %) | 6.18 GB (0.13 %) | 43.98 GB (0.90 %) |
| subdir 28 | 16.16 GB | 11.19 GB (69.25 %) | 1.51 GB (9.37 %) | 3.45 GB (21.38 %) |
| subdir 2a | 15.76 GB | 10.87 GB (68.97 %) | 1.92 GB (12.17 %) | 2.97 GB (18.86 %) |
| subdir 2c | 24.62 GB | 11.30 GB (45.91 %) | 85 MB (0.35 %) | 13.23 GB (53.74 %) |
| subdir 2d | 23.51 GB | 10.36 GB (44.06 %) | 392.95 MB (1.67 %) | 12.76 GB (54.27 %) |
| subdir 2e | 19.66 GB | 6.54 GB (33.29 %) | 1.54 GB (7.86 %) | 11.57 GB (58.85 %) |
| subdir 2f | 17.93 GB | 17.20 GB (95.97 %) | 721 MB (4.03 %) | 0 B (0.00 %) |
| subdir 65 | 15.91 GB | 15.91 GB (99.99 %) | ~0 B (0.01 %) | 0 B (0.00 %) |
Great report @r0mainK
This means that I need to re-process a small fraction of files which are corrupted.
Thanks, yep the listings are in `/user/r0maink/sanity/corrupt_pq_1.txt` and `/user/r0maink/sanity/corrupt_pq_2.txt`. If you can put them in a separate directory under `/spark_storage/pg1.v2.v2` or something so I can process them directly, it would be great. As it's only ~50 GB it should not take too long, and hopefully the error will not repeat itself.
The new files were generated and overwritten over the corrupted ones. @r0mainK Please test once again, there shall be no corruptions this time.
I had to write them directly, unfortunately.
@vmarkovtsev no problem. Anyway, I did not know this, but when you call `repartition` on a `DataFrame`, it turns out you can't use the built-in `input_file_name`, so the 2 columns for the subdir and siva fp were empty in each row -_-"
So I launched the test once again, will post results once I have them.
Okay, the job finished in 5h30 (the `repartition` really was a dumb idea, removing it halved the processing time). I checked the CSV file, this time it's good. It is slightly bigger, especially in terms of # files, as could be expected from the size of the old corrupted files:
I have created https://github.com/src-d/datasets/pull/158 to list the siva HEADs.
I launched the listing with 32 goroutines on the ML cluster, it digested 17% in 18 hours. ETA 4 days. I will have to interrupt it on Wednesday though.
I parsed the OOMed sivas. I was able to process 204/211 files. The results are merged with the main dataset.
@r0mainK it is time to run the check again!
Regarding the listing, it is 80% complete. ETA Friday.
Awesome, I've relaunched the process with the same config, let's see how it goes - I expect it to be done by tomorrow, unless something goes horribly wrong :crossed_fingers:
@vmarkovtsev extraction completed! It ran in 6 hours 8 mins, so a bit more than last time. Unfortunately, it was not 100% of all files: there are currently 835 siva/parquet files missing from the old index. I cross-checked, and it seems all the missing files were from the `00` subdirectory, which contains a bit over that number of files, surprisingly.
So I tried to read and count it, and it indeed caused an error. I inspected the directory, there are 3 new files:
```
-rw-rw-r--. 1 1004 1004 1.8G Aug  3 09:05 00824011103c689db12451a6f73f84b57a6d05e0.parquet
-rw-rw-r--. 1 1004 1004 3.0G Aug  3 08:22 0079cc5fa5b7d13fd201fbae276b01f7f27f8dc9.parquet
-rw-rw-r--. 1 1004 1004  17G Aug  3 02:31 0067e598fa2532b9a914984456d6bff752a0cfd3.parquet
```
I loaded each one individually and tried to collect them, and you guessed it: the first 2 did not cause any error, it was the third, 17 GB one that did and caused the whole subdir to crash. So I moved that single file to `/spark_storage/the_bad_siva/` and afterwards it worked. Anyway, for the sake of comparing true run times (and since we won't have the listing until Friday in all cases) I'm gonna relaunch the whole process, it should be over by this evening, and I'll add final metrics here.
> It ran in 6 hours 8 mins
I am still listing files in PGA, so the FS performance was degraded.
I have renamed `/spark_storage/the_bad_siva/` to `/spark_storage/bad_uasts/`.
I didn't fully get the 835. You are saying that there are 835 siva files under `00` which are in the index but are not extracted, right?
> I didn't fully get 835. You are saying that there are 835 siva files under 00 which are in the index but are not extracted, right?
@vmarkovtsev no, what I meant was that there were 835 files missing from the new index that were already present in the previous index. This was due to the fact that there was one corrupt file added to the `00` subdir that made the job on that subdir fail completely, thus making the 835 files, plus the new ones, not appear in the new index. But all of those files were extracted successfully, Spark just failed to process them due to that one bad siva.
The file listing is at:
/spark_storage/files_pga.v2
However, the listing has 148230 files compared to 204069 uasts. Weird. I have to re-launch the listing on the missing files.
I renamed `/spark_storage/uast_pga.v2` to `/spark_storage/uasts_pga.v2`
@r0mainK The listing is finally over! 205546 files.
/spark_storage/files_pga.v2
The structure is flat, there are no subdirs.
I set the access for all dirs and subdirs in `/spark_storage` to `555`. We've got:

- `sivas_pga.v2` - the original PGA
- `uasts_pga.v2` - UASTs
- `files_pga.v2` - listing
- `bad_siva` - temporary, to be deleted

I have made a check on the processed index: the `14` subdir was not processed, the spark jobs on it failed with an OOM Java error. I traced back the origin of the problem to a single file (`14/147288108757caed09e0c65d9ec098b821129eba.parquet`) which I added to the `bad_uasts` directory. Relaunching processing. Once it is done, I will finish up this task.
Okay so I finished the extraction, without any errors :100: I did find there was an issue with parquet file `61/614fa43723122e2a8318d65104991163b9915d72.parquet` (it was empty), so I moved it to the `bad_uasts` folder.
As expected, the CSV file is a bit bigger (2.24 GB), and now contains 218,081 lines (UUID-Siva/Parquet file pairs), and a total of 36,109,756 files across all repos. This means the added stragglers increased the file count by about 8.3 %. Also, it seems there was some duplication across PGA (most probably some files were processed twice under different UUIDs), as I found 35,991,897 distinct files over 218,023 distinct UUIDs.
I then extracted the list of files per UUID from the theoretical listings for each repo that Vadim provided, and then did the sanity check. Although Vadim had warned that the theoretical listing was incomplete, I still found more distinct UUIDs (219,610) and a total of 40,285,913 distinct files.
Anyways here are the results of the sanity check:
| | file count | % of union | uuid count | % of union |
|---|---|---|---|---|
| union of both listings | 40,603,063 | 100 % | 219,610 | 100 % |
| intersection of both listings | 35,674,747 | 87.86 % | 218,023 | 99.28 % |
| only in parquet listing | 317,150 | 0.78 % | 0 | 0 % |
| only in theoretical listing | 4,611,166 | 11.36 % | 1,587 | 0.72 % |
I also looked into more granular results:
| | 1st quartile | Mean | Median | 3rd quartile |
|---|---|---|---|---|
| % of extracted files per UUID | 80 % | 86 % | 92 % | 99 % |
As can be seen, although overall we extracted ~88% of files, the extraction rate varies a lot depending on the repo. As you can see on the scatter plot below, there seems to be a positive correlation between the number of files in the repo and the amount that are extracted, but not much more. I suspect if we looked at these rates per language we would probably find that most errors come from specific drivers.
Awesome!
Is it possible to study per language, also to gather repo/paths of files which could not be extracted?
@vmarkovtsev I was about to edit my message above :p So yeah, that's on me actually: I think I had mentioned in meetings that it would be useful to have the language of each file in the parquet files, but I forgot to write it down in this issue. So currently, I could only do this using regexps on the filenames. I already had some experience doing it this way back for the apollo blogpost, when we had a similar albeit much smaller dataset, and it was pretty bad.
I think we should just rerun the listing and add this information, if it is possible? Getting the byte size of each file would be interesting as well, I think. If the processing is as efficient as the first time, we will miss less than 1% of files, and we can get that number further down with regexps. What do you think?
Also yes, I can create a CSV with the following schema if you want, using the CSVs I've created and the index: subdir,siva_hash,repo_uuid,repo_name,file_name
OK, I will edit the code and re-launch the listing tomorrow.
I launched the new listing.
It is funny that we've got 317150 files only in the UASTs. I hope that this time a clean run will be flawless.
Yeah, it's surprising, especially as all of those files are in repos that were listed at least in part.
Writing this while I remember. An important detail about how we should calculate the success rate: it must be calculated on the intersection of \<sivas listed> x \<uasts extracted>. In other words, we should ignore listed siva files which were not at least partially extracted.
The reason is that a failure to extract a siva is not necessarily due to Babelfish: the file can be just too big for the given hardware and algorithm.
Ah yeah, nice point I had not considered it :100:
I just recomputed the values, by defining the real union with your definition, ie files in sivas where at least one file was extracted. I found 38,500,270 files, which is 94.82 % of the previous union. This means that:
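The two rates differ only in their denominator; with made-up toy numbers (not the PGA figures), the distinction looks like this:

```python
# Hypothetical toy listings, not the real PGA data.
theoretical = {"s1": {"a", "b"}, "s2": {"c", "d"}, "s3": {"e"}}  # listed files per siva
extracted   = {"s1": {"a", "b"}, "s2": {"c"}}                    # s3 yielded nothing

all_listed = set().union(*theoretical.values())
got        = set().union(*extracted.values())

# Extraction rate: over everything listed.
extraction_rate = len(got) / len(all_listed)

# Success rate: ignore sivas with zero extracted files, since such failures
# may be due to the hardware/algorithm (file too big), not Babelfish.
denom = set().union(*(theoretical[s] for s in extracted))
success_rate = len(got) / len(denom)

print(extraction_rate, success_rate)  # → 0.6 0.75
```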
@r0mainK The new listing has finished, it is in the same place. 205283 files. As you see, this number is less than the previous one and I am going crazy to find out why. I need to find the missing names and re-launch on them.
Update: I found the missing 801 files, listed them and put to the output directory. Now the overall count is 206084 and we should not have weird files that are present only in the uasts.
Please run the stats per language!
@vmarkovtsev running the stats now, however 97,049 files are still missing from the listing. They come from 402 siva files, all of which were in the listing (save for the repos). No idea why they were not listed. Anyway, the exact list is here: missing.txt (format is subdir,siva_hash,repo_uuid,file_name)
I will update today with language stats
EDIT: okay, so this might actually be a bug in my reading of the CSV files (newlines in some of the filenames, just loving it). EDIT2: yep, it's the newlines. If there are still any missing I will tell you. EDIT3: OK, this is gonna take me a bit more time, will update after the retreat. Some people name their files with commas, newlines and quotation marks, which broke my CSV. I'm gonna recreate the listings taking that into account.
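For what it's worth, Python's `csv` module already round-trips such hostile names if you let it quote fields; a minimal sketch (the paths are made up):

```python
import csv
import io

rows = [
    ("00", "deadbeef", "uuid-1", 'evil,name\nwith "quotes".py'),  # hostile path
    ("2a", "cafebabe", "uuid-2", "normal/path.go"),
]

# Default QUOTE_MINIMAL quotes any field containing , \n or ",
# so no manual escaping is needed on write.
buf = io.StringIO()
csv.writer(buf).writerows(rows)

# csv.reader likewise handles embedded newlines inside quoted fields.
parsed = [tuple(r) for r in csv.reader(io.StringIO(buf.getvalue()))]
assert parsed == rows  # round-trips intact
```

Splitting lines on `\n` or `,` by hand is what breaks; parsing with the module keeps the quoting rules symmetric.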
OK so this is the final report (hopefully). In the following, I'll be calling the extraction rate the ratio of files extracted over all files, and the success rate the ratio of files extracted over files in repositories that were at least partially extracted (ie where Babelfish errored, not something else). I'll also be calling theoretical the listing Vadim created from crawling PGA, and processed the listing I extracted from the Parquet files.
First off, here are various counts of interest for both listings. As planned, I did not find files in the processed listing that were not in the theoretical listing; however, I did find that a small amount of files were duplicated in both listings (ie had the same UUID and file path). I do not know why that was the case, I'm guessing something in PGA, or the way we crawled it.
| | # of sivas | # of repos | # of files | # of distinct files | % of duplicates |
|---|---|---|---|---|---|
| processed | 204,067 | 218,023 | 36,162,330 | 35,991,340 | 0.5 % |
| theoretical | 206,084 | 220,174 | 40,971,787 | 40,829,244 | 0.3 % |
In the following I'll be computing stats over the distinct files.
So overall, a few things to note:
| | extraction rate | success rate |
|---|---|---|
| sivas | 99.02 % | 100 % |
| repos | 99.02 % | 100 % |
| files | 88.15 % | 94.26 % |
| bytes | 65.37 % | 82.12 % |
I ran the same analysis per language as you asked. As you can see, results are clearly unequal. Looking first at files, we can see 3 groups appear:
| | file count | file extraction rate | file success rate |
|---|---|---|---|
| Go | 4,126,578 | 99.88 % | 99.88 % |
| Python | 2,994,169 | 89.70 % | 91.93 % |
| C++ | 8,726,368 | 80.41 % | 86.66 % |
| C# | 2,379,754 | 98.99 % | 99.13 % |
| Java | 6,985,742 | 96.85 % | 98.69 % |
| JavaScript | 10,466,131 | 80.54 % | 97.14 % |
| Ruby | 1,143,654 | 96.70 % | 96.76 % |
| PHP | 2,888,395 | 87.64 % | 87.94 % |
| Shell | 1,118,453 | 87.54 % | 88.42 % |
Looking now at bytes, we see the same trend as before, ie both rates are lower as the larger files are the ones causing problems. However, there are still some things to note:
| | byte size | byte extraction rate | byte success rate |
|---|---|---|---|
| Go | 56.48 GB | 96.12 % | 96.13 % |
| Python | 22.84 GB | 84.36 % | 86.28 % |
| C++ | 22.84 GB | 63.69 % | 67.19 % |
| C# | 15.43 GB | 93.12 % | 93.32 % |
| Java | 42.19 GB | 95.26 % | 98.94 % |
| JavaScript | 227.68 GB | 50.09 % | 83.06 % |
| Ruby | 3.42 GB | 91.56 % | 91.72 % |
| PHP | 15.55 GB | 71.92 % | 72.10 % |
| Shell | 8.26 GB | 25.97 % | 26.26 % |
I plotted the same heatmap as in one of the posts above; it seems the strange distribution was due to errors in my code. This time, I found no correlation between the number of files in a given repository and the fraction of extracted files in that repository. As you can see from the histograms below, most repos were fully extracted (61 % of them), and the ratio of extracted files per repo actually follows an exponential law, further indicating that we're looking at sporadic events. I did not look per language; I'm guessing we'd find the same distribution but more/less pronounced depending on the driver.
Anyway, I think I've covered more or less everything. If you want me to go into more detail, no problem. By the way, here is a zipped CSV with the `subdir, siva, repo_uuid, file_path` of the 2,191,222 files that apparently caused a Babelfish error (I removed newlines from file paths).
@r0mainK Is it possible to add the language to that CSV file with bblfsh errors? `repo_uuid` is the repo name, correct?
Context
As soon as the Infra team has copied PGAv2 to the ML cluster, we will start using it often. Most of the time, we will be extracting UASTs from the HEAD, and then doing something with them. There is no reason to repeatedly query Gitbase to do this, so we should do it only once.
Task
Use GitBase to extract and store, for all parsable files of the HEAD, the UAST, repository name, file name and language. The storage format should be compatible with Spark so we can easily reuse it, hence it should probably be Parquet. The UASTs being relatively heavy, we should see if we can compress them further beforehand; check whether the LA team has any insight on this.
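A query along these lines could cover the extraction step. This is only a sketch: the table layout and the `language`/`uast` UDFs follow gitbase's documented SQL interface, but exact signatures should be verified against the gitbase docs before use:

```sql
-- Sketch: UAST, repo, path and language for every parsable HEAD file.
SELECT repository_id,
       file_path,
       language(file_path, blob_content) AS lang,
       uast(blob_content, language(file_path, blob_content)) AS uast
FROM refs
NATURAL JOIN commit_files
NATURAL JOIN files
WHERE ref_name = 'HEAD';
```

Files whose language has no driver, or where parsing fails, yield a NULL/empty UAST, which matches the null-UAST filtering mentioned above.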
Checklist