src-d / datasets

source{d} datasets ("big code") for source code analysis and machine learning on source code

Support the UAST dataset in `pga` tool #167

Closed r0mainK closed 4 years ago

r0mainK commented 4 years ago

For more info on context see this ml issue and this infra issue.

This PR only deals with the first dataset, i.e. the UASTs extracted from the HEAD of PGAv2 and stored as parquet. It does not yet contain any documentation apart from what is inside the code, and I did not put much effort into the comments, as I wasn't sure the logic would be kept. Anyway, here's a rundown of the changes:

commit 1: typos

commits 2 and 3:

Add uast_dataset.go and siva_dataset.go. Since both datasets consist of individual files, I wanted to keep the same commands for listing and downloading, so I had to abstract all the logic from IndexToCSV, RepositoryFromCSV and ToCSV, as well as the schemas. In order to do that I:

commit 4

Following the previous changes, I removed all the code I moved to siva_dataset, added parsers/formatters for floats with two-decimal precision as well as a csvColumn field, and created a mapping for datasets. I also modified the types, as Repository is now an interface and (I think) can't be used through a pointer in the same way.
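For reference, a minimal sketch of what a two-decimal float parser/formatter pair could look like. The names `formatFloat` and `parseFloat` are hypothetical and not taken from the PR; the sketch only relies on the standard `strconv` package.

```go
package main

import (
	"fmt"
	"strconv"
)

// formatFloat renders a value with exactly two digits after the decimal
// point, which is the shape a CSV cell would take.
// Hypothetical helper name, not the one used in the PR.
func formatFloat(v float64) string {
	return strconv.FormatFloat(v, 'f', 2, 64)
}

// parseFloat reads a CSV cell back into a float64.
// Hypothetical helper name, not the one used in the PR.
func parseFloat(s string) (float64, error) {
	return strconv.ParseFloat(s, 64)
}

func main() {
	cell := formatFloat(0.98765)
	v, err := parseFloat(cell)
	fmt.Println(cell, v, err)
}
```

Note that formatting rounds the value, so a format/parse round trip keeps only two decimals of the original number.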

commit 5

Now using getters in filters.go, and updated typing.
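To illustrate why getters are needed once Repository is an interface, here is a small sketch of a filter that only talks to the interface. The method names `GetURL` and `GetLanguages`, the `sivaRepository` struct and the `hasLanguage` filter are assumptions made for the example, not the actual code in filters.go.

```go
package main

import (
	"fmt"
	"strings"
)

// Repository is a hypothetical stand-in for the interface from the PR.
// Filters can only reach the data through methods, hence the getters.
type Repository interface {
	GetURL() string
	GetLanguages() []string
}

// sivaRepository is an illustrative concrete type behind the interface.
type sivaRepository struct {
	url   string
	langs []string
}

func (r *sivaRepository) GetURL() string         { return r.url }
func (r *sivaRepository) GetLanguages() []string { return r.langs }

// hasLanguage reports whether the repository lists the given language,
// ignoring case. It works for any dataset that implements Repository.
func hasLanguage(r Repository, lang string) bool {
	for _, l := range r.GetLanguages() {
		if strings.EqualFold(l, lang) {
			return true
		}
	}
	return false
}

func main() {
	var r Repository = &sivaRepository{
		url:   "https://example.com/a",
		langs: []string{"Go", "Python"},
	}
	fmt.Println(r.GetURL(), hasLanguage(r, "go"))
}
```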

commit 6

Added a handler for the datasetName arg now used in get/list.
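A rough sketch of how a dataset-name handler backed by a mapping might work. Everything here is hypothetical (the `Dataset` interface shape, the `datasets` map, `datasetFromName` and the accepted names `siva` and `uast`); it only shows the lookup-with-error pattern.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Dataset is a hypothetical, heavily reduced version of the interface.
type Dataset interface {
	Name() string
}

type sivaDataset struct{}

func (sivaDataset) Name() string { return "siva" }

type uastDataset struct{}

func (uastDataset) Name() string { return "uast" }

// datasets maps the command-line argument to an implementation.
// The keys are assumptions made for this example.
var datasets = map[string]Dataset{
	"siva": sivaDataset{},
	"uast": uastDataset{},
}

// datasetFromName resolves the datasetName argument, or returns an error
// listing the known names when the argument does not match any of them.
func datasetFromName(name string) (Dataset, error) {
	if ds, ok := datasets[strings.ToLower(name)]; ok {
		return ds, nil
	}
	known := make([]string, 0, len(datasets))
	for k := range datasets {
		known = append(known, k)
	}
	sort.Strings(known)
	return nil, fmt.Errorf("unknown dataset %q, expected one of: %s",
		name, strings.Join(known, ", "))
}

func main() {
	ds, err := datasetFromName("uast")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("selected dataset:", ds.Name())
}
```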

commit 7

Updated commands to the new structure, more of the same: getters, typing, now using Dataset and Dataset handler.

Note: I was thinking of adding a parquet command to dump a listing of the individual parquet files, as well as some additional filters, used only by the UAST dataset, related to the extraction rate. What do you think?

r0mainK commented 4 years ago

@vmarkovtsev this is rebased and modified according to the first part. To avoid some redundancy between siva.go and uast.go, I replaced the ForEach method with a ForEachRepository function that takes the dataset as input. This allowed me to refactor the Uast/SivaRepositoryFromCSV functions into a common RepositoryFromCSV interface method.

r0mainK commented 4 years ago

Just added a couple of commits: temporary files were not cleaned up when a command was cancelled. Furthermore, the change I made to stop updating the Bar when pga get hits an issue also prevented files from being cleaned up when multiple files were being downloaded.

r0mainK commented 4 years ago

@vmarkovtsev second split is done, all refactoring (the first 4 commits) is in this PR