src-d / datasets

source{d} datasets ("big code") for source code analysis and machine learning on source code

[PGA] add pga-create repack command #79

Closed smola closed 6 years ago

smola commented 6 years ago

Add a pga-create repack command. It downloads the latest GHTorrent MySQL dump and repacks it on the fly, storing a local copy that omits the files pga-create does not need. This is particularly useful during development of pga-create: the repacked version can be processed in 10-20 minutes, while processing the original GHTorrent MySQL dump takes hours.

This PR depends on https://github.com/src-d/datasets/pull/78, so review only the last commit. Do not merge.
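
For readers unfamiliar with the idea, here is a minimal sketch of on-the-fly repacking. It is not the code from this PR. It assumes the dump is a gzipped tar archive, and the names `keepEntry`, `repack`, `demo`, and the file names (`projects.csv`, `commits.csv`, `users.csv`) are hypothetical and chosen only for illustration. The sketch streams entries from a reader, copies only the wanted files to a writer, and never holds the whole archive in memory.

```go
package main

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"io"
	"path"
)

// keepEntry reports whether a tar entry should survive the repack.
// wanted is keyed by base file name (hypothetical names, not pga-create's list).
func keepEntry(name string, wanted map[string]bool) bool {
	return wanted[path.Base(name)]
}

// repack streams a gzipped tar from src to dst, copying only wanted regular files.
func repack(dst io.Writer, src io.Reader, wanted map[string]bool) error {
	gr, err := gzip.NewReader(src)
	if err != nil {
		return err
	}
	defer gr.Close()
	gw := gzip.NewWriter(dst)
	tr, tw := tar.NewReader(gr), tar.NewWriter(gw)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			return err
		}
		if hdr.Typeflag != tar.TypeReg || !keepEntry(hdr.Name, wanted) {
			continue // the next call to tr.Next skips the unread body
		}
		if err := tw.WriteHeader(hdr); err != nil {
			return err
		}
		if _, err := io.Copy(tw, tr); err != nil {
			return err
		}
	}
	if err := tw.Close(); err != nil {
		return err
	}
	return gw.Close()
}

// demo builds a tiny in-memory archive, repacks it, and returns the names kept.
func demo() ([]string, error) {
	var in, out bytes.Buffer
	gw := gzip.NewWriter(&in)
	tw := tar.NewWriter(gw)
	for _, name := range []string{"dump/projects.csv", "dump/commits.csv", "dump/users.csv"} {
		body := []byte("id\n1\n")
		hdr := &tar.Header{Name: name, Mode: 0644, Size: int64(len(body)), Typeflag: tar.TypeReg}
		if err := tw.WriteHeader(hdr); err != nil {
			return nil, err
		}
		if _, err := tw.Write(body); err != nil {
			return nil, err
		}
	}
	if err := tw.Close(); err != nil {
		return nil, err
	}
	if err := gw.Close(); err != nil {
		return nil, err
	}
	if err := repack(&out, &in, map[string]bool{"projects.csv": true}); err != nil {
		return nil, err
	}
	gr, err := gzip.NewReader(&out)
	if err != nil {
		return nil, err
	}
	tr := tar.NewReader(gr)
	var names []string
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return names, nil
		}
		if err != nil {
			return nil, err
		}
		names = append(names, hdr.Name)
	}
}
```

In real use, `src` would presumably be the HTTP response body of the dump download and `dst` a local file, so the full dump never has to be stored on disk. `demo` only exercises the filter on a three-file in-memory archive.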

smola commented 6 years ago

Some of the duplication is solved in a later PR: https://github.com/src-d/datasets/pull/81/commits/fdd6e5cd0d49612a895018d87dcf56f7cd19f6ff