Closed: vmarkovtsev closed this issue 5 years ago.
All the UASTs have been pushed; the beanstalkd queue after 4 passes:
```
+---------+--------+---------+-------+----------+--------+---------+--------+
| Name    | Buried | Delayed | Ready | Reserved | Urgent | Waiting | Total  |
+---------+--------+---------+-------+----------+--------+---------+--------+
| default | 0      | 0       | 0     | 0        | 0      | 0       | 204069 |
+---------+--------+---------+-------+----------+--------+---------+--------+
```
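For reference, a minimal sketch of how such tube stats can be read programmatically, assuming a Python beanstalkd client such as greenstalk; the client library, host, and port are assumptions, not what was actually used for this run:

```python
# Minimal sketch: read the "default" tube stats with greenstalk (assumed
# client library; host/port are placeholders for the real beanstalkd instance).
import greenstalk

client = greenstalk.Client(("127.0.0.1", 11300))
stats = client.stats_tube("default")
# These fields mirror the beanstalkd stats-tube response shown in the table above.
for key in ("current-jobs-buried", "current-jobs-delayed", "current-jobs-ready",
            "current-jobs-reserved", "current-jobs-urgent", "total-jobs"):
    print(key, stats[key])
client.close()
```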
Alright! So how do you want to release it? I would propose adding a new subdirectory (extracted-uasts) to the PGA directory in datasets, with as much information as possible. WDYT?
We could add that dir, yes. Can you please consult with the infra team? I would name it just uast, similar to the existing siva.
Update: the pga cli has been refactored and updated; we can now download the UAST dataset using it.
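As a rough illustration only (not the authoritative usage), the download could be driven from Python by shelling out to the pga binary. Only the `list` and `get` subcommands are taken from the existing CLI; the `--dataset uast` selector and the `-o` output flag are hypothetical, and the path is a placeholder:

```python
# Sketch: drive the pga CLI from Python. "--dataset uast" and "-o" are assumed
# flags used purely for illustration; check the pga help for the real options.
import subprocess

def download_uast_dataset(output_dir: str) -> None:
    subprocess.run(["pga", "list"], check=True)  # browse what is available
    subprocess.run(["pga", "get", "--dataset", "uast", "-o", output_dir],
                   check=True)                   # hypothetical UAST selector

download_uast_dataset("/data/pga-uasts")
```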
Work left:
- PublicGitArchive/doc/v1: as all of these documents pertain to that version, add a README file to organize everything.
- Decide whether to create a new top-level directory (FlattenedPGA?) or add a subfolder to the PublicGitArchive directory.

EDIT:
- README.md files in datasets/PublicGitArchive were updated
- datasets/PublicGitArchive/doc was moved to a separate directory, and an index for the remaining docs was added
- README.md of the new PublicGitArchiveUASTs top-level directory
- PublicGitArchiveUASTs/Clickhouse
@r0mainK My plan is to create a PublicGitArchiveUASTs top-level directory with links to the two datasets. Also, link there from the regular PGA.
@vmarkovtsev okay, getting onto that then
I will write the manual on how to work with ClickHouse now.
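Until that manual exists, here is a minimal sketch of querying the dataset from Python with the clickhouse-driver package; the host, table, and column names are assumptions for illustration only, not the actual schema:

```python
# Sketch with clickhouse-driver; "uasts" and its columns are hypothetical
# names, not the published dataset's real schema.
from clickhouse_driver import Client

client = Client(host="localhost")
rows = client.execute(
    "SELECT repo, path FROM uasts WHERE lang = %(lang)s LIMIT 10",
    {"lang": "Go"},
)
for repo, path in rows:
    print(repo, path)
```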
@vmarkovtsev I updated the post with the checklist for the status update. I think everything here is done; can we close this issue?
Yes, this is all done.
We need to publish our UASTs dataset to src-d/datasets. There should be two formats:
- ClickHouse DB (400 GB): this requires some research; a simple FS tar should be enough, BUT the schema should be saved, too (see the schema-dump sketch below). Perhaps AlexAkulov/clickhouse-backup will work; we need to check whether it preserves the indexes.

We should include the reports from #74. There should be some sample code showing how to work with both formats.
Given the dataset size, we need to ask infra for the resources.
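On the schema point, one hedged way to make sure the table definitions survive alongside a plain filesystem tar is to dump each SHOW CREATE TABLE statement before archiving; the database name and host below are assumptions, and this is only a sketch, not a replacement for verifying clickhouse-backup:

```python
# Sketch: save every table's CREATE statement so a plain FS tar of the data
# directory can later be restored; "pga" as the database name and localhost
# are assumptions used for illustration.
from clickhouse_driver import Client

client = Client(host="localhost")
tables = client.execute("SHOW TABLES FROM pga")
with open("schema.sql", "w") as out:
    for (table,) in tables:
        (ddl,) = client.execute(f"SHOW CREATE TABLE pga.{table}")[0]
        out.write(ddl + ";\n\n")
```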