Closed: vmarkovtsev closed this issue 5 years ago.
All the UASTs have been pushed; the beanstalkd queue after 4 passes:
```
+---------+--------+---------+-------+----------+--------+---------+--------+
| Name    | Buried | Delayed | Ready | Reserved | Urgent | Waiting | Total  |
+---------+--------+---------+-------+----------+--------+---------+--------+
| default | 0      | 0       | 0     | 0        | 0      | 0       | 204069 |
+---------+--------+---------+-------+----------+--------+---------+--------+
```
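For reference, a minimal sketch of how such tube stats can be read programmatically, assuming a Python beanstalkd client such as greenstalk; the client library, host, and port are assumptions, not what was actually used for this run:

```python
# Minimal sketch: read the "default" tube stats with greenstalk (assumed
# client library; host/port are placeholders for the real beanstalkd instance).
import greenstalk

client = greenstalk.Client(("127.0.0.1", 11300))
stats = client.stats_tube("default")
# These fields mirror the beanstalkd stats-tube response shown in the table above.
for key in ("current-jobs-buried", "current-jobs-delayed", "current-jobs-ready",
            "current-jobs-reserved", "current-jobs-urgent", "total-jobs"):
    print(key, stats[key])
client.close()
```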
Alright! So how do you want to release it? I would propose adding a new subdirectory (extracted-uasts) to the PGA directory in datasets, with as much information as possible. WDYT?
We could add that dir, yes. Can you please consult with the infra team? I would name it just uast, similar to the existing siva.
Update: the pga cli has been refactored and updated; we can now download the UAST dataset using it.
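As a rough illustration only (not the authoritative usage), the download could be driven from Python by shelling out to the pga binary. Only the `list` and `get` subcommands are taken from the existing CLI; the `--dataset uast` selector and the `-o` output flag are hypothetical, and the path is a placeholder:

```python
# Sketch: drive the pga CLI from Python. "--dataset uast" and "-o" are assumed
# flags used purely for illustration; check the pga help for the real options.
import subprocess

def download_uast_dataset(output_dir: str) -> None:
    subprocess.run(["pga", "list"], check=True)  # browse what is available
    subprocess.run(["pga", "get", "--dataset", "uast", "-o", output_dir],
                   check=True)                   # hypothetical UAST selector

download_uast_dataset("/data/pga-uasts")
```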
Work left:
- PublicGitArchive/doc/v1: as all of these documents pertain to that version, add a README file to organize everything.
- Decide whether to create a new top-level directory (FlattenedPGA?) or add a subfolder to the PublicGitArchive directory.

EDIT:
- README.md files in datasets/PublicGitArchive were updated
- datasets/PublicGitArchive/doc was moved to a separate directory, and an index for the remaining docs was added
- README.md of the new PublicGitArchiveUASTs top-level directory
- PublicGitArchiveUASTs/Clickhouse
@r0mainK My plan is to create a PublicGitArchiveUASTs top-level directory with links to the two datasets. Also, link there from the regular PGA.
@vmarkovtsev okay, getting onto that then
I will write the manual on how to work with ClickHouse now.
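Until that manual exists, here is a minimal sketch of querying the dataset from Python with the clickhouse-driver package; the host, table, and column names are assumptions for illustration only, not the actual schema:

```python
# Sketch with clickhouse-driver; "uasts" and its columns are hypothetical
# names, not the published dataset's real schema.
from clickhouse_driver import Client

client = Client(host="localhost")
rows = client.execute(
    "SELECT repo, path FROM uasts WHERE lang = %(lang)s LIMIT 10",
    {"lang": "Go"},
)
for repo, path in rows:
    print(repo, path)
```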
@vmarkovtsev I updated the post with the checklist for the status update. I think everything here is done; can we close this issue?
Yes, this is all done.
We need to publish our UASTs dataset to src-d/datasets. There should be two formats:
- ClickHouse DB (400 GB): this requires some research; a simple FS tar should be enough, BUT the schema should be saved, too (see the schema-dump sketch below). Perhaps AlexAkulov/clickhouse-backup will work; we need to check whether it preserves the indexes.

We should include the reports from #74. There should be some sample code showing how to work with both formats.
Given the dataset size, we need to ask infra for the resources.
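On the schema point, one hedged way to make sure the table definitions survive alongside a plain filesystem tar is to dump each SHOW CREATE TABLE statement before archiving; the database name and host below are assumptions, and this is only a sketch, not a replacement for verifying clickhouse-backup:

```python
# Sketch: save every table's CREATE statement so a plain FS tar of the data
# directory can later be restored; "pga" as the database name and localhost
# are assumptions used for illustration.
from clickhouse_driver import Client

client = Client(host="localhost")
tables = client.execute("SHOW TABLES FROM pga")
with open("schema.sql", "w") as out:
    for (table,) in tables:
        (ddl,) = client.execute(f"SHOW CREATE TABLE pga.{table}")[0]
        out.write(ddl + ";\n\n")
```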