src-d / ml-backlog

Issues belonging to source{d}'s Machine Learning team which cannot be related to a specific repository.

Publish the UASTs dataset #97

Closed vmarkovtsev closed 5 years ago

vmarkovtsev commented 5 years ago

We need to publish our UASTs dataset to src-d/datasets. There should be two formats:

Given the dataset size, we need to ask infra for the resources.

vmarkovtsev commented 5 years ago

All the UASTs have been pushed. The beanstalkd queue after 4 passes:

+---------+----------+----------+----------+----------+----------+----------+----------+
| Name    | Buried   | Delayed  | Ready    | Reserved | Urgent   | Waiting  | Total    |
+---------+----------+----------+----------+----------+----------+----------+----------+
| default | 0        | 0        | 0        | 0        | 0        | 0        | 204069   |
+---------+----------+----------+----------+----------+----------+----------+----------+
  1. 2-core, 4GB nodes, 2 file readers, 6 db pushers. Leaves 1.3k.
  2. 2-core, 4GB nodes, 1 file reader, 1 db pusher. Leaves 250.
  3. 2-core, 16GB nodes, 1 file reader, 1 db pusher, fixed DB error handling. Leaves 11.
  4. 4-core, 32GB nodes, 1 file reader, 1 db pusher. Leaves nothing :tada:
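
The pass-by-pass numbers above can be tallied with a short script. This is only an illustrative sketch: it assumes that each pass retried just the jobs left over by the previous one, that "1.3k" means roughly 1300, and that the Total column (204069) is the number of jobs attempted in the first pass. None of this is confirmed in the thread, and `pass_summary` is a hypothetical helper, not part of any tooling used here.

```python
# Figures quoted in the comment above. "1.3k" is taken as roughly 1300 (assumption).
TOTAL_JOBS = 204069                   # "Total" column of the beanstalkd table
LEFT_AFTER_PASS = [1300, 250, 11, 0]  # jobs left after passes 1 to 4


def pass_summary(total, left_after_pass):
    """Return (pass number, jobs attempted, jobs completed, jobs left) per pass.

    Assumes each pass only retries what the previous pass left behind.
    """
    rows = []
    attempted = total
    for number, left in enumerate(left_after_pass, start=1):
        rows.append((number, attempted, attempted - left, left))
        attempted = left  # the next pass only retries the leftovers
    return rows


if __name__ == "__main__":
    for number, attempted, done, left in pass_summary(TOTAL_JOBS, LEFT_AFTER_PASS):
        share = left / TOTAL_JOBS
        print(f"pass {number}: attempted {attempted}, completed {done}, "
              f"left {left} ({share:.4%} of the total)")
```

Under those assumptions, the first pass already handles the vast majority of the jobs, and the later passes with fewer workers and more memory per node only mop up a small remainder.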

r0mainK commented 5 years ago

Alright! So how do you want to release it? I would propose adding a new subdirectory (extracted-uasts) to the PGA directory in datasets, with as much information as possible:

WDYT?

vmarkovtsev commented 5 years ago

We could add that dir, yes. Can you please consult with infra? I would name it just uast, similar to the existing siva.

r0mainK commented 5 years ago

Update:

Work left

EDIT:

vmarkovtsev commented 5 years ago

@r0mainK My plan is to create a PublicGitArchiveUASTs top-level directory with links to the two datasets, and also to link to it from the regular PGA.

r0mainK commented 5 years ago

@vmarkovtsev okay, getting onto that then

vmarkovtsev commented 5 years ago

I will now write the manual on how to work with ClickHouse.

vmarkovtsev commented 5 years ago

https://github.com/src-d/datasets/pull/170

r0mainK commented 5 years ago

@vmarkovtsev I updated the post with the checklist for the status update. I think everything here is done; can we close this issue?

vmarkovtsev commented 5 years ago

Yes, this is all done.