src-d / datasets

source{d} datasets ("big code") for source code analysis and machine learning on source code

Include the refs belonging to each siva file to the PGA index #156

Open vmarkovtsev opened 5 years ago

vmarkovtsev commented 5 years ago

The current PGA index format does not make it possible to tell which references a given URL is stored under. For example, tensorflow/tensorflow belongs to 2 siva files: the first has two heads and the second has tens. I need the full collection of references in the second CSV column, e.g.

"c19e4a1b8c7f458fa4d6b0978e2a14ef8c2a2ff2.siva[refs/heads/<uuid>,refs/whatever/<uuid>],f41959ccb2d9d4c722fe8fc3351401d53bcf4900.siva[refs/heads/<uuid>,...]"
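A minimal sketch of how a consumer might parse a column value in the proposed format. The function name `parse_siva_refs` is hypothetical, and the sketch assumes that reference names contain neither commas nor square brackets; the final index format may differ.

```python
import re
from typing import Dict, List

# One entry looks like "<hash>.siva[ref1,ref2,...]" (assumed layout, taken from the example above).
SIVA_ENTRY = re.compile(r"([0-9a-f]+\.siva)\[([^\]]*)\]")


def parse_siva_refs(column: str) -> Dict[str, List[str]]:
    """Map each siva file name to the list of references recorded for it."""
    result: Dict[str, List[str]] = {}
    for match in SIVA_ENTRY.finditer(column):
        siva, refs = match.groups()
        # Drop empty items so "x.siva[]" yields an empty list.
        result[siva] = [ref for ref in refs.split(",") if ref]
    return result
```

Calling `parse_siva_refs` on the second CSV column of a row would then give one list of references per siva file, so the two-heads and tens-of-heads cases can be told apart.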

jfontan commented 5 years ago

The UUID of each repository in a siva file is stored as remote data. The name of the remote is the UUID, which you can use to filter repositories, and the remote's endpoint identifies the repository. References can be filtered with this regexp: .*\/<uuid>$
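The filtering step could look roughly like this. It is only a sketch: the helper name `refs_for_uuid` and the sample reference names are made up for illustration, and obtaining the reference list from a siva file is left out.

```python
import re
from typing import Iterable, List


def refs_for_uuid(refs: Iterable[str], uuid: str) -> List[str]:
    """Keep only the references whose last path component is the given UUID."""
    # Same pattern as .*\/<uuid>$, with the UUID escaped so it is matched literally.
    pattern = re.compile(r".*/" + re.escape(uuid) + r"$")
    return [ref for ref in refs if pattern.match(ref)]


# Hypothetical reference names and UUIDs, for illustration only.
sample_refs = [
    "refs/heads/master/0000-aaaa",
    "refs/heads/dev/0000-aaaa",
    "refs/heads/master/1111-bbbb",
]
matching = refs_for_uuid(sample_refs, "0000-aaaa")
```

Escaping the UUID is a defensive choice; it avoids accidental matches if the identifier ever contains regexp metacharacters.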

vmarkovtsev commented 5 years ago

I will collect the mapping and include it in the dataset because this is a common issue for the team.

vmarkovtsev commented 5 years ago

Done: heads.csv.gz

cc/ @r0mainK

vmarkovtsev commented 5 years ago

I need to update the dataset on Monday.