src-d / blog

source{d} blog
https://blog.sourced.tech/
GNU General Public License v3.0

[PROPOSAL] PGA deduplication #237

r0mainK closed 5 years ago

r0mainK commented 6 years ago

PGA deduplication

Table of contents

Explain how we ran apollo on PGA, and detail the numerous problems we encountered and the solutions we found.

Describe the datasets we'll release with the blog post. There are 3 parts, all released as asdf models: the bags of features (~60GB), the connected components obtained after hashing (~600M), and the detected communities (~480MB per community detection algorithm, for 10 algorithms).

EDIT:

Also, I just noticed some problems with the hashing I did in my last weeks of work: only a third of the documents were processed, and I hadn't seen that the hashing was also unstable -_-". So I am rerunning hashing. I figured it would be interesting to use a similarity threshold of 0.95 like we planned, but also one of 0.80. In the DejaVu paper they used SourcererCC with the default 80% threshold on token similarity to consider two files similar, so I think it makes sense. This means there will be twice as many connected components and detected communities models.

END EDIT
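To make the effect of the two thresholds concrete, here is a minimal sketch on toy data.
It does not reproduce apollo's hashing: it compares every pair of files exactly with Jaccard similarity on token sets, which is only feasible for a tiny input.
The file names, token sets and the helpers `jaccard` and `connected_components` are all hypothetical and only serve the illustration.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two token sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def connected_components(docs: dict, threshold: float) -> list:
    """Group files linked by a chain of pairs whose similarity >= threshold."""
    parent = {name: name for name in docs}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Exact all-pairs comparison: quadratic, fine for a toy example only.
    for a, b in combinations(docs, 2):
        if jaccard(docs[a], docs[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in docs:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical toy corpus: b.py is an exact copy of a.py,
# c.py is a near copy (18 shared tokens out of 21 distinct ones),
# d.py is unrelated.
base = {f"t{i}" for i in range(20)}
docs = {
    "a.py": set(base),
    "b.py": set(base),
    "c.py": (base - {"t18", "t19"}) | {"x0"},
    "d.py": {f"u{i}" for i in range(10)},
}

for threshold in (0.95, 0.80):
    print(threshold, connected_components(docs, threshold))
```

In this toy corpus the near copy `c.py` stays alone at 0.95 and joins the component of `a.py` and `b.py` at 0.80, so the lower threshold yields fewer, larger components.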

Analyze the results using the DejaVu paper metrics and any others that seem sensible, without going into too much detail (there are too many components to do so: 110,766). Show the cooler plots of communities and analyze them, possibly across algorithms.

Management

This section will be filled by @campoy.

Social Media

NOTE Please write in short lines so the review is easier to do.

vmarkovtsev commented 5 years ago

This is actually done: https://github.com/src-d/blog/pull/242