PGA deduplication

Title: PGA deduplication, or Apollo on PGA
Author(s): Romain
Short description: Running apollo on PGA
Categories: Big Code, ML, Dataset
Deadlines: I will write the post over the next 2 weeks. The datasets will stay in a PR on src-d/models until the post is released.

How (50 %)

Explain how we ran apollo on PGA, and detail the numerous problems we encountered and the solutions we found.

Dataset (25 %)

Describe the datasets we'll release with the blog post. There are 3 parts, all released as ASDF models: the bags of features (~60GB), the connected components obtained after hashing (~600M), and the detected communities (~480MB per community detection algorithm, 10 algorithms).

EDIT:

Also, I just noticed some problems with the hashing I did in my last weeks of work: only a third of the documents were processed, and I had not seen that the hashing was also unstable. So I am rerunning the hashing. I figured it would be interesting to use a similarity threshold of 0.95 like we planned, but also one of 0.80. In the DejaVu paper they used SourcererCC with the default 80% threshold on token similarity to consider two files similar, so I think it makes sense. This means there will be twice as many connected components models and detected communities models.

END EDIT
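To make the effect of the two thresholds concrete, here is a minimal sketch of how connected components change when the similarity cutoff drops from 0.95 to 0.80. It does not reproduce apollo's hashing: the function `connected_components`, the file indices, and the similarity values are all made up for illustration, and the pairwise similarities are assumed to be already computed.

```python
from collections import defaultdict


def connected_components(n_files, scored_pairs, threshold):
    """Group file indices into components.

    Two files are linked when their (precomputed, hypothetical)
    similarity is greater than or equal to `threshold`.
    """
    parent = list(range(n_files))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, similarity in scored_pairs:
        if similarity >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups = defaultdict(list)
    for i in range(n_files):
        groups[find(i)].append(i)
    return sorted(groups.values())


# Made-up similarities between 5 hypothetical files.
pairs = [(0, 1, 0.97), (1, 2, 0.85), (3, 4, 0.79)]

strict = connected_components(5, pairs, threshold=0.95)
loose = connected_components(5, pairs, threshold=0.80)
```

With these toy values, the 0.80 run links file 2 to the component of files 0 and 1, while the 0.95 run leaves it alone. Each threshold therefore yields its own set of components, which is why every downstream model has to be produced twice.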

Results (25 %)

Analyze the results using the DejaVu paper metrics and any others that seem sensible, without going into too much detail (there are too many components to do so: 110,766). Show the cooler plots of communities and analyze them, possibly across algorithms.
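Since there are too many components to inspect one by one, aggregate statistics are probably the way to go. The sketch below shows one possible summary over a list of components. It is an assumption about what could be reported, not the DejaVu paper's metrics, and `summarize_components` and the toy data are hypothetical.

```python
from collections import Counter


def summarize_components(components):
    """Return simple aggregate statistics for a list of components.

    Each component is a list of file identifiers.
    """
    sizes = [len(component) for component in components]
    non_singleton_sizes = [size for size in sizes if size > 1]
    return {
        "n_components": len(sizes),
        "n_files": sum(sizes),
        "largest": max(sizes) if sizes else 0,
        # Files that share a component with at least one other file.
        "files_in_groups": sum(non_singleton_sizes),
        # Maps component size to the number of components of that size.
        "size_histogram": dict(Counter(sizes)),
    }


# Toy input: 4 components over 7 hypothetical files.
toy = [[0, 1, 2], [3], [4], [5, 6]]
summary = summarize_components(toy)
```

The size histogram is also a natural input for a plot, and running the same summary on the output of each community detection algorithm would give a first cross-algorithm comparison.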

Management

This section will be filled by @campoy.

State: (proposed | writing | written | published)
Scheduled:
Link to post:

Social Media

Wording for tweet:
Hashtags:
Subreddits:

NOTE Please write in short lines so the review is easier to do.

src-d / blog

[PROPOSAL] PGA deduplication #237
