src-d / ml-backlog

Issues belonging to source{d}'s Machine Learning team which cannot be related to a specific repository.

Identity matching #69

Closed by EgorBu 5 years ago

EgorBu commented 5 years ago

As a developer/researcher who wants to analyze code bases, I want to be able to identify developers based on the information available from git history.

Identity matching is an important problem for almost any potential customer. Whenever we work with code bases from different companies, we will run into the same developer using different names or emails in their commits. We should be able to handle this situation properly, and there should be a Python module for it.

A short summary of the existing approaches can be found here.
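To make the problem concrete, here is a minimal sketch of one naive baseline: treat two commit signatures as the same developer if they share a normalized name or a normalized email, and merge them transitively with a union-find structure. The function name `match_identities` and the normalization rules are assumptions made for illustration only; this is not the approach adopted by the team.

```python
from collections import defaultdict


def match_identities(signatures):
    """Group (name, email) commit signatures that share a name or an email.

    Hypothetical helper for illustration; names and emails are compared
    case-insensitively, and names with whitespace collapsed.
    """
    parent = {}

    def find(x):
        # Return the representative of x's set, creating the set if needed.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Each signature links its normalized name to its normalized email.
    for name, email in signatures:
        name_key = ("name", " ".join(name.lower().split()))
        email_key = ("email", email.strip().lower())
        union(name_key, email_key)

    # Collect the original signatures per connected component.
    groups = defaultdict(set)
    for name, email in signatures:
        root = find(("email", email.strip().lower()))
        groups[root].add((name, email))
    return sorted(sorted(group) for group in groups.values())


signatures = [
    ("Jane Doe", "jane@example.com"),
    ("jane doe", "jdoe@corp.example"),
    ("J. Doe", "jane@example.com"),
    ("John Smith", "john@example.com"),
]
identities = match_identities(signatures)
```

With this input, the three Jane Doe signatures end up in one group and John Smith in another. A baseline like this over-merges on common names and shared emails (for example, bot or placeholder addresses), which is part of why the problem needs a dedicated solution.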

EgorBu commented 5 years ago

Pair programming issue: https://github.com/src-d/feature-idea/issues/144

vmarkovtsev commented 5 years ago

I started the dataset collection again on the ML cluster. If it does not get killed within the first hour, it will finish by next week.

vmarkovtsev commented 5 years ago

I am currently re-running the collection in a more robust way. The previous process stopped writing results for an unknown reason. The results are split into chunks, so it is possible to resume without losing progress.

/user/legacy/backup/ghtorrent
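The chunked, resumable collection described above could look roughly like the sketch below. This does not show the actual collection script: the function `collect_in_chunks`, the JSON chunk format, and the file naming are all assumptions chosen to show the idea of skipping work that has already been written.

```python
import json
from pathlib import Path


def collect_in_chunks(items, out_dir, chunk_size, process):
    """Write results chunk by chunk, skipping chunks that already exist.

    Hypothetical helper: returns the names of the chunk files written
    during this run.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for start in range(0, len(items), chunk_size):
        chunk_path = out_dir / f"chunk_{start // chunk_size:06d}.json"
        if chunk_path.exists():
            continue  # this chunk was finished in a previous run
        results = [process(item) for item in items[start:start + chunk_size]]
        # Write to a temporary file first, then rename it into place, so an
        # interrupted run never leaves a half-written chunk under the final name.
        tmp_path = chunk_path.with_name(chunk_path.name + ".tmp")
        tmp_path.write_text(json.dumps(results))
        tmp_path.replace(chunk_path)
        written.append(chunk_path.name)
    return written
```

Calling the function again with the same arguments after a crash only processes the chunks whose files are missing, so at most one chunk of work is lost.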

vmarkovtsev commented 5 years ago

This has moved under the scope of https://github.com/src-d/eee-identity-matching