src-d / identity-matching

source{d} extension to match Git signatures to real people.
GNU General Public License v3.0

Bad precision and recall (~60%) on IBM and Intel open source stacks #58

Closed warenlg closed 5 years ago

warenlg commented 5 years ago

Following up on https://github.com/src-d/identity-matching/issues/17 and https://github.com/src-d/identity-matching/issues/30, where the performance of the identity merging algorithm was evaluated on 22 different open source stacks. We noticed particularly bad performance on two organizations, IBM and Intel, with ~60% precision and recall.

This needs to be investigated because nearly all other organizations are above 90% precision and recall, and we should be able to promise an acceptable score (at least 90%) on all organizations.
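For reference, here is one way such scores could be computed. The thread does not say how precision and recall were measured, so this is only a sketch that assumes a pairwise definition: every pair of signatures placed in the same identity is a prediction, and it counts as correct if the ground truth also puts the two together. The names `pairSet` and `PairwisePrecisionRecall` are hypothetical and not taken from the repository.

```go
package main

// pairSet returns the set of unordered index pairs (i, j), i < j, whose
// signatures carry the same identity label.
func pairSet(labels []int) map[[2]int]struct{} {
	pairs := make(map[[2]int]struct{})
	for i := 0; i < len(labels); i++ {
		for j := i + 1; j < len(labels); j++ {
			if labels[i] == labels[j] {
				pairs[[2]int{i, j}] = struct{}{}
			}
		}
	}
	return pairs
}

// PairwisePrecisionRecall compares predicted identity labels with
// ground-truth labels. Both slices are assumed to be indexed by the same
// signatures. This pairwise definition is an assumption, not necessarily
// the metric used in the issue.
func PairwisePrecisionRecall(predicted, truth []int) (precision, recall float64) {
	pred := pairSet(predicted)
	gold := pairSet(truth)

	// True positives: pairs merged by the algorithm that are also merged
	// in the ground truth.
	tp := 0
	for p := range pred {
		if _, ok := gold[p]; ok {
			tp++
		}
	}

	if len(pred) > 0 {
		precision = float64(tp) / float64(len(pred))
	}
	if len(gold) > 0 {
		recall = float64(tp) / float64(len(gold))
	}
	return precision, recall
}
```

With this definition, a wrong merge (false positive) lowers precision and a missed merge (false negative) lowers recall, so both error types discussed here show up in the scores.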

warenlg commented 5 years ago

It turns out the identity graphs of Intel and IBM were pretty big: 80k and 11k edges, respectively. Reducing the proportion of names treated as popular decreased the number of false positives and false negatives, since popular identities tend to be the ones with problems. By increasing the popularity threshold from 5 to 100, we improved precision and recall from ~62% to 94% for both organizations.

[Image: identity_prec_rec]
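To make the effect of the threshold concrete, here is a minimal sketch. It assumes that a name counts as popular once its occurrence count reaches the threshold, and that a popular name is not used on its own as a merge key. Both points are assumptions about the mechanism rather than the repository's actual code, and `Signature`, `PopularNames`, and `MergeKeys` are hypothetical names.

```go
package main

// Signature is a hypothetical (name, email) pair taken from a Git commit.
type Signature struct {
	Name  string
	Email string
}

// PopularNames returns the names whose occurrence count reaches threshold.
// Raising the threshold shrinks this set. (Assumed semantics.)
func PopularNames(counts map[string]int, threshold int) map[string]bool {
	popular := make(map[string]bool)
	for name, n := range counts {
		if n >= threshold {
			popular[name] = true
		}
	}
	return popular
}

// MergeKeys lists the keys on which a signature may be matched to others.
// The email is always a key; the name is a key only when it is not popular,
// so that common names do not glue unrelated people together.
func MergeKeys(sig Signature, popular map[string]bool) []string {
	keys := []string{"email:" + sig.Email}
	if !popular[sig.Name] {
		keys = append(keys, "name:"+sig.Name)
	}
	return keys
}
```

Under these assumptions, a name seen 40 times is popular at threshold 5 but not at threshold 100, so at 100 it becomes usable as a merge key again. That is the trade-off in play: fewer popular names means more merges are possible, which helps recall until names that really are ambiguous start slipping through.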

warenlg commented 5 years ago

We cannot increase the popularity threshold too much, though; otherwise we start losing precision at some point.

vmarkovtsev commented 5 years ago

Great data analysis, Waren :+1: Let's use your recommended threshold of 100 and update the CSVs/embedded Go code.

warenlg commented 5 years ago

Thanks, I just opened the PR this morning: https://github.com/src-d/identity-matching/pull/59 (now closed).