Closed warenlg closed 5 years ago
It turns out the identity graphs of intel and IBM were pretty big: 80k and 11k edges respectively. Reducing the proportion of popular names decreased the number of false positives and false negatives, as popular identities tend to be the ones with problems. That's why, by increasing the popularity threshold from 5 to 100, we improved our precision and recall from ~62% to 94% for both organizations.
We cannot increase the popularity threshold too much, though; otherwise we start losing precision at some point.
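To make the threshold effect concrete, here is a minimal sketch of how a popularity cutoff could work. It is an assumption that a name counts as "popular" when it appears with at least `threshold` distinct emails; the function `popular_names` and the sample data are hypothetical and are not taken from the identity-matching code.

```python
from collections import defaultdict


def popular_names(identities, threshold):
    """Return the names seen with at least `threshold` distinct emails.

    `identities` is an iterable of (name, email) pairs. The definition of
    popularity used here is an assumption made for illustration only.
    """
    emails_by_name = defaultdict(set)
    for name, email in identities:
        emails_by_name[name].add(email)
    return {
        name
        for name, emails in emails_by_name.items()
        if len(emails) >= threshold
    }


# Hypothetical sample: "admin" is shared by three different emails.
sample = [
    ("admin", "a@example.com"),
    ("admin", "b@example.com"),
    ("admin", "c@example.com"),
    ("jane", "jane@example.com"),
]

low = popular_names(sample, threshold=3)   # "admin" is flagged as popular
high = popular_names(sample, threshold=5)  # nothing is flagged
```

Under this assumption, raising the threshold shrinks the set of names treated as popular, so more names remain usable as merge evidence. That is consistent with the trade-off described above: past some point, the extra merges start to cost precision.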
Great data analysis, Waren :+1: Let's use your recommended threshold of 100 and update the CSVs/embedded Go code.
Thanks, I just opened the PR this morning: https://github.com/src-d/identity-matching/pull/59 Now closed.
Following up on https://github.com/src-d/identity-matching/issues/17 and https://github.com/src-d/identity-matching/issues/30, where the performance of the identity merging algorithm was evaluated on 22 different open source stacks, we noticed particularly bad performance on 2 organizations, IBM and intel, with ~60% precision and recall. This needs to be investigated because nearly all other organizations are above 90% precision and recall, and we should be able to promise an acceptable score (at least 90%) on all organizations.
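For reference, one common way to score identity merging is pairwise precision and recall over merged identities. The thread does not say how the scores were actually computed, so the sketch below is an assumption; `merged_pairs`, `precision_recall`, and the toy clusters are hypothetical.

```python
from itertools import combinations


def merged_pairs(clusters):
    """Return every unordered pair of items that share a cluster."""
    pairs = set()
    for cluster in clusters:
        # Sorting makes each pair canonical, e.g. ("a", "b") and never ("b", "a").
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def precision_recall(predicted, truth):
    """Pairwise precision and recall of predicted clusters against the truth."""
    pred = merged_pairs(predicted)
    true = merged_pairs(truth)
    true_positives = len(pred & true)
    # With no pairs on one side, treat the corresponding score as perfect.
    precision = true_positives / len(pred) if pred else 1.0
    recall = true_positives / len(true) if true else 1.0
    return precision, recall


# Toy example: the prediction wrongly merges "c" with "a" and "b",
# and fails to merge "c" with "d".
predicted = [{"a", "b", "c"}, {"d"}]
truth = [{"a", "b"}, {"c", "d"}]
precision, recall = precision_recall(predicted, truth)
```

Here a false positive is a pair merged by the algorithm but not in the ground truth, and a false negative is a ground-truth pair the algorithm left apart.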