src-d / identity-matching

source{d} extension to match Git signatures to real people.
GNU General Public License v3.0

The list of popular names is too large #57

Closed: warenlg closed this issue 5 years ago

warenlg commented 5 years ago

Currently when running match-identities, we use a pre-compiled list of popular names.

This list is very large (55,659 names) and includes names that are obviously not popular: emanuele caprioli, ludovic menthiller, thomas flahault, bryce cuthriell, etc.

It looks like hyperopt was run on a huge dataset that is not representative of the real use case, whereas the design document says that identity-matching targets a single organization with fewer than 10k developers and repositories. We therefore need to lower the threshold and recompile the list.
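To illustrate why the threshold drives the size of the list, here is a minimal sketch. It is not the project's actual code: the issue does not say how the threshold is defined, so this assumes it is a cap on frequency rank, and `compile_popular_names`, `rank_threshold`, and the sample signatures are all hypothetical.

```python
from collections import Counter


def compile_popular_names(names, rank_threshold):
    """Return the `rank_threshold` most frequent names, most frequent first.

    Assumption: the threshold is a cap on frequency rank. The real
    match-identities threshold may be defined differently.
    """
    # Normalize so that "John Smith" and "john smith " count as one name.
    counts = Counter(name.strip().lower() for name in names)
    return [name for name, _ in counts.most_common(rank_threshold)]


# Hypothetical signatures: two frequent names and one rare full name.
signatures = ["John Smith"] * 5 + ["admin"] * 3 + ["emanuele caprioli"]

loose = compile_popular_names(signatures, rank_threshold=3)
strict = compile_popular_names(signatures, rank_threshold=2)
```

With the looser threshold the rare full name ends up in the list; lowering the threshold drops it while keeping the names that really are frequent.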

warenlg commented 5 years ago

With the new threshold, the list of popular names shrank to 1025 entries, and it no longer includes those full names.

The identity tables also shrank by 5% to 60%, depending on the organization, which makes them more readable.

[Figure: identity_table_len]