Open hmitsch opened 7 years ago
The user used different email addresses, and we group identities by email and name. What we could do is be a bit more aggressive and also match on the username. This could lead to false positives, but it could fix our problem now.
Profile:
* Name: Mte90
* E-Mail: mte90smanetta***@*****
* Bot: No
* Country: -
Identities:
67f836dd4f966eecbd23cf26678fe5ffcf87d805 Mte90 mte90smanetta***@***** - git
ea7120e9365339b8dbf6f00f4e9add92adfa0230 Mte90 mte90smanetta***@***** Mte90 git
Enrollments:
Community 1900-01-01 00:00:00 2100-01-01 00:00:00
What do you think @hmitsch? I would say we are unlikely to get a lot of false positives, but you guys are using a lot of data sources, so I cannot know until we run the process.
Ok, I think I understand. Let's review the artifacts.
@Mte90 has the following entries in his Mozillians.org profile:
By using mte90smanetta***@***** on his @Mte90 GitHub account without listing that email address in Mozillians.org, he confused our identity building?
If the above is right, I indeed suggest being more aggressive in the heuristics. In the worst case this will lead to an under-reporting of contributors. I'd rather have that problem than be called out for reporting inflated numbers.
The unify finished. Mte90 was correctly unified.
~/affiliate$ docker exec -it mozilla_mordred_1 sortinghat -u root -p **** --host **** -d mozilla_sh unify --fast-matching -m username
Total unique identities processed: 781835
Total matches: 19571
Total unique identities after merging: 762264
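As a quick sanity check on the summary above, the three totals are consistent with each match removing exactly one unique identity. The sketch below only restates the reported numbers; the variable names are illustrative and are not part of sortinghat.

```python
# Figures copied from the unify summary above.
processed = 781835
matches = 19571
after_merging = 762264

# Each match folds one identity into another, so the totals should differ by `matches`.
assert processed - matches == after_merging

# Share of identities that were merged away by username matching.
merged_share = matches / processed
print(f"{merged_share:.1%}")  # 2.5%
```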
Some identities, like 000ef07a6ab623404a53fd42d4de7cf782dd53fb, grouped two hundred accounts under the umbrella of "Mauricio Navarro Miranda", so the activity of the other accounts is misrepresented. I see git accounts with names like:
This is happening because some common usernames lead the heuristic to mark the accounts as the same person. We have to:
Example:
For a given unified identity I found these three identities sharing a common email, which must be added to a blacklist. Otherwise, that email starts pulling in other accounts through the relationships created by the fourth column (the username).
2ebebdfd54633983e1f9c9ec4d24cae53644a2cd GitHub noreply@github.com victorporof git
32d6a4f1fc84a19d8901bfd2b09ee2e9f6aa34e3 GitHub noreply@github.com kumar303 git
6bc0c90fcaf2e5deb5f61a36ef3aa6d324215eae GitHub noreply@github.com jsantell git
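To illustrate why a shared address such as noreply@github.com has to be blacklisted, here is a minimal sketch of transitive grouping by email and username. It is not SortingHat's implementation: `group_identities`, `EMAIL_BLACKLIST`, the shortened ids, and the `hypothetical-1` row are all assumptions made up for this example.

```python
from collections import defaultdict

# Assumed blacklist for this sketch; the address comes from the rows above.
EMAIL_BLACKLIST = {"noreply@github.com"}


def group_identities(identities, blacklist=frozenset()):
    """Group identity ids that share an email or a username, transitively."""
    parent = {uid: uid for uid, _, _ in identities}

    def find(x):
        # Follow parent links to the group root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}
    for uid, email, username in identities:
        keys = []
        if email and email not in blacklist:
            keys.append(("email", email))
        if username:
            keys.append(("username", username))
        for key in keys:
            if key in first_seen:
                # Same email or username seen before: merge the two groups.
                parent[find(uid)] = find(first_seen[key])
            else:
                first_seen[key] = uid

    groups = defaultdict(set)
    for uid in parent:
        groups[find(uid)].add(uid)
    return sorted(sorted(group) for group in groups.values())


# (id, email, username); ids are shortened, the last row is hypothetical.
identities = [
    ("2ebebdfd", "noreply@github.com", "victorporof"),
    ("32d6a4f1", "noreply@github.com", "kumar303"),
    ("6bc0c90f", "noreply@github.com", "jsantell"),
    ("hypothetical-1", "victor@example.org", "victorporof"),
]

without_blacklist = group_identities(identities)
with_blacklist = group_identities(identities, EMAIL_BLACKLIST)
print(len(without_blacklist))  # 1
print(len(with_blacklist))     # 3
```

Without the blacklist, the shared address collapses all four identities into one group, including the hypothetical account that is only linked through a username. With the blacklist, only the two identities sharing the username `victorporof` are merged.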
After taking a deep look at the data with my colleagues, I confirm we cannot use usernames to unify accounts. We are getting a lot of wrong unifications. We are breaking up the most active ones to improve the data quality. Again, this is manual work. After this is done we'll refresh the index (4 more hours).
You are right. Over time I changed my email to better ones (short names) and to a better alias, and I forgot about this situation. Maybe the better way is to add all the various options on Mozillians and hide them from the public profile, so that they are shown only to the owner of the profile.
Dashboard: Git
Test condition
repo_name:"https://github.com/mozilla/remo.git"
Expected result
Actual result
Mte90 is Daniele (see screenshot below). Looks like the identity grouping does not work?
Dashboard short link: https://analytics.mozilla.community:443/goto/e797e57067f7b0a8c3ef40b37f6f01c0
Mozillians.org entry