mozilla / participation-metrics-org

Participation metrics planning repository

Identities not grouped properly #91

Open hmitsch opened 7 years ago

hmitsch commented 7 years ago

Dashboard: Git

Test condition

Expected result

Actual result

Mte90 is Daniele (see screenshot below). Looks like the identity grouping does not work?

Dashboard short link: https://analytics.mozilla.community:443/goto/e797e57067f7b0a8c3ef40b37f6f01c0

Mozillians.org entry

image
canasdiaz commented 7 years ago

The user used different email addresses, and we group identities by email and name. What we could do is be a bit more aggressive and also match on the username. This could lead to false positives, but it could fix our problems for now.

Profile:
    * Name: Mte90
    * E-Mail: mte90smanetta***@*****
    * Bot: No
    * Country: -

Identities:
  67f836dd4f966eecbd23cf26678fe5ffcf87d805      Mte90   mte90smanetta***@*****      -       git
  ea7120e9365339b8dbf6f00f4e9add92adfa0230      Mte90   mte90smanetta***@*****      Mte90   git

Enrollments:
  Community     1900-01-01 00:00:00     2100-01-01 00:00:00

What do you think, @hmitsch? I would say we are unlikely to get a lot of false positives, but you guys are using a lot of data sources, so I cannot know until we run the process.
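
To make the trade-off concrete, here is a minimal sketch of the two matching levels discussed above. It is not the actual matcher used by the tool: the `Identity` class, the `is_match` function, the `use_username` flag and the example addresses are all hypothetical, and it assumes that sharing either an email or a name is enough for a match.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Identity:
    """Hypothetical identity record: one row per (name, email, username)."""
    name: Optional[str]
    email: Optional[str]
    username: Optional[str]


def _same(x, y):
    # Two fields match only if both are present and equal ignoring case.
    return bool(x) and bool(y) and x.strip().lower() == y.strip().lower()


def is_match(a, b, use_username=False):
    """Match on email or name; optionally fall back to the username."""
    if _same(a.email, b.email) or _same(a.name, b.name):
        return True
    # The more aggressive step: accept a shared username as evidence.
    return use_username and _same(a.username, b.username)


# Hypothetical identities: same person, different email and display name.
old = Identity(name="Daniele", email="old-alias@example.com", username="Mte90")
new = Identity(name="Mte90", email="new-alias@example.com", username="Mte90")

print(is_match(old, new))                     # email/name matching only
print(is_match(old, new, use_username=True))  # username matching enabled
```

With email and name alone the two records stay separate; enabling the username comparison merges them, which is also exactly where false positives can come from.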

hmitsch commented 7 years ago

Ok, I think I understand. Let's review the artifacts ..

@Mte90 has the following entries in his Mozillians.org profile:

image image

By using mte90smanetta***@***** on his @Mte90 GitHub account without listing that email address in Mozillians.org, he confused our identity building?

If the above is right, I do suggest being more aggressive in the heuristics. In the worst case this will lead to an under-reporting of contributors. I'd rather have that problem than be called out for reporting numbers that are too big.

canasdiaz commented 7 years ago

The unify run finished. Mte90 was correctly unified.

~/affiliate$ docker exec -it mozilla_mordred_1 sortinghat -u root -p **** --host **** -d mozilla_sh unify --fast-matching -m username

Total unique identities processed: 781835
Total matches: 19571
Total unique identities after merging: 762264
canasdiaz commented 7 years ago

Some identities like 000ef07a6ab623404a53fd42d4de7cf782dd53fb grouped two hundred accounts under the umbrella of "Mauricio Navarro Miranda", so the activity of the other accounts is misrepresented. I see git accounts with names like:

This is happening because some common usernames lead the heuristic to mark different accounts as the same person. We have to:

Example:

For a given unified identity I found these three identities sharing a common email, which must be added to a blacklist. If not, the shared email starts pulling in other accounts through all the relationships created by the fourth column (the username).

  2ebebdfd54633983e1f9c9ec4d24cae53644a2cd      GitHub  noreply@github.com      victorporof     git
  32d6a4f1fc84a19d8901bfd2b09ee2e9f6aa34e3      GitHub  noreply@github.com      kumar303        git
  6bc0c90fcaf2e5deb5f61a36ef3aa6d324215eae      GitHub  noreply@github.com      jsantell        git
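
The chaining effect described above can be sketched with a small union-find. This is only an illustration, not the tool's implementation: the `unify` function, its `fields` and `blacklist` parameters, the shortened ids and the two `hyp-*` rows (with their example.com addresses) are hypothetical. Only the three `noreply@github.com` rows come from the listing above.

```python
from collections import defaultdict


def unify(identities, fields=("email", "username"), blacklist=frozenset()):
    """Group identities that share any non-blacklisted field value.

    Merging is transitive: if A matches B and B matches C, all three
    end up in the same group even when A and C share nothing.
    """
    parent = list(range(len(identities)))

    def find(i):
        # Follow parent links to the group root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # (field, value) -> index of first identity holding it
    for idx, ident in enumerate(identities):
        for field in fields:
            value = ident.get(field)
            if not value or value.lower() in blacklist:
                continue  # missing or blacklisted values never link anything
            key = (field, value.lower())
            if key in first_seen:
                parent[find(idx)] = find(first_seen[key])
            else:
                first_seen[key] = idx

    groups = defaultdict(list)
    for idx, ident in enumerate(identities):
        groups[find(idx)].append(ident["id"])
    return sorted(groups.values())


identities = [
    # The three rows from the listing above (ids shortened).
    {"id": "2ebebdfd", "email": "noreply@github.com", "username": "victorporof"},
    {"id": "32d6a4f1", "email": "noreply@github.com", "username": "kumar303"},
    {"id": "6bc0c90f", "email": "noreply@github.com", "username": "jsantell"},
    # Hypothetical personal identities reached through the username.
    {"id": "hyp-1", "email": "vp@example.com", "username": "victorporof"},
    {"id": "hyp-2", "email": "k@example.com", "username": "kumar303"},
]

merged = unify(identities)
separated = unify(identities, blacklist=frozenset({"noreply@github.com"}))
print(len(merged), len(separated))
```

Without the blacklist, the shared no-reply address ties the three accounts together and the usernames then drag the personal identities into the same group. With the address blacklisted, each username only links identities of the same account.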
canasdiaz commented 7 years ago

After taking a deep look at the data with my colleagues, I confirm we cannot use the usernames to unify accounts. We are getting a lot of wrong unifications. We are breaking up the most active ones to improve the data quality. Again, this is manual work. After this is done we'll refresh the index (4 more hours).

Mte90 commented 7 years ago

You are right. Over time I changed my email to better ones (short names) and to a better alias, and I had forgotten about this situation. Maybe the better way is to add all the various options on Mozillians and hide them from the profile, so they are shown only to the owner of the profile.