Increase quality of identities database

canasdiaz commented 5 years ago

:red_circle: This is high priority

Because of the matching algorithm we currently use to merge user identities, some people end up with more than one unified identity. @hmitsch has already collected some examples and numbers that should be improved.

Steps:

1. Remove data from decommissioned data sources if their name field can create issues with a more aggressive algorithm. Example: the names in IRC can be quite dirty, and matching by name with IRC would create unexpected results.
2. Get the current number of identities and unified identities.
3. Apply an aggressive algorithm.
4. Get the number of identities and unified identities again.
5. Update the following sets of users:
   - Reps bucket will be moved to Community
   - Sheriffs will be moved to Staff
6. Refresh the indexes so this information is updated in them.

To make sure we are improving the data, we will use a set of checkpoints. @hmitsch has already started gathering some of them. We (Bitergia) will collect and review them to be sure data quality has increased.

hmitsch commented 5 years ago

Thanks, this is great!

hmitsch commented 5 years ago

/cc @HerminaC

canasdiaz commented 5 years ago

A new SH database is published. To avoid side effects I've removed all the data sources that are not targeted by the report.

These are the identities we have:

> SELECT COUNT(*) as total, source FROM identities GROUP BY source ORDER BY total DESC;
+--------+-----------------+
| total  | source          |
+--------+-----------------+
| 409008 | bugzillarest    |
| 282170 | github          |
| 166744 | git             |
|  23398 | meetup          |
|  15753 | file:mozillians |
|  15104 | mozillians      |
|   9690 | discourse       |
|   7815 | stackexchange   |
|   6583 | file:march2019  |
|   3405 | file:mozilla    |
|   1229 | remo            |
|    767 | github-commit   |
|      6 | file:test1      |
+--------+-----------------+
13 rows in set (1.72 sec)

To be more aggressive with the matching algorithm, I looked at the matches per data source for identities with the same name. We need to exclude bugzillarest because it has too many matches for the same names. Example: there are many different people named John Smith, with different email addresses.
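A check along these lines could be sketched as follows. This is only an illustration, assuming identity rows are available as `(source, name, email)` tuples; the helper `ambiguous_names` and the sample rows are hypothetical and are not the query that was actually run against the SH database.

```python
from collections import defaultdict

def ambiguous_names(rows):
    """Return {source: {name: emails}} for names seen with more than one email."""
    emails = defaultdict(lambda: defaultdict(set))
    for source, name, email in rows:
        if name and email:
            # Normalize case and whitespace so "John Smith" == "john smith".
            emails[source][name.strip().lower()].add(email.strip().lower())
    return {
        source: {n: e for n, e in names.items() if len(e) > 1}
        for source, names in emails.items()
    }

# Hypothetical sample rows, for illustration only.
rows = [
    ("bugzillarest", "John Smith", "john@example.com"),
    ("bugzillarest", "John Smith", "jsmith@example.org"),
    ("bugzillarest", "john smith", "j.smith@example.net"),
    ("github", "Jane Doe", "jane@example.com"),
    ("github", "Jane Doe", "jane@example.com"),
]

report = ambiguous_names(rows)
for source, names in report.items():
    print(source, "names with several emails:", len(names))
```

A source where many names map to several email addresses is a poor candidate for name-only merging, which is the reasoning given above for leaving bugzillarest out.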

This is the result offered by the matching:

Total unique identities processed: 614894
Total matches: 45775
Total unique identities after merging: 569119
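To make the relationship between these three figures concrete, here is a minimal sketch of a name-based merge that skips an excluded source. It is an assumption-laden illustration, not the matching algorithm actually used: the function `merge_by_name`, the dictionary layout, and the sample identities are all hypothetical.

```python
def merge_by_name(identities, excluded_sources=frozenset({"bugzillarest"})):
    """Group identities that share a normalized name.

    Identities from excluded sources, or without a name, are never merged.
    Returns a list of groups, each group being a list of identity ids.
    """
    groups = {}
    singles = []
    for ident in identities:
        name = (ident.get("name") or "").strip().lower()
        if not name or ident["source"] in excluded_sources:
            singles.append([ident["id"]])
        else:
            groups.setdefault(name, []).append(ident["id"])
    return list(groups.values()) + singles

# Hypothetical sample identities, for illustration only.
identities = [
    {"id": 1, "source": "git", "name": "Jane Doe"},
    {"id": 2, "source": "github", "name": "jane doe"},
    {"id": 3, "source": "bugzillarest", "name": "John Smith"},
    {"id": 4, "source": "bugzillarest", "name": "John Smith"},
    {"id": 5, "source": "meetup", "name": "Ana"},
]

merged = merge_by_name(identities)
processed = len(identities)
after = len(merged)
matches = processed - after
print(processed, matches, after)

# The same relation holds for the figures reported in this comment.
assert 614894 - 45775 == 569119
```

In the sample, the two "Jane Doe" records collapse into one profile while the two bugzillarest "John Smith" records stay separate, so each match reduces the number of unique identities by one.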

This new database is named mozilla_sh_filtered_20190409. The indexes will be updated tonight, and tomorrow a manual review will be performed to improve the data quality for the most active members from the Bugzilla(rest) data source.

Last but not least, remember that our SH identities database is a superset of the identities we have in the indexes. Why? It holds all the identities collected since we started tracking the Mozilla repos; when some of those repos were removed, their identities were not removed from the SH database (and this new database is a copy of the original).

CC @hmitsch @HerminaC

hmitsch commented 5 years ago

Thanks for the update! Let's make sure to catch up tomorrow after the manual cleanup.

havardl commented 5 years ago

Hey @sanacl, thanks for this!

Just to be sure, we're currently running matches on identities through a copy of the SH MySQL database itself (which I guess is the superset). Will the changes in the new index also affect the superset, or should we go about this in a different way?

Looking forward to catching up after the manual review!

canasdiaz commented 5 years ago

@havardl You are right. The SH database is the superset.

I don't understand this question: "Will the changes in the new index also affect the superset, or should we go about this in a different way?". The updates made in the SH database will be propagated to the indexes. Is this what you asked?

canasdiaz commented 5 years ago

We have created a copy of the mozilla_sh database with improvements to the data. The name of this new database is mozilla_sh_filtered_20190409, and it should be the one used for the SOTRAR analysis. Find below a summary of the actions we performed:

- Identity records from decommissioned data sources (nntp, twitter, etc.) were removed.
- To be more aggressive with profile unification, the data was reviewed to find the best approach. The idea was to merge every identity record with the same name.
  - Looking at the number of people with the same name and different emails, we saw too many matches per name in bugzillarest. Merging by name there would create wrong profiles, because more than one person would often be unified into the same profile.
- The merge was executed for all data entries except the ones obtained from bugzillarest. This meant the data could still be improved easily by manually merging the identities of the top Bugzilla members (because they were not merged by our automatic method).
- A manual review was performed this morning using the indexes (which were created with the mozilla_sh database). Every contributor matching the condition below was reviewed, and the data in the SH database was modified whenever we found profiles to be merged or wrong profiles that should be unmerged (this also happens sometimes).
  - The condition was: top 20 most active members + Community org + last 2 years (for remo and meetup we reviewed the top 10).
- The following organizations were removed from the SH database:
  - Mozilla Reps
  - Functional Reps
  - Code Sheriff (people were moved to Staff, but some of them were already affiliated to Staff)
  - OSSN
- Some numbers:
  - Total number of unified identities in mozilla_sh: 1535421
  - Unified identities after removing the decommissioned data sources: 614894
  - Unified identities after merging by name: 569119
  - Unified identities after manual review: 568962
- Future work:
  - If we want to keep improving the affiliation data, we should resume the review for the top contributors in the GitHub Issues data set.
- To be done:
  - Get all the enriched indexes updated with this new SH database.
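For readers who want the effect of each step, the figures under "Some numbers" can be turned into per-step reductions. This is just arithmetic on the values quoted above; the step labels are paraphrased and the variable names are hypothetical.

```python
# Unified identity counts quoted in the summary above.
steps = [
    ("mozilla_sh (original)", 1535421),
    ("after removing decommissioned data sources", 614894),
    ("after merging by name", 569119),
    ("after manual review", 568962),
]

deltas = []
for (_, before), (label, after) in zip(steps, steps[1:]):
    removed = before - after
    deltas.append((label, removed, removed / before))
    print(f"{label}: {removed} fewer unified identities ({removed / before:.2%})")
```

The second step reproduces the "Total matches" value of 45775 reported earlier in the thread; the manual review accounts for a further 157 unified identities.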

CC @hmitsch @HerminaC @havardl

hmitsch commented 5 years ago

Thanks for the detailed summary, @sanacl. HIGHLY appreciated! The "some numbers" section shows the impact of this effort. Impressive!

CC @havardl for visibility.

canasdiaz commented 5 years ago

After the work done to improve the SH database with identities (see https://github.com/mozilla/participation-metrics-org/issues/204#issuecomment-481696965), the job to be done is to get new versions of the indexes. The table below sums up the status of this task:

| Index | Ready | % Mozilla Staff before | % Mozilla Staff now |
| --- | --- | --- | --- |
| bugzilla | No | 41% | ? |
| discourse | :heavy_check_mark: | 16% | 16% |
| git[1] | :heavy_check_mark: | 54% | 58% |
| git_areas_of_code | No | ? | ? |
| github_issues | :heavy_check_mark: | 48% | 51% |
| meetup | :heavy_check_mark: | 0% | 2% |
| remo-activities | :heavy_check_mark: | 14% | 14% |
| remo-events | :heavy_check_mark: | 12% | 12% |
| stackoverflow | :heavy_check_mark: | 2% | 2% |

[1] the repo https://github.com/mozilla/gecko.git is not included in the new version of the index. The affiliation numbers were calculated filtering it out.

The Bugzilla index will be ready by tomorrow morning. Hopefully the git_areas_of_code (aoc) too.

@havardl if we can start earlier with the ES + SQL queries needed for the enrichment, let me know. If not, we will start at 6 PM as agreed.

havardl commented 5 years ago

Hey @sanacl, thank you for the overview. Please go ahead with the ES + SQL queries now, as we are only waiting for the indexes to be updated.

canasdiaz commented 5 years ago

After the work done to improve the SH database with identities (see https://github.com/mozilla/participation-metrics-org/issues/204#issuecomment-481696965), the job to be done is to get new versions of the indexes. The table below sums up the status of this task:

| Index | Ready | % Mozilla Staff before | % Mozilla Staff now |
| --- | --- | --- | --- |
| bugzilla | :heavy_check_mark: | 41% | 42% |
| discourse | :heavy_check_mark: | 16% | 16% |
| git[1] | :heavy_check_mark: | 54% | 58% |
| git_areas_of_code | No | ? | ? |
| github_issues | :heavy_check_mark: | 48% | 51% |
| meetup | :heavy_check_mark: | 0% | 2% |
| remo-activities | :heavy_check_mark: | 14% | 14% |
| remo-events | :heavy_check_mark: | 12% | 12% |
| stackoverflow | :heavy_check_mark: | 2% | 2% |

[1] the repo https://github.com/mozilla/gecko.git is not included in the new version of the index. The affiliation numbers were calculated filtering it out.

The only missing index is the Git AOC. My proposal is to start generating it on Friday morning so it can be ready by Monday morning; this way you can start getting data from sotrar now. What do you think @havardl?

havardl commented 5 years ago

Agreed @sanacl, we'll manage without it until Monday.

canasdiaz commented 5 years ago

Info about the missing index Git AOC.

We are having more and more issues. The latest one is related to commits that touch more than 100K files (yep, a single commit with that huge number of modified files). Our developers have worked on a fix and it is being applied now. The updates will be resumed at 6 PM CEST.

CC @havardl

canasdiaz commented 5 years ago

Hi folks. We are very sorry for this delay, but we finally have the Git AOC index completed. Due to the huge number of files included in some commits (sometimes hundreds of thousands), we have a 0.34% error in the index, which means we are missing data for about 3 commits in every thousand. We think this error won't have a big impact on your study, but in any case please let us know what you think @havardl

Some numbers:

Git unique hashes: 2,078,576 https://sotrar.mozilla.community/goto/4a92cb79be17b7eed4183fd4acb569e1
Git AOC unique hashes: 2,071,312 https://sotrar.mozilla.community/goto/1458b9b9730731a7b8e3f1c6ffec5c74
Error = 0.34%
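For reference, the error figure can be re-derived from the two counts above. This is plain arithmetic on the quoted numbers; the variable names are hypothetical, and both counts are themselves approximate unique counts, so the result is an estimate.

```python
git_unique_hashes = 2_078_576  # Git index, as quoted above
aoc_unique_hashes = 2_071_312  # Git AOC index, as quoted above

# Commits present in the Git index but absent from the AOC index.
missing = git_unique_hashes - aoc_unique_hashes
error_ratio = missing / git_unique_hashes

print(f"missing commits: {missing}")
print(f"error: {error_ratio:.4%}")
print(f"missing per thousand commits: {error_ratio * 1000:.2f}")
```

The ratio lands just under 0.35%, which matches the quoted 0.34% when truncated to two decimals.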

If you are ok with this, this ticket is ready to be sent to Done.

alpgarcia commented 5 years ago

As a side note, ElasticSearch doesn't provide exact cardinality counts: https://www.elastic.co/guide/en/elasticsearch/reference/6.1/search-aggregations-metrics-cardinality-aggregation.html

As you can see, we are within the expected error range for the number of unique items we are trying to count. The actual error is unknown, because we will never get exactly the same result for both indexes, even if they have the same cardinality (in terms of items, the AOC index contains several million more).
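As an illustration of the kind of request involved, here is a sketch of a cardinality aggregation body built in Python. The field name `hash` and the helper `cardinality_query` are assumptions made for the example, not taken from the actual dashboards; `precision_threshold` is the documented knob that trades memory for accuracy, and counts well above it remain approximate.

```python
import json

def cardinality_query(field, precision_threshold=40000):
    """Build an Elasticsearch request body that counts unique values of `field`.

    The count is approximate. 40000 is the maximum supported
    precision_threshold according to the Elasticsearch documentation.
    """
    return {
        "size": 0,  # only the aggregation result is needed, not the hits
        "aggs": {
            "unique_values": {
                "cardinality": {
                    "field": field,
                    "precision_threshold": precision_threshold,
                }
            }
        },
    }

# "hash" is an assumed field name for the commit hash.
body = cardinality_query("hash")
print(json.dumps(body, indent=2))
```

With roughly two million unique hashes, both indexes are far above any allowed threshold, so a small difference between the two counts does not by itself prove that commits are missing.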

Cheers!

Alberto.

havardl commented 5 years ago

Confirming that we're okay with this @sanacl. Thanks for the information on cardinality counts @alpgarcia.

canasdiaz commented 5 years ago

Moving to Done and closing.

canasdiaz commented 5 years ago

This is done @hmitsch
