mozilla / participation-metrics-org

Participation metrics planning repository

Increase quality of identities database #204

Closed canasdiaz closed 5 years ago

canasdiaz commented 5 years ago

:red_circle: This is high priority

Because of the matching algorithm we currently use to merge user identities, some people end up with more than one unified identity. @hmitsch already has some examples and numbers showing what should be improved.

Steps:

To make sure we are improving the data, we will use a set of checkpoints. @hmitsch has started gathering some of them. We (Bitergia) will collect and review them to confirm that data quality has increased.

hmitsch commented 5 years ago

Thanks, this is great!

hmitsch commented 5 years ago

/cc @HerminaC

canasdiaz commented 5 years ago

A new SH database has been published. To avoid side effects, I've removed all the data sources that are not targeted by the report.

These are the identities we have:

> SELECT COUNT(*) as total, source FROM identities GROUP BY source ORDER BY total DESC;
+--------+-----------------+
| total  | source          |
+--------+-----------------+
| 409008 | bugzillarest    |
| 282170 | github          |
| 166744 | git             |
|  23398 | meetup          |
|  15753 | file:mozillians |
|  15104 | mozillians      |
|   9690 | discourse       |
|   7815 | stackexchange   |
|   6583 | file:march2019  |
|   3405 | file:mozilla    |
|   1229 | remo            |
|    767 | github-commit   |
|      6 | file:test1      |
+--------+-----------------+
13 rows in set (1.72 sec)

To be more aggressive with the matching algorithm, I looked at the matches per data source for identities that share the same name. We need to exclude bugzillarest from this matching because it produces too many matches for the same names. For example, there are many different people named John Smith, each with a different email address.
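A minimal sketch of why excluding one source changes the result of name-based matching. This is not the SH matching algorithm itself; the row layout `(identity_id, source, name, email)`, the sample data, and the function name `match_by_name` are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows: (identity_id, source, name, email).
IDENTITIES = [
    ("a1", "github", "Jane Doe", "jane@example.com"),
    ("a2", "git", "jane  doe", "jdoe@example.org"),
    ("b1", "bugzillarest", "John Smith", "john.smith@example.com"),
    ("b2", "bugzillarest", "John Smith", "jsmith@example.net"),
    ("b3", "github", "John Smith", "john@example.io"),
]


def match_by_name(identities, excluded_sources=frozenset()):
    """Group identity ids by normalized name, skipping excluded sources."""
    groups = defaultdict(list)
    for identity_id, source, name, _email in identities:
        if source in excluded_sources or not name:
            continue
        # Normalize case and whitespace so "jane  doe" equals "Jane Doe".
        key = " ".join(name.lower().split())
        groups[key].append(identity_id)
    # Only names shared by two or more identities are merge candidates.
    return {name: ids for name, ids in groups.items() if len(ids) > 1}


if __name__ == "__main__":
    print(match_by_name(IDENTITIES))
    print(match_by_name(IDENTITIES, excluded_sources={"bugzillarest"}))
```

With every source included, the three John Smith rows become one merge candidate even though their emails differ. With bugzillarest excluded, only the Jane Doe pair remains.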

This is the result offered by the matching:

This new database is named mozilla_sh_filtered_20190409. The indexes will be updated overnight, and tomorrow we will perform a manual review to improve the quality of the identities of the most active members from the Bugzilla (bugzillarest) data source.

Last but not least, remember that our SH identities database is a superset of the identities we have in the indexes. Why? It holds every identity collected since we started tracking the Mozilla repos. When some of those repos were removed, their identities were not removed from the SH database (and this database is a copy of the original).
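To make the superset relationship concrete, here is a small sketch using plain Python sets. The identifiers `u1`..`u4` and the function name `stale_identities` are hypothetical; in practice the two sets would come from the SH database and from the indexes.

```python
def stale_identities(sh_uuids, index_uuids):
    """Return identities that exist in the SH database but in no index."""
    return set(sh_uuids) - set(index_uuids)


if __name__ == "__main__":
    sh_db = {"u1", "u2", "u3", "u4"}   # everything ever collected
    indexes = {"u1", "u3"}             # only repos still being tracked
    print(indexes <= sh_db)            # the indexes are a subset of SH
    print(sorted(stale_identities(sh_db, indexes)))
```

Any count run directly against the SH database will therefore include identities that no longer appear in the indexes.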

CC @hmitsch @HerminaC

hmitsch commented 5 years ago

Thanks for the update! Let's make sure to catch up tomorrow after the manual cleanup.

havardl commented 5 years ago

Hey @sanacl, thanks for this!

Just to be sure, we're currently running matches on identities through a copy of the SH MySQL database itself (which I guess is the superset). Will the changes in the new index also affect the superset, or should we go about this in a different way?

Looking forward to catching up after the manual review!

canasdiaz commented 5 years ago

@havardl You are right. The SH database is the superset.

I don't understand this question: "Will the changes in the new index also effect the superset, or should we go about this in a different way?". The updates made in the SH database will be propagated to the indexes. Is this what you were asking?

canasdiaz commented 5 years ago

We have created a copy of the mozilla_sh database with improvements to the data. The name of this new database is mozilla_sh_filtered_20190409, and it should be the one used for the SOTRAR analysis. Find below a summary of the actions we performed:

CC @hmitsch @HerminaC @havardl

hmitsch commented 5 years ago

Thanks for the detailed summary, @sanacl. HIGHLY appreciated! The "some numbers" section shows the impact of this effort. Impressive!

CC @havardl for visibility.

canasdiaz commented 5 years ago

After the work done to improve the SH database with identities (see https://github.com/mozilla/participation-metrics-org/issues/204#issuecomment-481696965), the job to be done is to get new versions of the indexes. The table below sums up the status of this task:

| Index | Ready | % Mozilla Staff before | % Mozilla Staff now |
| --- | --- | --- | --- |
| bugzilla | No | 41% | ? |
| discourse | :heavy_check_mark: | 16% | 16% |
| git[1] | :heavy_check_mark: | 54% | 58% |
| git_areas_of_code | No | ? | ? |
| github_issues | :heavy_check_mark: | 48% | 51% |
| meetup | :heavy_check_mark: | 0% | 2% |
| remo-activities | :heavy_check_mark: | 14% | 14% |
| remo-events | :heavy_check_mark: | 12% | 12% |
| stackoverflow | :heavy_check_mark: | 2% | 2% |

[1] the repo https://github.com/mozilla/gecko.git is not included in the new version of the index. The affiliation numbers were calculated filtering it out.
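As an illustration of what "calculated filtering it out" means for the git row, here is a rough sketch that computes a staff percentage with and without one repository. The record fields `repo` and `org`, the sample data, and the choice to count per commit rather than per author are assumptions; the thread does not say how the percentages were actually computed.

```python
GECKO = "https://github.com/mozilla/gecko.git"


def staff_share(commits, excluded_repos=frozenset()):
    """Percentage of commits whose author is affiliated with Mozilla Staff."""
    kept = [c for c in commits if c["repo"] not in excluded_repos]
    if not kept:
        return 0.0
    staff = sum(1 for c in kept if c["org"] == "Mozilla Staff")
    return 100.0 * staff / len(kept)


if __name__ == "__main__":
    # Hypothetical commits, only to show the effect of the filter.
    commits = [
        {"repo": GECKO, "org": "Mozilla Staff"},
        {"repo": GECKO, "org": "Mozilla Staff"},
        {"repo": "https://example.org/other.git", "org": "Mozilla Staff"},
        {"repo": "https://example.org/other.git", "org": "Community"},
    ]
    print(staff_share(commits))
    print(staff_share(commits, excluded_repos={GECKO}))
```

Comparing "before" and "now" only makes sense when both figures use the same filter, which is why the footnote matters.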

The Bugzilla index will be ready by tomorrow morning. Hopefully the git_areas_of_code (aoc) too.

@havardl, let me know if we can start earlier with the ES + SQL queries needed for the enrichment. If not, we will start at 6 PM as agreed.

havardl commented 5 years ago

Hey @sanacl, thank you for the overview. Please go ahead with the ES + SQL queries now, as we are only waiting for the indexes to be updated.

canasdiaz commented 5 years ago

After the work done to improve the SH database with identities (see https://github.com/mozilla/participation-metrics-org/issues/204#issuecomment-481696965), the job to be done is to get new versions of the indexes. The table below sums up the status of this task:

| Index | Ready | % Mozilla Staff before | % Mozilla Staff now |
| --- | --- | --- | --- |
| bugzilla | :heavy_check_mark: | 41% | 42% |
| discourse | :heavy_check_mark: | 16% | 16% |
| git[1] | :heavy_check_mark: | 54% | 58% |
| git_areas_of_code | No | ? | ? |
| github_issues | :heavy_check_mark: | 48% | 51% |
| meetup | :heavy_check_mark: | 0% | 2% |
| remo-activities | :heavy_check_mark: | 14% | 14% |
| remo-events | :heavy_check_mark: | 12% | 12% |
| stackoverflow | :heavy_check_mark: | 2% | 2% |

[1] the repo https://github.com/mozilla/gecko.git is not included in the new version of the index. The affiliation numbers were calculated filtering it out.

The only missing index is the Git AOC. My proposal is to start generating it on Friday morning so it can be ready by Monday morning. That way you can start getting data from sotrar now. What do you think, @havardl?

havardl commented 5 years ago

Agreed @sanacl, we'll manage without it until Monday.

canasdiaz commented 5 years ago

Info about the missing index Git AOC.

We are running into more and more issues. The latest one is related to commits that touch more than 100K files (yep, a single commit modifying that many files). Our developers have worked on a fix and it is being applied now. The updates will resume at 6 PM CEST.

CC @havardl

canasdiaz commented 5 years ago

Hi folks. We are very sorry for this delay, but we finally have the Git AOC index completed. Due to the huge number of files included in some commits (sometimes hundreds of thousands), we have a 0.34% error in the index, which means we are missing data for roughly 3 commits out of every thousand. We think this error won't have a big impact on your study. In any case, please let us know what you think, @havardl.
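For reference, the conversion from the 0.34% figure to a per-thousand rate is simple arithmetic. The sketch below spells it out; the function names and the one-million-commit total are hypothetical and not taken from the index.

```python
def missing_per_thousand(error_pct):
    """Convert an error percentage into missing items per 1000."""
    return error_pct / 100 * 1000


def expected_missing(total_commits, error_pct):
    """Estimate how many commits lack data for a given index size."""
    return round(total_commits * error_pct / 100)


if __name__ == "__main__":
    # 0.34% works out to between 3 and 4 commits per thousand.
    print(missing_per_thousand(0.34))
    # Hypothetical index size, for scale only.
    print(expected_missing(1_000_000, 0.34))
```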

Some numbers:

If you are ok with this, this ticket is ready to be sent to Done.

alpgarcia commented 5 years ago

As a side note, Elasticsearch doesn't provide exact cardinality counts: https://www.elastic.co/guide/en/elasticsearch/reference/6.1/search-aggregations-metrics-cardinality-aggregation.html

As you can see, we are within the expected error range for the number of unique items we are trying to count. The actual error is unknown, because we will never get the same result from both indexes even if they have the same cardinality (in terms of items, aoc contains several million more).
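A sketch of the kind of request body the linked cardinality aggregation accepts, built as a plain Python dict so it can be inspected without a cluster. The field name `author_uuid`, the aggregation name `unique_values`, and the helper `cardinality_query` are assumptions for illustration. `precision_threshold` is the documented knob that trades memory for accuracy: it defaults to 3000 and is capped at 40000.

```python
import json

MAX_PRECISION_THRESHOLD = 40000  # documented upper bound


def cardinality_query(field, precision_threshold=3000):
    """Build a search body that returns an approximate distinct count."""
    if not 0 <= precision_threshold <= MAX_PRECISION_THRESHOLD:
        raise ValueError("precision_threshold must be between 0 and 40000")
    return {
        "size": 0,  # no hits needed, only the aggregation result
        "aggs": {
            "unique_values": {
                "cardinality": {
                    "field": field,
                    "precision_threshold": precision_threshold,
                }
            }
        },
    }


if __name__ == "__main__":
    # "author_uuid" is a hypothetical field name.
    print(json.dumps(cardinality_query("author_uuid", 40000), indent=2))
```

Even at the maximum threshold the count stays approximate once the number of distinct values goes well above it, which matches the small discrepancies described here.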

Cheers!

Alberto.

havardl commented 5 years ago

Confirming that we're okay with this @sanacl. Thanks for the information on cardinality counts @alpgarcia.

canasdiaz commented 5 years ago

Moving to Done and closing.

canasdiaz commented 5 years ago

This is done @hmitsch