mozilla / participation-metrics-org

Participation metrics planning repository

Generate gender information for SH database #202

Closed canasdiaz closed 5 years ago

canasdiaz commented 5 years ago

The aim of this ticket is to calculate the gender for all the identities available in the SortingHat database.

Closing condition:

canasdiaz commented 5 years ago

Hi @havardl, the process to get gender information from identities has finished. I would still like to review the result, but in the meantime have a look at the info we have:

Database changed
MariaDB [mozilla_sh]> select count(*) AS total, gender from profiles GROUP BY gender;
+---------+--------+
| total   | gender |
+---------+--------+
| 1024331 | NULL   |
|   90513 | female |
|  420555 | male   |
+---------+--------+
3 rows in set (1.19 sec)

The check I'm going to perform is to make sure these 1M NULL fields are correct (they are probably not "real" names, so they could not be identified).

Reminder: not all the profiles stored in the database have activity in the data sources and time frame you are analyzing. We had more data sources analyzed until a couple of months ago.

canasdiaz commented 5 years ago

Dear @havardl, we've been checking the data and the result is correct. Data is ready. All the first names for the profiles with this pattern <first_name> <last_name> have been analyzed by the genderize.io API.

About 585K profiles match this pattern. Of those, 75K could not be identified and are still marked as NULL.
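
For readers who want to reproduce the idea, here is a minimal sketch of the flow described above: keep only profile names that look like <first_name> <last_name>, extract the first name, and pass it to a gender lookup. The regular expression, the function names (first_name, assign_gender) and the dictionary standing in for the lookup are all assumptions made for illustration. This is not the code that was actually run against the SortingHat database, and the genderize.io call is stubbed out rather than reproduced.

```python
import re

# Hypothetical pattern for "<first_name> <last_name>": exactly two tokens made
# of letters (hyphens and apostrophes allowed inside a token), separated by
# whitespace. The real matching rule used for the database is not stated.
_TOKEN = r"[^\W\d_]+(?:[-'][^\W\d_]+)*"
NAME_PATTERN = re.compile(rf"^({_TOKEN})\s+({_TOKEN})$")


def first_name(profile_name):
    """Return the first name if the profile name matches the pattern, else None."""
    if not profile_name:
        return None
    match = NAME_PATTERN.match(profile_name.strip())
    return match.group(1) if match else None


def assign_gender(profile_name, lookup):
    """Return 'male', 'female' or None for a profile name.

    `lookup` is any callable mapping a lower-cased first name to a gender
    string or None. In the thread this role is played by the genderize.io
    API; here it is a stand-in so the sketch runs offline.
    """
    name = first_name(profile_name)
    if name is None:
        # Name does not match the pattern: the gender stays NULL.
        return None
    return lookup(name.lower())


# Illustrative stand-in for the remote lookup (made-up sample data).
sample_lookup = {"jane": "female", "john": "male"}.get

for raw in ["Jane Doe", "John Smith", "jdoe42", "Xyzzy Plugh", None]:
    print(repr(raw), "->", assign_gender(raw, sample_lookup))
```

Names that do not match the pattern, and first names the lookup cannot resolve, both end up as None, which corresponds to NULL in the profiles table. The counts in this thread are consistent with that flow: 90513 + 420555 = 511068 profiles received a gender, which is roughly the 585K matching profiles minus the 75K that could not be identified.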

canasdiaz commented 5 years ago

This ticket is ready to be closed. Let me know whether you need anything else.

CC @hmitsch @havardl

canasdiaz commented 5 years ago

Can we close this ticket @hmitsch @havardl?

hmitsch commented 5 years ago

@havardl, your call. :-)

havardl commented 5 years ago

👍