theonaunheim / surgeo

Open Source Proxy Demographic module written in Python
MIT License

Reconciling prob_race_given_first_name_harvard and prob_first_name_given_race_harvard probabilities #22

Open whyme314 opened 11 months ago

whyme314 commented 11 months ago

Hi, I've been trying to work out how to convert between the two files named in the title. Can you confirm that the formulation used in your implementation, taking the AARON/white entry as an example, is

prob(first_name = AARON | race = WHITE) = prob(race = WHITE | first_name = AARON) / sum_i[prob(race = WHITE | first_name = i)]?

I was able to reproduce one file from the other using the above formula, and wanted to make sure this is consistent with what you did.
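For concreteness, the unweighted normalization described above can be sketched as follows. This is only an illustration: the names other than AARON and all probability values are made up, not taken from either file, and the function name `unweighted_name_given_race` is hypothetical rather than part of surgeo.

```python
# Toy values for prob(race = WHITE | first_name = i).
# These are invented for illustration, NOT read from the surgeo or Harvard files.
prob_white_given_name = {"AARON": 0.9, "MARIA": 0.3, "WEI": 0.05}


def unweighted_name_given_race(p_race_given_name):
    """Divide each prob(race | name) by the sum of prob(race | name) over all names."""
    total = sum(p_race_given_name.values())
    return {name: p / total for name, p in p_race_given_name.items()}


prob_name_given_white = unweighted_name_given_race(prob_white_given_name)
print(prob_name_given_white)
```

Because no counts enter the calculation, this treats every first name as equally common.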

I ask because the Harvard file provides the number of observations for each first name; I used those counts in my own calculations and arrived at different probabilities. In particular, my formulation is

obs(first_name = AARON) * prob(race = WHITE | first_name = AARON) / sum_i[obs(first_name = i) * prob(race = WHITE | first_name = i)].
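A minimal sketch of this count-weighted formulation, again on made-up numbers (the counts, the probabilities, the names other than AARON, and the function name `weighted_name_given_race` are all assumptions for illustration, not values or code from surgeo or the Harvard file):

```python
# Toy values, invented for illustration only.
prob_white_given_name = {"AARON": 0.9, "MARIA": 0.3, "WEI": 0.05}
obs = {"AARON": 1000, "MARIA": 4000, "WEI": 500}  # hypothetical observation counts


def weighted_name_given_race(p_race_given_name, counts):
    """Weight each prob(race | name) by obs(name), then normalize over all names."""
    weighted = {name: counts[name] * p for name, p in p_race_given_name.items()}
    total = sum(weighted.values())
    return {name: w / total for name, w in weighted.items()}


prob_name_given_white = weighted_name_given_race(prob_white_given_name, obs)
print(prob_name_given_white)
```

This is Bayes' rule with prob(first_name = i) taken proportional to obs(first_name = i); the total count cancels between numerator and denominator. Passing equal counts for every name gives back the unweighted result, so the two formulations differ only when name frequencies differ.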

Have you considered this alternative formulation that includes the observation count information? If the choice not to use the observation counts is deliberate, I would love to learn the rationale behind the decision.