kosukeimai / wru

Who Are You? Bayesian Prediction of Racial Category Using Surname and Geolocation

Updated WRU - Different Numbers than Old Version #87

Closed ameier88 closed 5 months ago

ameier88 commented 1 year ago

Hello, I updated from WRU 0.1-12 to WRU 1.0.1 so I could update the probabilities in our CA voter files with 2020 census data. When I compare the new probabilities (WRU 1.0.1 and 2020 census data) to our old probabilities (WRU 0.1-12 and 2010 census data), the aggregated numbers are worryingly different. To make sure it wasn't just the new census numbers, I ran the same voter files using WRU 1.0.1 and 2010 census data; the results were also very different from those produced by WRU 0.1-12 with 2010 census data on the same file. For example, the number of Hispanic voters (calculated by summing probabilities at the county level) was more than 1 million higher with WRU 1.0.1 (2010 census data) than with WRU 0.1-12 (2010 census data).
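The county-level comparison described above amounts to summing per-voter probabilities within each county and differencing the totals between runs. A minimal sketch in Python, using made-up records and hypothetical field names (`county`, `pred_his`); it is not the script that was run on the voter files, and the numbers are purely illustrative:

```python
from collections import defaultdict


def expected_count_by_county(records, prob_field):
    """Sum one per-voter probability column within each county."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["county"]] += rec[prob_field]
    return dict(totals)


# Made-up per-voter Hispanic probabilities from two hypothetical runs
# over the same three voters (field names are assumptions).
old_run = [
    {"county": "Alameda", "pred_his": 0.10},
    {"county": "Alameda", "pred_his": 0.80},
    {"county": "Fresno", "pred_his": 0.50},
]
new_run = [
    {"county": "Alameda", "pred_his": 0.20},
    {"county": "Alameda", "pred_his": 0.90},
    {"county": "Fresno", "pred_his": 0.75},
]

old_totals = expected_count_by_county(old_run, "pred_his")
new_totals = expected_count_by_county(new_run, "pred_his")

# Change in the expected number of Hispanic voters per county.
diff = {county: new_totals[county] - old_totals[county] for county in old_totals}

for county in sorted(diff):
    print(
        f"{county}: old={old_totals[county]:.2f} "
        f"new={new_totals[county]:.2f} diff={diff[county]:+.2f}"
    )
```

Because each total is a sum over many voters, a small systematic shift in the per-voter probabilities can add up to a large change in the county totals.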

I also ran the predict_race test from your README file using your sample data set. Using WRU 0.1-12 (2010 census data), I got the exact same probability outputs as your screenshots. When using WRU 1.0.1 (2010 census data), however, the probability outputs were a little different. The test voter file appears to be the same with the exception of voterID 3 surname changing from Valesco to Rivera. See below:

WRU 0.1-12

[Screenshots: "WRU 0 1-12 _ScreenShot_SCRIPT_surname only = F", "WRU 0 1-12 _ScreenShot_surname only = F"]

WRU 1.0.1

[Screenshots: "WRU 0 1-12 _ScreenShot_SCRIPT_surname only = F", "WRU 1 1 _ScreenShot_surname only = F"]

Are these differences expected and maybe a change in methodology? I assumed both new and old versions would have similar, if not the same, outputs when using 2010 census data with the same voter file and the same settings. I appreciate any help. Also attaching my code used with my voter files for both WRU versions in case you catch something I missed.

WRU 0.1-12

[Screenshot: "WRU 0 1-12 _ScreenShot_AM CODE"]

WRU 1.0.1

[Screenshot: "WRU 1 1_ScreenShot_AM CODE"]

Also, thank you so much for this package!

1beb commented 1 year ago

We did make a change to the calculation of the probabilities. @etrrosenman @solivella @kosukeimai can comment here.

hirsch-sw commented 1 year ago

Hi @ameier88,

On a tangential topic, how did you and your team get the 2020 census data? From what I've been seeing, the sf1 file hasn't come out yet for 2020. I would definitely be interested in other leads/sources/ideas, though.

Also following to see the answer to your original question. My team has had problems with much larger swaths of unmatched names than in prior runs, and I would be interested to see whether this is in any way linked to your problem.

1beb commented 1 year ago

@hirsch-sw the race data has been available for a while: https://www.census.gov/programs-surveys/decennial-census/about/rdo/summary-files.html. This is what we use in the package.

hirsch-sw commented 1 year ago

> @hirsch-sw the race data has been available for a while: https://www.census.gov/programs-surveys/decennial-census/about/rdo/summary-files.html. This is what we use in the package.

Is it available with age and sex, though?

1beb commented 1 year ago

@hirsch-sw wru 1.0.0+ does not yet support any of the covariates (age, sex, party). Evan suggested that this might be the driving factor behind the differences that OP is showing.

With that said, age and gender by 2020 tract are also available from ACS. If you use tidycensus it's group B01001. https://api.census.gov/data/2020/acs/acs5/groups/B01001.json.

etrrosenman commented 1 year ago

@ameier88 having taken a deeper look into this, I am concerned that we may be failing to condition on voter party based on the structure of your query. As Brandon mentioned, there is some difficulty in the use of covariates right now because the Census has not released the age and sex distributions; while we tried to structure the newest version of WRU to account for this appropriately, this may be an edge case.

To diagnose the issue, would it be possible to rerun your query using the old version of WRU but without passing in the "party" parameter? If those predictions look very similar to the ones from the new version and with the party parameter provided, that will be very informative.

ameier88 commented 1 year ago

@etrrosenman Below are the results of rerunning the scripts with old and new wru, using 2010 census data, for the first 15 California counties of the same voter file. I aggregated by summing probabilities. As you can see, new wru is still very different from old wru with no covariates.

[Image: Compare WRU outputs]

Also attaching my scripts and first six voters to compare probability outputs (names removed and IDs cut off).

WRU 1.0.1 NO COVARIATES

[Screenshots: "WRU 1 1_censuspull", "WRU 1 0 1 _ScreenShot_SCRIPT_NoCovariates", "WRU 1 0 1 _ScreenShot_NoCovariates"]

WRU 0.1-12 NO COVARIATES

[Screenshots: "WRU 0 1-12 _ScreenShot_SCRIPT_NoCovariates", "WRU 0 1-12 _ScreenShot_NoCovariates"]

WRU 0.1-12 With Party Covariate

[Screenshots: "WRU 0 1-12 _ScreenShot_SCRIPT_PartyTrue", "WRU 0 1-12 _ScreenShot_PartyTrue"]

solivella commented 1 year ago

Hi @ameier88! We think this is related to how we handle imputation for surnames that do not appear in the census dictionary. Would you mind checking whether your numbers still differ as substantially from those of the previous version of wru when you restrict your voter file to records with last names that exist in the 2010 census dictionary? Alternatively, you can set impute.missing to FALSE.
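The suggested check, restricting the voter file to surnames found in the dictionary, can be sketched as below. This is an illustrative Python sketch only: the dictionary contents, field names, and the simple trim-and-uppercase normalization are all assumptions, and wru's own surname matching may clean names differently.

```python
def split_by_dictionary(records, dictionary):
    """Partition records by whether the normalized surname is in the dictionary."""
    matched, unmatched = [], []
    for rec in records:
        # Assumed normalization: trim whitespace and upper-case the surname.
        key = rec["surname"].strip().upper()
        if key in dictionary:
            matched.append(rec)
        else:
            unmatched.append(rec)
    return matched, unmatched


# Hypothetical stand-in for the census surname list (upper-case keys).
surname_dictionary = {"KHANNA", "IMAI", "RIVERA"}

# Made-up voter records; field names are assumptions.
voters = [
    {"voter_id": 1, "surname": "Khanna"},
    {"voter_id": 2, "surname": " imai "},
    {"voter_id": 3, "surname": "Zzyzx"},
]

matched, unmatched = split_by_dictionary(voters, surname_dictionary)

print("matched IDs:", [rec["voter_id"] for rec in matched])
print("unmatched share:", len(unmatched) / len(voters))
```

Comparing the two package versions on the matched subset only, and separately tracking the unmatched share, would help show how much of the discrepancy comes from names that have to be imputed.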

ameier88 commented 1 year ago

Hello @solivella! Thank you for the suggestion. I ran wru with several different argument settings. As you will see in the attached comparison, impute.missing = FALSE does make a difference, but the numbers using 2010 census data are still not close to the old numbers from WRU 0.1-12.

[Image: Comparing WRU Outputs Counties 1 through 15]

1beb commented 5 months ago

I'm going to close this one. There have been a large number of adjustments to the tables that were used and I think that we have addressed these issues. Notably, age and sex are now available in census data and now part of the most recent version of wru.