kosukeimai / wru

Who Are You? Bayesian Prediction of Racial Category Using Surname and Geolocation
Using WRU offline when working behind a health system firewall #155

Closed elivings1 closed 1 month ago

elivings1 commented 3 months ago

Electronic medical records are generally missing race and ethnicity data for most patients in the system. Because of an interest in understanding disparities in health outcomes, imputing these data is of great interest. However, because of privacy issues, many health system computers sit behind firewalls that do not allow calls to GitHub or the Census Bureau. When trying to implement wru behind a firewall, I found that 2 modifications from normal usage were required.

1) Uploading the name files that wru automatically places in a local temp directory (wru-data-census-last_c, wru-data-first_c, wru-data-last_c and wru-data-mid_c) into the R working directory and issuing the command options("wru_data_wd" = TRUE).

2) Uploading census files using the statement get_census_data(state = c("CA"), age = FALSE, sex = FALSE, year=2020). I found that the program would not work if age or sex were TRUE.

I have high-quality data for 3,000 patients who filled out a detailed demographic survey with an emphasis on their race and ethnicity. Race and ethnicity (e.g., Hispanic or non-Hispanic) are reported separately. I coded anyone with Hispanic ethnicity as Hispanic, irrespective of White or Black race. Imputation at the census tract level yielded the following percentages of correctly classified patients: Asian 75%, Black 43%, Hispanic 67% and White 90%.
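For anyone trying to reproduce the two offline steps above, here is a rough sketch of how they might fit together. It is an illustration under assumptions, not a tested recipe: the file name `census_CA_2020.rds` and the `patients` data frame (with `surname`, `state`, `county` and `tract` columns) are hypothetical, and the argument names reflect my reading of the wru documentation, so check `?get_census_data` and `?predict_race` for your installed version.

```r
library(wru)

## Step A (on a machine WITH internet access): download the census
## data once and save it to a file that can be moved behind the firewall.
## "census_CA_2020.rds" is a hypothetical file name.
census_ca <- get_census_data(
  key = Sys.getenv("CENSUS_API_KEY"),
  states = "CA",
  age = FALSE, sex = FALSE,
  year = "2020",
  census.geo = "tract"
)
saveRDS(census_ca, "census_CA_2020.rds")

## Step B (behind the firewall): the wru name files have already been
## copied into the R working directory, so tell wru to look there.
options("wru_data_wd" = TRUE)

## Load the pre-downloaded census object instead of calling the API.
census_ca <- readRDS("census_CA_2020.rds")

## `patients` is a hypothetical data frame with surname, state,
## county and tract columns.
pred <- predict_race(
  voter.file = patients,
  census.geo = "tract",
  census.data = census_ca,
  age = FALSE, sex = FALSE,
  year = "2020"
)
```

Passing the saved object through `census.data` is what should avoid any call to the Census API at prediction time.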

My questions are: 1) Was I correct about what is needed to make wru work offline, especially regarding the inability to use age or sex data? 2) When working with self-reported race and ethnicity data, is coding race as Hispanic irrespective of Black or White status consistent with how the census handles Hispanic ethnicity? 3) What are the expectations for wru’s accuracy when imputing race from names and geocodes? If what we found is lower than expected, do you have any advice for how we can approach this differently?
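To make the coding rule and the accuracy figures concrete, here is a minimal base-R sketch of the kind of calculation involved. The data frame, its column names and its values are made up for illustration; in practice `predicted` would be derived from wru's output, for example by taking the most probable category for each patient.

```r
# Hypothetical survey data: column names and values are assumptions.
survey <- data.frame(
  race      = c("White", "Black", "Asian", "White", "Black", "Asian"),
  ethnicity = c("Hispanic", "Non-Hispanic", "Non-Hispanic",
                "Non-Hispanic", "Hispanic", "Non-Hispanic"),
  predicted = c("Hispanic", "Black", "Asian", "White", "Black", "White"),
  stringsAsFactors = FALSE
)

# Anyone reporting Hispanic ethnicity is coded Hispanic, regardless of race.
survey$truth <- ifelse(survey$ethnicity == "Hispanic", "Hispanic", survey$race)

# Share of each self-reported group whose imputed category matches.
accuracy_by_group <- tapply(survey$predicted == survey$truth, survey$truth, mean)
print(round(100 * accuracy_by_group))
```

Note that this is accuracy within each self-reported group (the share of true members correctly classified), which is how I read the percentages reported above.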

1beb commented 2 months ago
  1. There are a few closed issues on the repo that discuss this but you have it straight. Download the files, pop them into your working directory and you'll be good to go after setting the option.

  2. This is also how the census does it now. It is discussed here: https://www.census.gov/newsroom/blogs/random-samplings/2021/08/improvements-to-2020-census-race-hispanic-origin-question-designs.html

  3. My last test against a commercial voterfile had it at ~85% for VRA states (the places where we have the true values). You can use the probabilities to match your target distribution by sampling without replacement. This is non-trivial but useful if you know what your population should "look like" at the end.
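One possible reading of the "sampling without replacement" idea in point 3, sketched in base R. This is an assumption about what such a procedure could look like, not the method used in the test mentioned above: the probability matrix, the target counts and the function name `assign_to_target` are all hypothetical, and filling the smallest groups first is a design choice of this sketch.

```r
set.seed(1)  # only so the illustration is reproducible

# Hypothetical predicted probabilities for 6 people and 3 categories.
probs <- matrix(
  c(0.7, 0.2, 0.1,
    0.6, 0.3, 0.1,
    0.2, 0.7, 0.1,
    0.3, 0.3, 0.4,
    0.1, 0.2, 0.7,
    0.5, 0.4, 0.1),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("white", "black", "asian"))
)

# Hypothetical target counts; they must sum to nrow(probs).
target <- c(white = 3, black = 2, asian = 1)

assign_to_target <- function(probs, target) {
  stopifnot(sum(target) == nrow(probs),
            all(names(target) %in% colnames(probs)))
  assigned <- rep(NA_character_, nrow(probs))
  pool <- seq_len(nrow(probs))  # people not yet assigned
  # Fill the smallest groups first so rare groups are not crowded out.
  for (grp in names(sort(target))) {
    n <- target[[grp]]
    if (n == 0) next
    # Draw n people without replacement, weighted by their probability
    # for this group. This errors if fewer than n people in the pool
    # have a positive probability for the group.
    pick <- sample.int(length(pool), size = n, prob = probs[pool, grp])
    assigned[pool[pick]] <- grp
    pool <- pool[-pick]
  }
  assigned
}

result <- assign_to_target(probs, target)
table(result)  # counts match `target` by construction
```

The individual assignments are random, but the group totals always equal the target counts, which is the point of the approach when you know what the population should "look like" at the end.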