kosukeimai / wru

Who Are You? Bayesian Prediction of Racial Category Using Surname and Geolocation

Issue with predictions using only first and last name #88

Closed burnhilla closed 2 years ago

burnhilla commented 2 years ago

Hi, I am trying to predict ethnicity using first name and last name. However, the predictions turn out to be the same for everyone who shares a last name, regardless of their first name.

Code below with some output:

Initialization

library(haven)  # read_dta()
library(dplyr)  # rename()
library(wru)    # predict_race()

data <- read_dta("datafile.dta")
data <- rename(data, surname = last_name_new, first = first_name_new)
newdata <- data[which(data$surname == "SMITH"), ]
newdata <- newdata[c("first", "surname")]

# The following command does not work and produces a warning message ("Unknown or uninitialised column: `state`.")
res <- predict_race(voter.file = newdata, surname.only = FALSE, names.to.use = "surname, first")

# The following command works, but the results are weird
res <- predict_race(
  voter.file = newdata,
  surname.only = TRUE,
  names.to.use = "surname, first"
)
Proceeding with first and last name-only predictions...
i All local files already up-to-date!
i All local files already up-to-date!
3 (0%) individuals' first names were not matched.
Warning message:
Unknown or uninitialised column: `state`. 

> head(res)
        first surname pred.whi  pred.bla   pred.his   pred.asi  pred.oth
1   CHERIENCE   SMITH 0.246435 0.4154406 0.03224136 0.02447646 0.2814067
4        GABI   SMITH 0.246435 0.4154406 0.03224136 0.02447646 0.2814067
6       RONNI   SMITH 0.246435 0.4154406 0.03224136 0.02447646 0.2814067
9       SHAWN   SMITH 0.246435 0.4154406 0.03224136 0.02447646 0.2814067
77     SANDRA   SMITH 0.246435 0.4154406 0.03224136 0.02447646 0.2814067
244    AMANDA   SMITH 0.246435 0.4154406 0.03224136 0.02447646 0.2814067

1beb commented 2 years ago

This is a case where the software isn't warning you about a conflicting set of parameters. When you specify surname.only = TRUE, your other choices don't matter. names.to.use is meant to be used together with census.geo.
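
To make the conflict concrete, here is a minimal sketch of the kind of guard that could catch it before calling predict_race. The helper check_name_args is hypothetical and is not part of wru; it only encodes the rule described in this comment, that surname.only = TRUE makes any richer names.to.use setting irrelevant.

```r
# Hypothetical helper, NOT part of wru: flags the argument combination
# discussed in this thread before any prediction is attempted.
check_name_args <- function(surname.only, names.to.use) {
  if (isTRUE(surname.only) && !identical(names.to.use, "surname")) {
    warning(
      "surname.only = TRUE: names.to.use = '", names.to.use,
      "' will be ignored.",
      call. = FALSE
    )
    return(FALSE)
  }
  TRUE
}

# The combination from the original post triggers the warning.
ok_conflict <- suppressWarnings(check_name_args(TRUE, "surname, first"))
# A consistent combination passes silently.
ok_plain <- check_name_args(TRUE, "surname")
```

Calling it with the same arguments you plan to pass to predict_race would surface the mismatch early instead of silently returning surname-only predictions.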

burnhilla commented 2 years ago

Hi! Thanks for your response. However, that still does not solve my problem, which is estimating ethnicity using FIRST and LAST name, and nothing else.

For instance, res <- predict_race(voter.file = newdata) produces an error: "Unknown or uninitialised column: `state`."

Is there a way around this?

1beb commented 2 years ago

There is not currently a method for "surname, first" without a census geography. You might be able to use the raw probabilities by working with the merge_names / merge_surnames functions.
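
As a rough illustration of what combining raw name probabilities by hand could look like, the sketch below applies a naive Bayes update: P(race | surname, first) is taken as proportional to P(race | surname) * P(first | race), assuming first name and surname are independent given race. Every number here is made up for the example and does not come from wru's dictionaries; in practice the two tables would have to be extracted from the package's name data, and combine_name_probs is a hypothetical helper, not a wru function.

```r
# Toy illustration only: all probabilities below are invented, not taken from wru.
races <- c("whi", "bla", "his", "asi", "oth")

# Hypothetical P(race | surname) for a single surname.
p_race_given_surname <- c(whi = 0.70, bla = 0.23, his = 0.02, asi = 0.01, oth = 0.04)

# Hypothetical P(first name | race), one row per first name.
p_first_given_race <- rbind(
  SANDRA = c(whi = 0.004, bla = 0.003, his = 0.006, asi = 0.001, oth = 0.002),
  SHAWN  = c(whi = 0.002, bla = 0.004, his = 0.001, asi = 0.001, oth = 0.002)
)

# Hypothetical helper: multiply the two sources and renormalise so the
# posterior over races sums to 1.
combine_name_probs <- function(p_surname, p_first) {
  unnormalized <- p_surname * p_first
  unnormalized / sum(unnormalized)
}

# One posterior row per first name, columns in the order given by `races`.
posterior <- t(apply(p_first_given_race, 1, function(p_first) {
  combine_name_probs(p_race_given_surname[races], p_first[races])
}))

print(round(posterior, 3))
```

With this kind of update, two people who share a surname but have different first names end up with different predictions, which is the behaviour the original post was looking for.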