kosukeimai / wru

Who Are You? Bayesian Prediction of Racial Category Using Surname and Geolocation

Determining which names are not matched #85

Closed NightSmile96 closed 2 years ago

NightSmile96 commented 2 years ago

I'm not sure if this is the right place to post this, so apologies in advance if not. I have been using the wru package to predict the race of campaign donors and have a dataset with the geographic information, first name, and last name. There are 14,091 observations in my dataset, but when I run the race predict command, 1349 last names and 98 first names are not matched. The message I get is the following: "1349 (9.6%) individuals' last names were not matched. 98 (0.7%) individuals' first names were not matched." Is there a way to view which names are not matched? I want to see if there is an issue in the name field.

1beb commented 2 years ago

You can download the name files directly and compare them against your dataset. Problems we have seen in the past include names with special characters and name fields that contain more than one name.

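# Requires the piggyback package; downloads the file from the repo's GitHub release assets.
piggyback::pb_download("wru-data-census_first_c.rds", repo = "kosukeimai/wru")
r <- readRDS("wru-data-census_first_c.rds")
# Column name as given here; run names(r) to confirm which columns the file has.
r$last_name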
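
Once a name file is loaded, the comparison against your own data could look roughly like the sketch below. It uses toy data so it runs on its own: `dict` is a hypothetical stand-in for the downloaded name file, `donors` and its `surname` column are hypothetical stand-ins for your dataset, and the normalisation step (trim and upper-case) is an assumption about how the dictionary stores names, not a description of what wru does internally.

```r
# Toy stand-ins (hypothetical): `dict` plays the role of the downloaded name
# file, `donors` plays the role of your own data.
dict <- data.frame(
  last_name = c("SMITH", "GARCIA", "NGUYEN"),
  stringsAsFactors = FALSE
)
donors <- data.frame(
  surname = c("Smith", "garcia ", "O'Neil-Smith", "Nguyen", "Zzyzx"),
  stringsAsFactors = FALSE
)

# Assumed normalisation: trim whitespace and upper-case before comparing.
normalise <- function(x) toupper(trimws(x))
donors$surname_clean <- normalise(donors$surname)

# Keep the rows whose cleaned surname is absent from the dictionary.
unmatched <- donors[!(donors$surname_clean %in% dict$last_name), , drop = FALSE]

# Flag likely causes: anything other than A-Z (apostrophes, hyphens, spaces)
# points to special characters or more than one name in the field.
unmatched$has_special <- grepl("[^A-Z]", unmatched$surname_clean)

print(unmatched)
```

Rows with `has_special` set are candidates for cleaning the name field; the remaining rows are more likely names that are simply not in the dictionary.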

This has come up before, so we may add an option to "save out" the list of names that aren't matched. In my experience with this package, around 10% unmatched is typical, so your rate is pretty reasonable; much more than that would be concerning.

Another alternative is to clone the repository and set a breakpoint at the point where the message is emitted, which lets you inspect the unmatched names in the environment.