kosukeimai / wru

Who Are You? Bayesian Prediction of Racial Category Using Surname and Geolocation

Format for surnames #78

Closed mrhumanzee closed 2 years ago

mrhumanzee commented 2 years ago

Forgive me if someone else has already asked this question (I was not able to find any documentation on it). Is there a particular format that strings in the surname variable must be in for predict_race() to work? For example: Does name capitalization matter? Do special characters matter, or should they be removed? Can surnames be hyphenated? Can surnames be more than one word (i.e. double-barreled names)? Do name suffixes matter, or should they be removed?

1beb commented 2 years ago

Hi @mrhumanzee, you can download the surname file used by the package directly and inspect it:

piggyback::pb_download("wru-data-census_last_c.rds", repo = "kosukeimai/wru")
r <- readRDS("wru-data-census_last_c.rds")
r$last_name

They are all capitalized, with no punctuation or spacing. @solivella or @kosukeimai may know more about how the file was normalized (or it may be described in the paper somewhere).
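
If your own surnames need to match that form, a minimal sketch along these lines might help. Note that `normalize_surname()` is a hypothetical helper, not part of wru, and the rule it applies (upper-case, then drop everything except A-Z) is only an assumption based on what the file looks like. It does not handle suffixes such as "Jr", it silently drops accented letters, and nothing in this thread establishes whether predict_race() already does any cleaning of its own.

```r
# Hypothetical helper (not part of wru): coerce surnames to the form seen in
# the census surname file, i.e. upper case with no punctuation or spaces.
# Assumption: dropping every character outside A-Z is acceptable for your data.
normalize_surname <- function(x) {
  x <- toupper(x)
  # perl = TRUE keeps the A-Z range ASCII-only regardless of locale collation
  gsub("[^A-Z]", "", x, perl = TRUE)
}

normalize_surname(c("O'Brien", "Smith-Jones", "de la Cruz"))
# "OBRIEN" "SMITHJONES" "DELACRUZ"
```

It is probably worth checking afterwards how many of your normalized names actually appear in `r$last_name`, since hyphenated or multi-word surnames joined this way may have no match in the file.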