kosukeimai / wru

Who Are You? Bayesian Prediction of Racial Category Using Surname and Geolocation

Error when using fBISG #86

Closed aridf closed 2 years ago

aridf commented 2 years ago

I'm trying to run fBISG on a large-ish voter file for the first time since the last update. It's great to see how fast it runs, but I keep hitting an error. Here's the code I ran:

vf <- vf %>%
  rename('block' = 'fips_block_2010',
         'county' = 'fips_county_2010',
         'tract' = 'fips_tract_2010',
         'surname' = 'last_name',
         'first' = 'first_name') %>%
  mutate("state" = 'GA') %>%
  wru::predict_race(
       census.geo = "block",
       model = 'fBISG',
       names.to.use = 'surname, first'
   )

After downloading all the state data, I get this output:

Using `predict_race` to obtain initial race prediction priors with BISG model
Proceeding with first and last name-only predictions...
ℹ Downloading "wru-data-census_last_c.rds"...
  |==========================================================================| 100%
ℹ Downloading "wru-data-first_c.rds"...
  |==========================================================================| 100%
ℹ Downloading "wru-data-last_c.rds"...
  |==========================================================================| 100%
ℹ Downloading "wru-data-mid_c.rds"...
  |==========================================================================| 100%
Proceeding with Census geographic data at block level...
Using Census geographic data from provided census.data object...
State 1 of 1: GA
ℹ All local files already up-to-date!
Error in `[.data.frame`(df2, , names(df1)) : undefined columns selected

Two questions about this:

  1. What might cause this error?
  2. It seems like middle name data is being downloaded here even though I've chosen the 'surname, first' option. It seems to download very fast, but still might be ideal to skip the download when it's not needed.
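For reference on question 1, the message itself is what base R raises when a data frame is indexed by column names it does not have. The following is a minimal sketch of that situation; `df1` and `df2` are toy data frames made up for illustration, not wru's internal objects, and the sketch does not show where in wru the mismatch actually occurs.

```r
# Toy data frames (hypothetical; these are not wru internals).
df1 <- data.frame(surname = "SMITH", first = "ANN", state = "GA")
df2 <- data.frame(surname = "JONES", state = "GA")  # no 'first' column

# Columns that df1 has but df2 lacks.
missing_cols <- setdiff(names(df1), names(df2))
print(missing_cols)

# Indexing df2 by df1's column names fails because 'first' is absent.
msg <- tryCatch(
  {
    df2[, names(df1)]
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```

Comparing column names with `setdiff()` before any such indexing step is one way to see which column is missing.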

I can't share my entire dataset but here's a glimpse where I've removed the names:

Rows: 6,302,058
Columns: 23
$ registration_number <dbl> 2, 3, 4, 7, 10, 16, 21, 24, 27, 31, 33, 36, 37, 38, 41…
$ last_name           <chr> "X", "X", "X", "X", "X", "X"…
$ first_name          <chr> "X", "X", "X", "X", "X", "X", …
$ middle_maiden_name  <chr> "ROGER", "G", "D", "A", NA, "GIBSON", "VERDELL", "WELD…
$ name_suffix         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "SR", …
$ address             <chr> "237 MITCHELL RD,MAYSVILLE,GA,30558", "467 HICKORY CRE…
$ voter_status        <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A",…
$ gender              <chr> "M", "M", "M", "M", "F", "M", "M", "M", "F", "M", "F",…
$ party_last_voted    <chr> NA, NA, NA, "R", NA, NA, NA, "R", NA, NA, "R", "D", NA…
$ birthyear           <dbl> 1937, 1940, 1954, 1952, 1945, 1937, 1948, 1956, 1935, …
$ date_registration   <date> 1967-10-03, 1958-04-07, 1972-07-06, 1971-08-14, 1963-…
$ date_added          <date> 1995-02-04, 1995-02-04, 1995-02-04, 1995-02-04, 1995-…
$ date_changed        <date> 2019-11-22, 2020-02-10, 2018-11-19, 2019-10-01, 2019-…
$ date_last_contact   <date> 2019-11-05, 2020-02-04, 2018-11-06, 2019-10-01, 2019-…
$ date_last_voted     <date> 2019-11-05, 2020-02-04, 2018-11-06, 2019-04-09, 2019-…
$ geometry            <chr> "c(-83.562004, 34.267868)", "c(-83.54833, 34.308254)",…
$ fips_2010           <chr> "130119703002032", "130119703002000", "130119703002007…
$ fips_state_2010     <chr> "13", "13", "13", "13", "13", "13", "13", "13", "13", …
$ fips_county_2010    <chr> "011", "011", "011", "011", "011", "011", "011", "011"…
$ fips_tract_2010     <chr> "970300", "970300", "970300", "970300", "970300", "970…
$ fips_block_2010     <chr> "2032", "2000", "2007", "2043", "2092", "2028", "2096"…
$ precinct_id_2018    <chr> "Banks,Anderson", "Banks,Anderson", "Banks,Anderson", …

solivella commented 2 years ago

Thank you for sharing this issue with us, and for working with wru! To answer question 1, could you try storing the data object separately, converting it to a data frame (using as.data.frame), and passing that data frame to the predict_race function outside of the pipe? If that gets rid of the issue, it will help us debug it and enable piped code like yours.

1beb commented 2 years ago

There's no state variable in your data set. Please model your data after data(voters). Also, the column should be named last, not surname. I think you followed the model of the argument option here.

1beb commented 2 years ago

@solivella two things we can do to help with this in the future:

aridf commented 2 years ago

My code creates the "state" column prior to running predict_race. When I run the same code with "last" instead of "surname", I get:

vf <- vf %>%
  rename('block' = 'fips_block_2010',
         'county' = 'fips_county_2010',
         'tract' = 'fips_tract_2010',
         'last' = 'last_name',
         'first' = 'first_name') %>%
  mutate("state" = 'GA')

vf <- as.data.frame(vf)

vf <-  wru::predict_race(
  vf,
  census.geo = "block",
  model = 'fBISG',
  names.to.use = 'surname, first'
)

Proceeding with first and last name-only predictions...
Error in predict_race_new(voter.file = voter.file, names.to.use = names.to.use,  : 
  Voter data frame needs to have a column named 'surname' and a column called 'first'

Indeed, data(voters) indicates that I should have the surname column, not the last column.
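A quick pre-flight check on column names can surface this kind of mismatch before calling predict_race. The sketch below uses only base R; `vf_demo` is a made-up stand-in for the voter file, and the `required` vector is an assumption pieced together from the error message and the code in this thread, not taken from the package documentation.

```r
# Hypothetical stand-in for a voter file whose name column is called 'last'.
vf_demo <- data.frame(
  last = c("SMITH", "JONES"),
  first = c("ANN", "BOB"),
  state = "GA",
  county = "011",
  tract = "970300",
  block = c("2032", "2000"),
  stringsAsFactors = FALSE
)

# Assumed column names, collected from this thread only.
required <- c("surname", "first", "state", "county", "tract", "block")

# Report any required column that the data frame does not have.
missing_cols <- setdiff(required, names(vf_demo))
if (length(missing_cols) > 0) {
  message("Missing columns: ", paste(missing_cols, collapse = ", "))
}
```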

Do I have the right version? Here is my session info:

R version 4.2.1 (2022-06-23)
Platform: x86_64-apple-darwin20.6.0 (64-bit)
Running under: macOS Monterey 12.6

Matrix products: default
LAPACK: /usr/local/Cellar/r/4.2.1_2/lib/R/lib/libRlapack.dylib

locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] forcats_0.5.2   stringr_1.4.0   dplyr_1.0.9     purrr_0.3.4     tidyr_1.2.0    
 [6] tibble_3.1.8    ggplot2_3.3.6   tidyverse_1.3.2 readr_2.1.2     wru_1.0.1      

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.9          lubridate_1.8.0     PL94171_1.0.2       listenv_0.8.0      
 [5] gitcreds_0.1.1      assertthat_0.2.1    digest_0.6.29       utf8_1.2.2         
 [9] parallelly_1.32.1   R6_2.5.1            cellranger_1.1.0    backports_1.4.1    
[13] reprex_2.0.2        httr_1.4.3          pillar_1.8.0        rlang_1.0.6        
[17] curl_4.3.2          googlesheets4_1.0.1 readxl_1.4.1        rstudioapi_0.13    
[21] furrr_0.3.1         googledrive_2.0.0   bit_4.0.4           munsell_0.5.0      
[25] broom_1.0.0         compiler_4.2.1      modelr_0.1.9        pkgconfig_2.0.3    
[29] globals_0.16.1      tidyselect_1.2.0    codetools_0.2-18    fansi_1.0.3        
[33] future_1.27.0       crayon_1.5.2        tzdb_0.3.0          dbplyr_2.2.1       
[37] withr_2.5.0         piggyback_0.1.4     grid_4.2.1          jsonlite_1.8.0     
[41] gtable_0.3.0        lifecycle_1.0.3     DBI_1.1.3           magrittr_2.0.3     
[45] scales_1.2.0        vroom_1.5.7         cli_3.4.1           stringi_1.7.8      
[49] cachem_1.0.6        fs_1.5.2            xml2_1.3.3          ellipsis_0.3.2     
[53] generics_0.1.3      vctrs_0.4.1         gh_1.3.0            tools_4.2.1        
[57] bit64_4.0.5         glue_1.6.2          hms_1.1.1           parallel_4.2.1     
[61] fastmap_1.1.0       colorspace_2.0-3    gargle_1.2.0        rvest_1.0.3        
[65] memoise_2.0.1       haven_2.5.1  

1beb commented 2 years ago

I see that there are some conflicting instructions in the merge_names documentation (https://github.com/kosukeimai/wru/blob/ac20b26490ecc51aaa0779c3f169eb4b2970024d/R/merge_names.R#L26). We'll get it sorted out.

1beb commented 2 years ago

I'm having trouble reproducing this error:

library(wru)
library(future)
library(furrr)
library(tidyverse)

plan(multisession)
data(voters)

census <- get_census_data(states="NY", county.list = list(NY = "061"))

# with surname and last, no error
vf <-  voters %>% 
  filter(state == "NY") %>% 
  mutate(surname = last)

test <- wru::predict_race(
  vf,
  census.data = census,
  census.geo = "block",
  model = 'fBISG',
  names.to.use = 'surname, first'
)

# missing surname, errors
vf <-  voters %>% 
  filter(state == "NY") %>% 
  mutate(surname = NULL)

test <- wru::predict_race(
  vf,
  census.data = census,
  census.geo = "block",
  model = 'fBISG',
  names.to.use = 'surname, first'
)

# missing last (no failure)
vf <-  voters %>% 
  filter(state == "NY") %>% 
  mutate(last = NULL)

test <- wru::predict_race(
  vf,
  census.data = census,
  census.geo = "block",
  model = 'fBISG',
  names.to.use = 'surname, first'
)

aridf commented 2 years ago

I tried your example and it worked fine, which suggested to me the problem must be due to hidden differences in the datasets. I was able to get the function to work correctly using the following code:

vf <- vf %>%
  rename('block' = 'fips_block_2010',
         'county' = 'fips_county_2010',
         'tract' = 'fips_tract_2010',
         'surname' = 'last_name',
         'first' = 'first_name') %>%
  mutate("state" = 'GA')

vf <- as.data.frame(vf)

census <- get_census_data(states="GA")

vf <-  wru::predict_race(
  vf %>% 
    select(surname, first, block, tract, county, state) %>%
    filter(!is.na(surname)),
  census.data = census,
  census.geo = "block",
  model = 'fBISG',
  names.to.use = 'surname, first'
)

So the two things I had to do were 1) remove observations with missing surnames and 2) remove all unnecessary columns.

1) might be the desirable behavior, although there's some argument for just using a geography-based probability when all names are absent.

2) seems like something to be addressed. I'm not sure exactly which part of my dataset caused the function to crash. I will send over a sample of the data for testing via email, @1beb.
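The two workaround steps can be sketched in base R on toy data, which also makes it easy to count how many rows the surname filter drops. Everything in `vf_demo` is invented for illustration, and the `keep` vector simply mirrors the columns selected in the code above.

```r
# Hypothetical toy voter file with one missing surname and one extra column.
vf_demo <- data.frame(
  surname = c("SMITH", NA, "JONES"),
  first = c("ANN", "BOB", "CARL"),
  state = "GA",
  county = "011",
  tract = "970300",
  block = c("2032", "2000", "2007"),
  geometry = c("c(-83.5, 34.2)", "c(-83.5, 34.3)", "c(-83.6, 34.3)"),
  stringsAsFactors = FALSE
)

# Columns kept in the workaround above.
keep <- c("surname", "first", "block", "tract", "county", "state")

# Count rows with a missing surname, then drop them and the extra columns.
n_missing <- sum(is.na(vf_demo$surname))
vf_clean <- vf_demo[!is.na(vf_demo$surname), keep, drop = FALSE]

print(n_missing)
print(names(vf_clean))
```

Applying only one of the two steps at a time to the real data would be one way to narrow down which of them is responsible for the crash.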

1beb commented 2 years ago

It looks like this one is resolved. I couldn't reproduce the error with the dataset provided. I'm going to close until there's a reprex.