aridf closed this issue 2 years ago
Thank you for sharing this issue with us, and for working with `wru`! To answer question 1, could you try storing the data object separately, converting it to a data frame (using `as.data.frame`), and passing the data frame object to the `predict_race` function outside of the pipe? If that gets rid of the issue, it will help us debug it and enable piped code like the one you are using.
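A minimal sketch of that suggestion, assuming `dplyr` is installed. The toy tibble and its column names are placeholders rather than the reporter's data, and the `predict_race` call is left as a comment because its arguments depend on your own setup:

```r
library(dplyr)

# Toy tibble standing in for the piped voter file (values are made up).
vf_tbl <- tibble(
  surname = c("Smith", "Lopez"),
  first   = c("Ann", "Luis")
)

# Step 1: finish the pipe and store the result as its own object.
vf_tbl <- vf_tbl %>%
  mutate(state = "GA")

# Step 2: convert the tibble to a plain data frame.
vf_df <- as.data.frame(vf_tbl)
print(class(vf_df))

# Step 3: pass the data frame to predict_race as a separate statement,
# outside of any pipe, keeping the other arguments from your own call:
# result <- wru::predict_race(vf_df, <your other arguments>)
```

If the error disappears with `vf_df`, that points at the tibble or the pipe rather than at the data itself.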
There's no `state` variable in your data set. Please model your data after `data(voters)`. As well, it should be `last`, not `surname`. I think you followed the model of the argument option here.
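One way to confirm up front that the expected columns are present is a small pre-flight check. The sketch below uses base R only; `check_voter_columns` is a hypothetical helper, not a function in `wru`, and the required column names are passed in by the caller, so it takes no position on which names a given `wru` version expects:

```r
# Hypothetical helper (not part of wru): stop early if expected columns are missing.
check_voter_columns <- function(ds, required) {
  missing_cols <- setdiff(required, names(ds))
  if (length(missing_cols) > 0) {
    stop(
      "Voter data frame is missing column(s): ",
      paste(missing_cols, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# Toy data frame standing in for a voter file (values are made up).
ds <- data.frame(
  surname = c("Smith", "Lopez"),
  first = c("Ann", "Luis"),
  stringsAsFactors = FALSE
)

# Passes silently: both name columns are present.
check_voter_columns(ds, c("surname", "first"))

# Signals an error because 'state' is absent; tryCatch captures the message.
msg <- tryCatch(
  check_voter_columns(ds, c("surname", "first", "state")),
  error = function(e) conditionMessage(e)
)
print(msg)
```

Running a check like this before calling `predict_race` turns a missing column into an explicit message that names the column.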
@solivella two things we can do to help with this in the future:
1) Check `'state' %in% names(ds)`. Program should stop and warn the user.
2) Check `'last' %in% names(ds)` (and each option of `names.to.use`) to make sure that the right columns are available for each `names.to.use` option. Program should stop and warn the user.

My code creates the "state" column prior to running `predict_race`. When I run the same code with "last" instead of "surname", I get:
vf <- vf %>%
  rename('block' = 'fips_block_2010',
         'county' = 'fips_county_2010',
         'tract' = 'fips_tract_2010',
         'last' = 'last_name',
         'first' = 'first_name') %>%
  mutate("state" = 'GA')
vf <- as.data.frame(vf)
vf <- wru::predict_race(
  vf,
  census.geo = "block",
  model = 'fBISG',
  names.to.use = 'surname, first'
)
Proceeding with first and last name-only predictions...
Error in predict_race_new(voter.file = voter.file, names.to.use = names.to.use, :
Voter data frame needs to have a column named 'surname' and a column called 'first'
Indeed, `data(voters)` indicates that I should have the `surname` column, not the `last` column.
Do I have the right version? Here is my session info:
R version 4.2.1 (2022-06-23)
Platform: x86_64-apple-darwin20.6.0 (64-bit)
Running under: macOS Monterey 12.6
Matrix products: default
LAPACK: /usr/local/Cellar/r/4.2.1_2/lib/R/lib/libRlapack.dylib
locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] forcats_0.5.2 stringr_1.4.0 dplyr_1.0.9 purrr_0.3.4 tidyr_1.2.0
[6] tibble_3.1.8 ggplot2_3.3.6 tidyverse_1.3.2 readr_2.1.2 wru_1.0.1
loaded via a namespace (and not attached):
[1] Rcpp_1.0.9 lubridate_1.8.0 PL94171_1.0.2 listenv_0.8.0
[5] gitcreds_0.1.1 assertthat_0.2.1 digest_0.6.29 utf8_1.2.2
[9] parallelly_1.32.1 R6_2.5.1 cellranger_1.1.0 backports_1.4.1
[13] reprex_2.0.2 httr_1.4.3 pillar_1.8.0 rlang_1.0.6
[17] curl_4.3.2 googlesheets4_1.0.1 readxl_1.4.1 rstudioapi_0.13
[21] furrr_0.3.1 googledrive_2.0.0 bit_4.0.4 munsell_0.5.0
[25] broom_1.0.0 compiler_4.2.1 modelr_0.1.9 pkgconfig_2.0.3
[29] globals_0.16.1 tidyselect_1.2.0 codetools_0.2-18 fansi_1.0.3
[33] future_1.27.0 crayon_1.5.2 tzdb_0.3.0 dbplyr_2.2.1
[37] withr_2.5.0 piggyback_0.1.4 grid_4.2.1 jsonlite_1.8.0
[41] gtable_0.3.0 lifecycle_1.0.3 DBI_1.1.3 magrittr_2.0.3
[45] scales_1.2.0 vroom_1.5.7 cli_3.4.1 stringi_1.7.8
[49] cachem_1.0.6 fs_1.5.2 xml2_1.3.3 ellipsis_0.3.2
[53] generics_0.1.3 vctrs_0.4.1 gh_1.3.0 tools_4.2.1
[57] bit64_4.0.5 glue_1.6.2 hms_1.1.1 parallel_4.2.1
[61] fastmap_1.1.0 colorspace_2.0-3 gargle_1.2.0 rvest_1.0.3
[65] memoise_2.0.1 haven_2.5.1
I see that there are some conflicting instructions in the `merge_names` documentation (https://github.com/kosukeimai/wru/blob/ac20b26490ecc51aaa0779c3f169eb4b2970024d/R/merge_names.R#L26). We'll get it sorted out.
I'm having trouble reproducing this error:
library(wru)
library(future)
library(furrr)
library(tidyverse)
plan(multisession)
data(voters)
census <- get_census_data(states = "NY", county.list = list(NY = "061"))

# with surname and last, no error
vf <- voters %>%
  filter(state == "NY") %>%
  mutate(surname = last)
test <- wru::predict_race(
  vf,
  census.data = census,
  census.geo = "block",
  model = 'fBISG',
  names.to.use = 'surname, first'
)

# missing surname, errors
vf <- voters %>%
  filter(state == "NY") %>%
  mutate(surname = NULL)
test <- wru::predict_race(
  vf,
  census.data = census,
  census.geo = "block",
  model = 'fBISG',
  names.to.use = 'surname, first'
)

# missing last (no failure)
vf <- voters %>%
  filter(state == "NY") %>%
  mutate(last = NULL)
test <- wru::predict_race(
  vf,
  census.data = census,
  census.geo = "block",
  model = 'fBISG',
  names.to.use = 'surname, first'
)
I tried your example and it worked fine, which suggested to me the problem must be due to hidden differences in the datasets. I was able to get the function to work correctly using the following code:
vf <- vf %>%
  rename('block' = 'fips_block_2010',
         'county' = 'fips_county_2010',
         'tract' = 'fips_tract_2010',
         'surname' = 'last_name',
         'first' = 'first_name') %>%
  mutate("state" = 'GA')
vf <- as.data.frame(vf)
census <- get_census_data(states = "GA")
vf <- wru::predict_race(
  vf %>%
    select(surname, first, block, tract, county, state) %>%
    filter(!is.na(surname)),
  census.data = census,
  census.geo = "block",
  model = 'fBISG',
  names.to.use = 'surname, first'
)
So the two things I had to do were 1) remove obs with missing surnames and 2) remove all unnecessary columns.
1) might be the desirable behavior, although there's some argument for just using a geography-based probability when all names are absent.
2) Seems like something to be addressed. I'm not exactly sure what part of my dataset caused the function to crash. I will send over a sample of the data for testing via email @1beb.
It looks like this one is resolved. I couldn't reproduce the error with the dataset provided. I'm going to close this until there's a reprex.
I'm trying to run fBISG on a large-ish voter file for the first time since the last update. Great to see how fast it's running, but I keep hitting an error. Here's the code I run:
After downloading all the state data, I get this output:
Two questions about this:
I can't share my entire dataset but here's a glimpse where I've removed the names: