kosukeimai / wru

Who Are You? Bayesian Prediction of Racial Category Using Surname and Geolocation

Data access failure with `census.geo = "tract"` #72

Closed benjamin-chan closed 2 years ago

benjamin-chan commented 2 years ago

I don't have any issues with census.geo = "county" but when I switch to census.geo = "tract" I get a data access failure error.

probabilities <-
  df %>%
  predict_race(surname.only = FALSE,
               surname.year = 2020,
               census.geo = "tract",
               census.key = key,
               age = FALSE,
               sex = FALSE)

returns

Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
  InternetOpenUrl failed: 'The connection with the server was reset'
Try census server again: https://api.census.gov/data/2010/dec/sf1?
Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
  InternetOpenUrl failed: 'The connection with the server was reset'
Try census server again: https://api.census.gov/data/2010/dec/sf1?
Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
  InternetOpenUrl failed: 'The connection with the server was reset'
Try census server again: https://api.census.gov/data/2010/dec/sf1?
Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
  InternetOpenUrl failed: 'The connection with the server was reset'
Data access failure at the census website, please try again by re-run the previous command

...

Error in get_census_api_2(data_url, key, get, region, retry)
head() of df:

surname state county tract  block
O*      OR    051    010305 2002
S*      OR    039    000403 2021
P*      OR    051    001701 2008
R*      OR    067    032700 2056
P*      OR    051    007900 1022
M*      OR    051    001301 2003

> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] maps_3.4.0       wru_1.0.0        RODBC_1.3-19     readr_2.1.2
[5] tidyr_1.2.0      dplyr_1.0.9      magrittr_2.0.3   checkpoint_1.0.2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.9        PL94171_1.0.2     pillar_1.7.0      compiler_4.1.2
 [5] tools_4.1.2       digest_0.6.29     jsonlite_1.8.0    memoise_2.0.1
 [9] lifecycle_1.0.1   tibble_3.1.7      pkgconfig_2.0.3   rlang_1.0.4
[13] cli_3.3.0         parallel_4.1.2    fastmap_1.1.0     furrr_0.3.0      
[17] stringr_1.4.0     generics_0.1.3    vctrs_0.4.1       globals_0.15.1
[21] hms_1.1.1         tidyselect_1.1.2  glue_1.6.2        listenv_0.8.0
[25] R6_2.5.1          fansi_1.0.3       parallelly_1.32.0 piggyback_0.1.4
[29] tzdb_0.3.0        purrr_0.3.4       codetools_0.2-18  ellipsis_0.3.2   
[33] future_1.26.1     utf8_1.2.2        stringi_1.7.6     cachem_1.0.6
[37] crayon_1.5.1     
1beb commented 2 years ago

Can you try just opening a partial URL in the browser? This looks like a temporary failure.

For example: https://api.census.gov/data/2010/dec/sf1

Do you get a json page?

1beb commented 2 years ago

I'm unable to recreate this. Here's a reprex from my end:

library(wru)
data(voters) # part of the wru package
r <- predict_race(
  voter.file = voters[voters$state == "NJ", ], 
  surname.only = FALSE,
  surname.year = 2020,
  census.geo = "tract",
  census.key = Sys.getenv("CENSUS_API_KEY"), 
  age = FALSE,
  sex = FALSE
)
benjamin-chan commented 2 years ago

Ah, I see the voters data has state, county, tract, and block as factors. predict_race() works when I convert my columns to factors.

benjamin-chan commented 2 years ago

I take back my comment about factor()ing. It seems like the census.gov API might be throttling requests depending on the data frame size.

probabilities <-
  df %>%
  # dplyr::sample_n(1600) %>%  # No errors
  dplyr::sample_n(1700) %>%  # Results in Error in file(con, "r") : cannot open the connection
  predict_race(surname.only = FALSE,
               surname.year = 2020,
               census.geo = "tract",
               census.key = key)

Please close the issue if this is unrelated to the wru package.

1beb commented 2 years ago

That's strange. Is your census key quite old, or shared? For example, I don't get rate limited when running in parallel on a full voter file run (200M+) over the course of a day.

benjamin-chan commented 2 years ago

My key was originally generated July 2020. When I tried to generate a new key using the same email address as my previous key, it gave me the same one. I'll keep troubleshooting the API but would welcome any ideas.

1beb commented 2 years ago

I've got nothing, mainly because I hammer the API and I've never received a timeout. Can you try using use.counties = TRUE to see if that helps? It will limit the census data pull to just those tracts that are within the counties that are in your voter file.

Are you currently in the US?

benjamin-chan commented 2 years ago

Same issue with use.counties = TRUE. I'm in Oregon and filter my data frame with filter(state == "OR"). I also filter out invalid geocodes (zip5 only, intersection, etc.) so it's not passing junk census tracts. I wonder if there's rate limiting on my side (state government agency). I'll try to play around on an Azure VM.

1beb commented 2 years ago

With respect to the census tracts, are you sure they are 2010 tracts? I know that sounds like a strange question, but tracts changed with the decennial census, and although some are equivalently named, they are not necessarily the same places and some may not exist. This could be why you're getting strange results (you're submitting tracts that don't exist in the census year you're pulling from).

edit: 2010*, that's the function default for the year argument.

benjamin-chan commented 2 years ago

I tried restricting to records I geocoded in May 2022 and no luck. I also tried with year = "2010" and year = "2020" and get the same issue. Also tried bumping up retry with no luck.

FWIW, I used RedPoint to generate the geocoding. And most of the data had been geocoded late-Nov, early-Dec 2021. My data is from 2019-2021 records.

I should also add that I didn't have an issue with version wru_0.1-12

1beb commented 2 years ago

What's strange to me is that you have a subset that is leading to a failure. Here's what I might try. Run each row, see which ones fail. Show us those rows. There must be something about them that Census API dislikes. My suspicion is mismatched census tracts but you've ruled that out. Here's some sample code that can assist with the investigation:

library(wru)
library(dplyr)
library(purrr)

set.seed(42) # make the sample reproducible; if there are no failures, adjust the sample size up or down

df <- load_your_df # pseudo-code: replace with your own data frame
df <- dplyr::sample_n(df, 1700)

# Run predict_race() one row at a time, capturing any error instead of stopping
error_scanner <- purrr::map(seq_len(nrow(df)), function(i) {
  tryCatch({ predict_race(
               voter.file = df[i, ],
               surname.only = FALSE,
               surname.year = 2020,
               census.geo = "tract",
               census.key = key)}, error = function(e) e)
})

Now we can run over our list to see whether anything in error_scanner inherits from the "error" class:

rows <- purrr::map(error_scanner, function(x) inherits(x, "error")) %>% unlist() %>% which()
df[rows, ] # will output the problematic rows
error_scanner[rows] # will output a list of errors, hopefully all the same!
benjamin-chan commented 2 years ago

The error scanner didn't pick up any errors operating on one row at a time. Here's my code. I don't do package dev or debugging so maybe I'm error scanning wrong.

f <- file("error.txt", open = "wt")
sink(f, type = "message")
test <- df %>% head(1700)  # First 1700 rows results in Error in file(con, "r") : cannot open the connection
predict_race(test, census.geo = "tract", census.key = key, use.counties = TRUE)  # Verify error message
sink()
error_scanner <-
  purrr::map(1:nrow(test),
             function(i) {
               tryCatch({predict_race(test[i, ], census.geo = "tract", census.key = key, use.counties = TRUE)},
                        error = function(e) e)
             })

Output

> rows <- map(error_scanner, function(x) inherits(x, "error")) %>% unlist() %>$
> length(rows)
[1] 0
> df[rows, ] # will output the problematic rows
 [1] surname             geo_result_category race
 [4] ethnicity           age                 record_id
 [7] state               county              tract
[10] block               sex
<0 rows> (or 0-length row.names)
> error_scanner[rows] %>% unique() # will output a list of errors hopefully  a$
list()
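
Because every single-row call succeeds while the 1700-row call fails, scanning over batch sizes may be more informative than scanning over rows. The following is a minimal sketch under that assumption. `first_failing_size` and `stub` are hypothetical names that are not part of wru, and the stub stands in for predict_race() so the example runs without a census key.

```r
# Hypothetical helper (not part of wru): returns the smallest size in `sizes`
# for which f() raises an error on the first n rows, or NA if none do.
first_failing_size <- function(data, f, sizes) {
  for (n in sort(sizes)) {
    res <- tryCatch(f(utils::head(data, n)), error = function(e) e)
    if (inherits(res, "error")) return(n)
  }
  NA_integer_
}

# Stand-in for predict_race() that fails above 1600 rows, for illustration only
stub <- function(d) {
  if (nrow(d) > 1600) stop("cannot open the connection")
  d
}

demo <- data.frame(id = seq_len(2000))
first_failing_size(demo, stub, sizes = seq(1000, 2000, by = 100))
```

With real data, the stub would be replaced by something like function(d) predict_race(d, census.geo = "tract", census.key = key, use.counties = TRUE), using the same arguments as the calls above.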

Contents of error.txt sink so you can see the call to the API

Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
  cannot open URL 'https://api.census.gov/data/2010/dec/sf1?key=f1eed76ebb3f906330f30e4521c55dbebe54094a&get=P005003,P005004,P005005,P005006,P005007,P005008,P005009,P005010&for=county:005,017,051,005,047,049,071,051,019,051,047,067,033,053,019,005,005,033,029,039,065,047,017,029,051,005,051,039,017,033,003,051,067,059,005,051,033,071,039,029,019,053,005,051,067,067,005,061,047,053,017,067,051,043,051,067,039,051,051,047,043,051,053,053,051,067,065,051,039,047,051,033,047,067,047,005,005,019,051,017,047,033,067,059,029,005,033,039,065,051,051,067,019,067,051,051,067,071,051,043,005,005,029,039,053,051,039,039,071,067,041,047,005,067,051,029,047,051,067,071,043,019,047,057,067,067,047,005,067,005,051,007,053,007,065,033,051,005,051,051,047,029,033,051,047,005,047,005,051,043,005,005,051,017,005,011,019,041,067,067,059,039,067,047,047,051,047,047,051,011,047,035,029,051,039,051,067,051,051,033,065,047,039,017,039,019,043,001,029,019,033,051,005,029,039,059,051,009,051,071,029,047,043,053,00 [... truncated]
Try census server again: https://api.census.gov/data/2010/dec/sf1?
Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
  cannot open URL 'https://api.census.gov/data/2010/dec/sf1?key=f1eed76ebb3f906330f30e4521c55dbebe54094a&get=P005003,P005004,P005005,P005006,P005007,P005008,P005009,P005010&for=county:005,017,051,005,047,049,071,051,019,051,047,067,033,053,019,005,005,033,029,039,065,047,017,029,051,005,051,039,017,033,003,051,067,059,005,051,033,071,039,029,019,053,005,051,067,067,005,061,047,053,017,067,051,043,051,067,039,051,051,047,043,051,053,053,051,067,065,051,039,047,051,033,047,067,047,005,005,019,051,017,047,033,067,059,029,005,033,039,065,051,051,067,019,067,051,051,067,071,051,043,005,005,029,039,053,051,039,039,071,067,041,047,005,067,051,029,047,051,067,071,043,019,047,057,067,067,047,005,067,005,051,007,053,007,065,033,051,005,051,051,047,029,033,051,047,005,047,005,051,043,005,005,051,017,005,011,019,041,067,067,059,039,067,047,047,051,047,047,051,011,047,035,029,051,039,051,067,051,051,033,065,047,039,017,039,019,043,001,029,019,033,051,005,029,039,059,051,009,051,071,029,047,043,053,00 [... truncated]
Try census server again: https://api.census.gov/data/2010/dec/sf1?
Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
  cannot open URL 'https://api.census.gov/data/2010/dec/sf1?key=f1eed76ebb3f906330f30e4521c55dbebe54094a&get=P005003,P005004,P005005,P005006,P005007,P005008,P005009,P005010&for=county:005,017,051,005,047,049,071,051,019,051,047,067,033,053,019,005,005,033,029,039,065,047,017,029,051,005,051,039,017,033,003,051,067,059,005,051,033,071,039,029,019,053,005,051,067,067,005,061,047,053,017,067,051,043,051,067,039,051,051,047,043,051,053,053,051,067,065,051,039,047,051,033,047,067,047,005,005,019,051,017,047,033,067,059,029,005,033,039,065,051,051,067,019,067,051,051,067,071,051,043,005,005,029,039,053,051,039,039,071,067,041,047,005,067,051,029,047,051,067,071,043,019,047,057,067,067,047,005,067,005,051,007,053,007,065,033,051,005,051,051,047,029,033,051,047,005,047,005,051,043,005,005,051,017,005,011,019,041,067,067,059,039,067,047,047,051,047,047,051,011,047,035,029,051,039,051,067,051,051,033,065,047,039,017,039,019,043,001,029,019,033,051,005,029,039,059,051,009,051,071,029,047,043,053,00 [... truncated]
Try census server again: https://api.census.gov/data/2010/dec/sf1?
Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
  cannot open URL 'https://api.census.gov/data/2010/dec/sf1?key=f1eed76ebb3f906330f30e4521c55dbebe54094a&get=P005003,P005004,P005005,P005006,P005007,P005008,P005009,P005010&for=county:005,017,051,005,047,049,071,051,019,051,047,067,033,053,019,005,005,033,029,039,065,047,017,029,051,005,051,039,017,033,003,051,067,059,005,051,033,071,039,029,019,053,005,051,067,067,005,061,047,053,017,067,051,043,051,067,039,051,051,047,043,051,053,053,051,067,065,051,039,047,051,033,047,067,047,005,005,019,051,017,047,033,067,059,029,005,033,039,065,051,051,067,019,067,051,051,067,071,051,043,005,005,029,039,053,051,039,039,071,067,041,047,005,067,051,029,047,051,067,071,043,019,047,057,067,067,047,005,067,005,051,007,053,007,065,033,051,005,051,051,047,029,033,051,047,005,047,005,051,043,005,005,051,017,005,011,019,041,067,067,059,039,067,047,047,051,047,047,051,011,047,035,029,051,039,051,067,051,051,033,065,047,039,017,039,019,043,001,029,019,033,051,005,029,039,059,051,009,051,071,029,047,043,053,00 [... truncated]
Data access failure at the census website, please try again by re-run the previous command
https://api.census.gov/data/2010/dec/sf1?key=f1eed76ebb3f906330f30e4521c55dbebe54094a&get=P005003,P005004,P005005,P005006,P005007,P005008,P005009,P005010&for=county:005,017,051,005,047,049,071,051,019,051,047,067,033,053,019,005,005,033,029,039,065,047,017,029,051,005,051,039,017,033,003,051,067,059,005,051,033,071,039,029,019,053,005,051,067,067,005,061,047,053,017,067,051,043,051,067,039,051,051,047,043,051,053,053,051,067,065,051,039,047,051,033,047,067,047,005,005,019,051,017,047,033,067,059,029,005,033,039,065,051,051,067,019,067,051,051,067,071,051,043,005,005,029,039,053,051,039,039,071,067,041,047,005,067,051,029,047,051,067,071,043,019,047,057,067,067,047,005,067,005,051,007,053,007,065,033,051,005,051,051,047,029,033,051,047,005,047,005,051,043,005,005,051,017,005,011,019,041,067,067,059,039,067,047,047,051,047,047,051,011,047,035,029,051,039,051,067,051,051,033,065,047,039,017,039,019,043,001,029,019,033,051,005,029,039,059,051,009,051,071,029,047,043,053,005,019,017,067,017,051,005,005,051,039,051,029,053,005,005,029,067,007,005,047,051,039,051,037,059,005,039,067,017,009,051,051,005,029,051,029,047,033,059,067,029,033,037,051,051,005,005,051,047,053,011,051,051,067,067,005,033,047,051,035,005,005,017,029,019,029,029,027,071,047,037,039,019,009,019,043,029,033,047,051,047,005,043,005,051,001,029,051,029,047,067,039,047,039,051,065,051,017,039,051,033,067,041,005,047,067,029,033,067,005,051,047,035,043,051,039,039,007,067,051,067,067,051,029,067,041,053,005,067,003,039,047,005,047,005,051,033,071,071,047,029,051,033,029,059,029,033,047,061,033,051,051,051,029,067,005,067,005,039,051,017,051,059,067,067,033,029,017,067,067,007,029,005,005,051,029,029,047,045,039,051,047,067,067,029,051,051,029,029,039,051,051,051,029,029,005,071,051,051,051,067,051,051,051,047,067,005,033,029,017,067,005,039,005,051,051,059,005,023,011,011,043,065,019,005,029,051,067,041,047,051,011,051,029,005,071,057,029,009,039,053,039,051,039,039,039,047,043,037,029,051,067,033,039,033
,049,005,047,039,047,039,067,017,005,029,067,051,051,005,051,067,051,051,051,029,029,071,029,029,029,039,051,051,039,005,071,011,047,071,049,029,047,019,051,039,061,051,051,051,005,029,005,051,005,051,051,005,005,051,051,051,071,007,029,029,039,039,039,039,029,019,005,041,067,051,039,051,057,039,051,035,049,051,019,047,051,071,047,067,005,051,005,039,071,029,047,051,039,039,039,051,067,067,029,029,067,071,071,005,067,051,071,039,051,029,051,051,067,071,051,067,067,005,005,067,067,005,071,067,067,067,029,067,067,067,065,067,067,005,067,067,029,005,067,057,067,067,067,005,039,039,033,039,051,067,037,015,071,047,051,005,067,067,005,071,005,005,067,065,051,067,067,067,005,051,067,067,051,051,043,003,047,005,051,051,051,051,051,067,067,051,067,067,051,051,067,071,051,051,005,067,067,067,067,067,067,067,005,067,071,051,051,067,007,005,051,051,005,067,039,029,041,059,059,071,071,043,071,023,005,071,071,023,047,019,023,021,039,043,071,051,039,067,043,039,039,071,005,047,017,033,013,039,037,047,029,009,051,039,051,039,067,043,059,011,011,067,067,051,067,005,051,067,067,071,065,071,067,029,051,005,067,005,005,067,029,067,005,067,005,051,029,029,039,039,067,007,051,039,039,011,011,043,019,023,023,047,029,023,049,059,039,051,037,005,029,051,047,051,023,021,059,015,067,067,071,023,071,015,059,057,059,003,051,017,051,067,059,005,029,059,059,023,023,059,019,067,011,069,005,071,047,047,019,019,051,071,067,059,059,071,015,023,059,051,051,039,029,039,039,039,039,047,017,019,005,053,047,005,039,039,051,067,005,051,051,033,051,051,051,051,051,051,005,039,039,051,051,029,051,051,051,005,067,039,039,051,051,051,051,039,009,009,067,009,039,051,039,039,039,071,051,005,005,051,009,043,051,051,041,005,051,067,005,067,051,067,051,067,067,005,051,005,005,047,051,005,005,051,051,067,005,005,005,051,005,051,005,051,005,067,067,067,005,051,039,067,067,017,051,051,067,051,051,005,067,067,067,029,005,067,067,067,005,067,067,067,051,067,067,067,051,039,005,007,051,005,051,039,033,067,029,051,051,051
,051,051,051,067,039,029,051,051,039,009,051,039,051,051,039,005,039,051,005,005,005,067,005,005,067,009,051,051,051,005,051,005,051,067,005,005,051,005,051,067,067,005,067,005,005,051,051,067,005,051,051,051,071,005,005,051,053,007,067,067,007,005,059,039,033,039,039,047,071,035,005,039,051,051,039,071,005,051,005,019,039,047,005,005,039,039,005,039,039,033,005,029,027,051,005,047,047,057,039,015,051,009,007,051,067,047,019,013,003,011,051,051,005,071,067,039,039,033,005,071,029,043,005,051,053,057,067,067,015,011,039,067,051,051,039,051,067,047,009,047,015,047,039,003,017,015,029,043,011,039,053,047,011,029,043,043,015,029,041,067,047,067,047,015,015,067,029,071,029,067,011,067,053,039,051,005,005,017,051,067,051,039,067,051,039,039,009,011,051,051,029,039,051,039,001,051,043,005,005,005,005,051,047,043,043,067,051,051,005,071,067,053,051,067,005,005,067,067,005,051,053,051,005,067,067,005,067,047,047,005,067,067,067,005,051,005,051,067,051,067,067,067,051,067,067,051,067,067,041,067,067,029,067,067,067,067,005,067,051,005,067,067,005,067,047,067,029,067,005,047,005,013,067,051,005,005,047,051,051,035,067,067,047,051,033,033,051,067,017,015,047,051,041,005,005,051,051,051,029,071,043,067,067,051,051,051,067,067,067,067,005,005,067,067,067,005,067,067,067,051,067,029,067,005,067,067,067,051,067,067,005,067,051,067,051,051,039,039,017,007,029,039,039,039,039,067,039,039,051,067,051,067,039,067,009,009,009,067,051,039,051,017,067,051,005,005,029,039,051,051,051,039,067,051,039,051,067,051,051,009,051,071,051,067,005,051,067,051,005,047,005,005,005,051,005,005,051,005,067,051,005,051,005,051,005,005,005,005,051,005,005,071,067,067,051,067,039,051,033,005,051,039,039,039,067,005,007,005,015,047,051,059,011,005,067,027,047,051,005,011,051,067,005,003,053,053,051,005,005,051,067,067,001,033,033,029,029,029,005,005,047,059,057,005,067,067,047,015,017,047,047,005,017,067,051,015,067,067,067,067,039,039,051,039,051,005,047,007,047,067,005,051,051,051,015,013,011,071,067,039
,011,031,029,015,067,067,067,051,005,067,067,005,051,005,067,067,067,067,067,051,015,039,011,067,067,011,011,051,005,005,005,005,067,007,047,047,039,039,003,019,047,033,067,047,015,015,005,051,005,019,017,039,003,005,067,047,047,011,053,003,011,067,051,051,005,067,047,033,039,011,005,057,011,065,051,051,015,065,065,047,051,067,053,051,011,011,029,071,067,067,051,051,053,029,071,067,047,029,053,017,059,005,041,039,039,067,001,059,057,039,033,067,005,051,015,019,019,071,005,005,051,051,011,053,029,033,005,029,053,067,015,067,029,043,015,015,033,033,047,047,005,005,015,059,059,067,047,033,039,033,029,051,067,067,067,005,067,067,067,051,051,067,067,067,005,051,051,067,051,067,067,029,071,011,011,011,011,065,051,067,011,011,005,011,067,011,051,067,039,011,029,011,043,067,031,001,051,003,029,067,051,033,067,067,067,067,051,011,011,039,039,071,029,005,053,053,053,015,023,049,043,005,005,067,009,009,067,051,051,067,067,067,005,051,029,005,005,051,051,067,067&in=state:41
Error in get_census_api_2(data_url, key, get, region, retry) : 
Warning message:
In sink() : no sink to remove

And here's the counties I have in test

> test %>% pull(county) %>% levels()
 [1] "001" "003" "005" "007" "009" "011" "013" "015" "017" "019" "021" "023"
[13] "025" "027" "029" "031" "033" "035" "037" "039" "041" "043" "045" "047"
[25] "049" "051" "053" "055" "057" "059" "061" "063" "065" "067" "069" "071"
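
One thing the log makes visible is that the `for=county:` clause repeats a county code for every row, even though test contains at most 36 distinct counties. The sketch below uses made-up data (not the real voter file) to show how much shorter the same clause becomes when built from distinct codes. It is only an illustration of the arithmetic; the thread does not say whether this is what the eventual fix changed.

```r
# Illustration only: a made-up county column with heavy repetition,
# similar in shape to the logged request (1700 rows, 36 possible codes)
set.seed(1)
codes  <- sprintf("%03d", seq(1, 71, by = 2))
county <- sample(codes, 1700, replace = TRUE)

# Query fragment built from every row versus from distinct codes only
per_row  <- paste0("for=county:", paste(county, collapse = ","))
distinct <- paste0("for=county:", paste(sort(unique(county)), collapse = ","))

nchar(per_row)   # grows with the number of rows
nchar(distinct)  # bounded by the number of distinct counties
```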
1beb commented 2 years ago

Can you send me a failing sample of your voter.file data to brandon@bertelsen.ca? I'll try it out.

1beb commented 2 years ago

@benjamin-chan I think we have it sorted. A sneaky little issue: your counties aren't zero-padded. This is based on the test data that you sent me.

library(wru)

predict_race(
  voter.file = read.csv('~/Downloads/test.csv'), # reading your file in straight
  census.geo = "county"
)

# resulting error message
"
Error in census_helper_new(key = census.key, voter.file = voter.file, : 
The following locations in the voter.file are not available in the census 
data (listed as state-county): OR-1, OR-3, OR-5, OR-7, OR-9, OR-11, OR-13, 
OR-15, OR-17, OR-19, OR-21, OR-23, OR-27, OR-29, OR-33, OR-35, OR-37, 
OR-39, OR-41, OR-43, OR-45, OR-47, OR-49, OR-51, OR-53, OR-57, OR-59, 
OR-61, OR-65, OR-67, OR-69, OR-71
"

vf <- read.csv("~/Downloads/test.csv")
vf$county <- formatC(vf$county, width = 3, flag = "0") # left-pad county codes with zeros to width 3

predict_race(
  voter.file = vf,
  census.geo = "county"
)

# successful output! 

     surname state county  tract  pred.whi    pred.bla    pred.his     pred.asi   pred.oth
545    SMITH    OR    039   2102 0.9154447 0.016074746 0.010523511 0.0027175710 0.05523950
329    SMITH    OR    017   1400 0.9481373 0.006005211 0.010422092 0.0010769548 0.03435848
523    SMITH    OR    033 360500 0.9385888 0.006368911 0.008798114 0.0009972384 0.04524691
5      SMITH    OR    003    900 0.9216835 0.015526467 0.009213403 0.0058890789 0.04768756
478    SMITH    OR    029    800 0.9260089 0.011265711 0.015494511 0.0015503510 0.04568055
...

I should note too, there was no requirement for factoring anything in the vf object.
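
A quick way to catch this kind of problem before calling predict_race() is to check code widths after reading the file. The helper below is a hypothetical sketch, not part of wru. It assumes the codes arrive as integer or character columns (as read.csv produces) and that county codes should be 3 characters wide, as in the head() output earlier in this thread.

```r
# Hypothetical helper (not part of wru): left-pad codes with zeros to `width`.
# Assumes integer or character input; missing values are left as NA.
pad_fips <- function(x, width) {
  x <- trimws(as.character(x))
  ok <- !is.na(x)
  x[ok] <- paste0(strrep("0", pmax(0L, width - nchar(x[ok]))), x[ok])
  x
}

# Made-up example mimicking what read.csv does to zero-padded county codes
vf_demo <- data.frame(county = c(1L, 39L, 51L, 67L))

# Flag codes that are not 3 characters wide, then repair them
bad <- nchar(as.character(vf_demo$county)) != 3
vf_demo$county <- pad_fips(vf_demo$county, width = 3)
vf_demo$county
```

Reading the file with read.csv(..., colClasses = c(county = "character")) would avoid losing the leading zeros in the first place, provided the CSV itself still contains them.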

1beb commented 2 years ago

I found it. Submitting a PR shortly.

1beb commented 2 years ago

Hi @benjamin-chan thank you for working with me on this. Can you try again after installing the dev branch:

remotes::install_github("kosukeimai/wru", ref = "issue_72")

I was able to reproduce your issue.

benjamin-chan commented 2 years ago

The dev version wru_1.0.0010 solved the issue. Tested on the test.csv and my full 800K row data set. Thanks for working on this issue.