kosukeimai / wru

Who Are You? Bayesian Prediction of Racial Category Using Surname and Geolocation

Return as many as it can without breaking #100

Closed Chris-Larkin closed 12 months ago

Chris-Larkin commented 1 year ago

Hi, I love the idea of this package, although I have had trouble getting it to work. When I pass my data to the function, it hits an error at some point and the entire operation fails. It would be great if wru::predict_race() could return as many predicted rows as possible and leave those it can't predict (for whatever reason) blank. See the data and code below for a minimal reproducible example:

df <- tibble::tribble(
      ~surname, ~state, ~county, ~tract, ~party_code,
   "ALEXANDER",   "OR",     71L, 30101L,       "NAV",
     "AQUIPEL",   "OR",     71L, 30101L,       "NAV",
     "BABBITT",   "OR",     71L, 30101L,       "NAV",
      "BACKUS",   "OR",     71L, 30101L,       "DEM",
      "BACKUS",   "OR",     71L, 30101L,       "DEM",
      "BARKER",   "OR",     71L, 30101L,       "DEM",
     "BARTMAN",   "OR",     71L, 30303L,       "REP",
     "BARTMAN",   "OR",     71L, 30303L,       "REP",
        "BASS",   "OR",     71L, 30101L,       "DEM",
   "BATTERMAN",   "OR",     71L, 30303L,       "NAV",
   "BATTERMAN",   "OR",     71L, 30303L,       "NAV",
     "BEARDEN",   "OR",     71L, 30101L,       "NAV",
    "BELANDER",   "OR",     71L, 30101L,       "NAV",
        "BELL",   "OR",     71L, 30303L,       "NAV",
         "BEM",   "OR",     71L, 30101L,       "NAV",
     "BENNETT",   "OR",     71L, 30102L,       "NAV",
        "BERG",   "OR",     71L, 30101L,       "NAV",
      "BERGER",   "OR",     71L, 30303L,       "NAV",
      "BESEAU",   "OR",     71L, 30303L,       "NAV",
      "BIERER",   "OR",     71L, 30101L,       "IND",
    "BILLETTE",   "OR",     71L, 30303L,       "IND",
    "BISCHOFF",   "OR",     71L, 30101L,       "NAV",
       "BLATT",   "OR",     71L, 30101L,       "NAV",
     "BOCHART",   "OR",     71L, 30101L,       "NAV",
      "BOWLIN",   "OR",     71L, 30202L,       "NAV",
     "BURGESS",   "OR",     71L, 30303L,       "NAV",
     "BURNETT",   "OR",     71L, 30101L,       "NAV",
     "BURNETT",   "OR",     71L, 30101L,       "NAV",
    "BYE ODEA",   "OR",     71L, 30101L,       "NAV",
    "BYINGTON",   "OR",     71L, 30101L,       "NAV",
     "CARSLEY",   "OR",     71L, 30102L,       "NAV",
  "CARTWRIGHT",   "OR",     71L, 30101L,       "NAV",
       "CATES",   "OR",     71L, 30101L,       "NAV",
    "CHANDLER",   "OR",     71L, 30101L,       "NAV",
    "CHESHIER",   "OR",     71L, 30102L,       "NAV",
    "CISNEROS",   "OR",     71L, 30303L,       "NAV",
         "COE",   "OR",     71L, 30101L,       "NAV",
      "CORREA",   "OR",     71L, 30303L,       "NAV",
      "COSHOW",   "OR",     71L, 30101L,       "NAV",
    "COURTNEY",   "OR",     71L, 30101L,       "NAV",
       "CROFT",   "OR",     71L, 30101L,       "NAV",
   "CROSSLAND",   "OR",     71L, 30101L,       "NAV",
        "CRUZ",   "OR",     71L, 30102L,       "NAV",
     "CULLENS",   "OR",     71L, 30101L,       "NAV",
     "CURRIER",   "OR",     71L, 30101L,       "NAV",
       "DAHME",   "OR",     71L, 30303L,       "DEM",
       "DAHME",   "OR",     71L, 30303L,       "DEM",
       "DAVIS",   "OR",     71L, 30303L,       "NAV",
       "DAVIS",   "OR",     71L, 30101L,       "NAV",
      "DEHART",   "OR",     71L, 30303L,       "NAV",
      "DENMAN",   "OR",     71L, 30101L,       "NAV",
      "DENNIS",   "OR",     71L, 30101L,       "NAV",
   "DILLESHAW",   "OR",     71L, 30101L,       "NAV",
     "DOOTSON",   "OR",     71L, 30101L,       "NAV",
        "EIDE",   "OR",     71L, 30101L,       "NAV",
      "EILERS",   "OR",     71L, 30101L,       "NAV",
       "EKREN",   "OR",     71L, 30101L,       "DEM",
       "ELLIS",   "OR",     71L, 30101L,       "NAV",
    "ERICKSON",   "OR",     71L, 30101L,       "NAV",
    "ESKELSEN",   "OR",     71L, 30101L,       "NAV",
       "EVANS",   "OR",     71L, 30303L,       "NAV",
      "FETTIG",   "OR",     71L, 30102L,       "NAV",
     "FINDLEY",   "OR",     71L, 30102L,       "NAV",
    "FLANAGAN",   "OR",     71L, 30101L,       "DEM",
"FRAYCHINEAUD",   "OR",     71L, 30102L,       "NAV",
        "FREY",   "OR",     71L, 30101L,       "NAV"
)

Which I feed into:

dropme <- wru::predict_race(
  voter.file = df,
  census.geo = "tract",
  census.key = census_api_key, #i have changed this to my key that I know works, e.g. by making calls via `tidycensus`
  party = "party_code")

Which returns this output/error:

County 1 of 1: 071
Proceeding with last name predictions...
ℹ All local files already up-to-date!
Proceeding with Census geographic data at tract level...
Using Census geographic data from provided census.data object...
State 1 of 1: OR
Error in census_helper_new(key = census.key, voter.file = voter.file,  : 
  The following locations in the voter.file are not available in the census data (listed as state-county-tract):
OR-071-030303

I haven't dug much into this specific tract, but it seems to me that a more sensible behavior would be to return predictions for the geographies where that is possible and leave the remaining rows without a prediction. The current default seems to be that if one row doesn't match, I can't get predictions for any row. Since there's no way of knowing which rows will match before calling wru, this leaves the user in the unenviable position of having to attempt each row individually.
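A possible stopgap, short of going row by row, is to call the predictor once per state-county-tract group and catch the error for any group that fails. The sketch below is only an illustration under stated assumptions: `predict_by_tract()` and `mock_predict()` are hypothetical names, not part of wru, and the mock stands in for `wru::predict_race()` so the example runs without a Census key.

```r
# Hypothetical helper (not part of wru): run a prediction function one
# geography at a time, so a failure in one tract does not discard the rest.
predict_by_tract <- function(voter_file, predict_fn, ...) {
  voter_file <- as.data.frame(voter_file)

  # One group per unique state-county-tract combination.
  key <- paste(voter_file$state, voter_file$county, voter_file$tract, sep = "-")
  chunks <- split(voter_file, key)

  results <- lapply(names(chunks), function(k) {
    tryCatch(
      as.data.frame(predict_fn(voter.file = chunks[[k]], ...)),
      error = function(e) {
        message("No predictions for ", k, ": ", conditionMessage(e))
        chunks[[k]]  # keep the rows, without prediction columns
      }
    )
  })

  # Pad failed chunks with NA for any prediction columns they lack.
  all_cols <- unique(unlist(lapply(results, names)))
  results <- lapply(results, function(x) {
    for (col in setdiff(all_cols, names(x))) x[[col]] <- NA
    x[all_cols]
  })

  out <- do.call(rbind, results)
  rownames(out) <- NULL
  out
}

# Stand-in predictor used only to demonstrate the helper (an assumption,
# not wru behaviour); it fails on tract 30303.
mock_predict <- function(voter.file, ...) {
  if (any(voter.file$tract == 30303L)) stop("tract not available in census data")
  voter.file$pred.whi <- 0.5  # placeholder value, not a real prediction
  voter.file
}

demo <- data.frame(
  surname = c("ALEXANDER", "BARTMAN", "BENNETT"),
  state   = "OR",
  county  = 71L,
  tract   = c(30101L, 30303L, 30102L),
  stringsAsFactors = FALSE
)
partial <- predict_by_tract(demo, mock_predict)
```

With the real package, the call would presumably look like `predict_by_tract(df, wru::predict_race, census.geo = "tract", census.key = census_api_key, party = "party_code")`. Rows come back grouped by geography rather than in their original order, and repeated calls may be slow if each one fetches Census data again.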

Just some thoughts! I love the package and hope I can get it to work for my study. Thank you

1beb commented 1 year ago

@mdblocker

I'd like to work on this one with you. I also want to include an option to return unmatched rows instead of failing hard.

1beb commented 12 months ago

This is now in the dev branch and will be released to CRAN in 3.0. See #120

Chris-Larkin commented 12 months ago

Amazing @1beb!