hrecht / censusapi

R package to retrieve U.S. Census data and metadata via API
https://www.hrecht.com/censusapi/

Bug report: Incorrect estimates reported when querying a long variable list #82

Closed vikjam closed 2 years ago

vikjam commented 2 years ago

Describe the bug

I think that censusapi is reporting the wrong values when querying a lot of variables at the same time.

To Reproduce

This snippet returns the correct values:

library(dplyr)
library(censusapi)

getCensus(
        name = "acs/acs5",
        vintage = "2014",
        vars = c("B01001H_007E", "B01001H_022E"),
        region = "zip code tabulation area:*",
        regionin = "state:*"
    ) %>%
    filter(zip_code_tabulation_area == "07980")
#>   state zip_code_tabulation_area B01001H_007E B01001H_022E
#> 1    34                    07980           15            0

However, when I construct a long variable list, I get incorrect values:

library(dplyr)
library(purrr)
library(glue)
library(tidyr)
library(stringr)
library(censusapi)

subtables <- LETTERS[1:9]
male_tables <- 7:16
female_tables <- 22:31
age_cats <- c(male_tables, female_tables)
sex_by_age_race_tables <- subtables %>%
  map_chr(~ glue("B01001{subtable}", subtable = .x))
sex_by_age_race_tables_df <- expand_grid(
    x = sex_by_age_race_tables,
    y = age_cats
  ) %>%
  mutate(
    y_pad = str_pad(y, 3, pad = "0"),
    table_name = glue("{x}_{y_pad}E")
  ) %>%
  select(!y_pad)
sex_by_age_race_tables_all <- sex_by_age_race_tables_df %>%
  pull(table_name)

getCensus(
        name = "acs/acs5",
        vintage = "2014",
        vars =  sex_by_age_race_tables_all,
        region = "zip code tabulation area:*",
        regionin = "state:*"
    ) %>%
    filter(zip_code_tabulation_area == "07980") %>%
    select(state, zip_code_tabulation_area, B01001H_007E, B01001H_022E)
#>   state zip_code_tabulation_area B01001H_007E B01001H_022E
#> 1    34                    07980          141          187

In particular, the threshold seems to be around 80 variables, because this call with 79 variables works:

getCensus(
        name = "acs/acs5",
        vintage = "2014",
        vars =  sex_by_age_race_tables_all[102:180],
        region = "zip code tabulation area:*",
        regionin = "state:*"
    ) %>%
    filter(zip_code_tabulation_area == "07980") %>%
    select(state, zip_code_tabulation_area, B01001H_007E, B01001H_022E)
#>   state zip_code_tabulation_area B01001H_007E B01001H_022E
#> 1    34                    07980           15            0

But this one, with 80 variables, doesn't:

getCensus(
        name = "acs/acs5",
        vintage = "2014",
        vars =  sex_by_age_race_tables_all[101:180],
        region = "zip code tabulation area:*",
        regionin = "state:*"
    ) %>%
    filter(zip_code_tabulation_area == "07980") %>%
    select(state, zip_code_tabulation_area, B01001H_007E, B01001H_022E)
#>   state zip_code_tabulation_area B01001H_007E B01001H_022E
#> 1    34                    07980           15            2

Expected behavior

I expect the following results based on this table, where B01001H_007E = Males 18 and 19 and B01001H_022E = Females 18 and 19.

#>   state zip_code_tabulation_area B01001H_007E B01001H_022E
#> 1    34                    07980           15            0

R session information:

Additional context

For now, I just sliced my variable list into smaller chunks and then combined the results, which seems to work.
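
The workaround above could be sketched roughly as follows. This is only a minimal sketch under stated assumptions, not the package's fix: get_census_chunked and fetch_acs are hypothetical helpers that are not part of censusapi, and the chunk size of 40 is an arbitrary choice meant to stay well below the threshold observed above. The pieces are joined on the geography columns instead of being bound side by side, so a different row order between calls should not misalign values.

```r
# Hypothetical helper (not part of censusapi): split a long variable list
# into chunks, fetch each chunk separately, and join the pieces on the
# geography columns.
get_census_chunked <- function(vars, fetch, geo_cols, chunk_size = 40) {
  # Assign each variable to a consecutive chunk of at most chunk_size names
  chunk_id <- ceiling(seq_along(vars) / chunk_size)
  chunks <- split(vars, chunk_id)

  # fetch() must return a data frame that contains geo_cols
  pieces <- lapply(chunks, fetch)

  # Join by key rather than by position
  Reduce(function(x, y) merge(x, y, by = geo_cols), pieces)
}

# Hypothetical wrapper around the call from this report (needs an API key)
fetch_acs <- function(v) {
  censusapi::getCensus(
    name = "acs/acs5",
    vintage = "2014",
    vars = v,
    region = "zip code tabulation area:*",
    regionin = "state:*"
  )
}

# Usage, assuming sex_by_age_race_tables_all from the snippet above:
# result <- get_census_chunked(
#   sex_by_age_race_tables_all,
#   fetch_acs,
#   geo_cols = c("state", "zip_code_tabulation_area")
# )
```

Passing the fetch function as an argument keeps the chunking logic testable without network access.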

Thanks so much for creating and maintaining this package!

mfherman commented 2 years ago

Haven’t had a chance to inspect this closely, but is it perhaps similar to this issue in tidycensus https://github.com/walkerke/tidycensus/pull/165?

vikjam commented 2 years ago

@mfherman Yes! I think it's related to that issue. I think @ottothecow submitted pull request #73 to fix this. I haven't tried running this with their forked version.

hrecht commented 2 years ago

Thanks for moving this to GitHub, and thanks @mfherman for flagging the tidycensus issue. Ugh, these APIs, so inconsistent. In the meantime I recommend using the built-in group() variable calls (for example, vars = "group(B01001)") or splitting your call into multiple parts. I'll address this as soon as I can, but it may be a few weeks.

hrecht commented 2 years ago

Thanks to all who have flagged this bug and suggested fixes, and for your patience. I hadn't been able to replicate the issue until the reproducible example was raised here, and I'm sorry it took this long to sort out.

I don't think it was an issue back when the package was originally built in 2015-16 — I think that somewhere along the line Census changed the sort order logic and this popped up as a consequence. (Along with lots of other things that have changed over the years!)
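
To illustrate how a change in column order could produce wrong values, here is a small sketch. It assumes, without confirmation from this thread or the commit, that column names were being assigned from the order of the requested variables. The data are made up (the values 15 and 0 are borrowed from the report above), and the sketch is not the package's actual code. It relies only on the fact that the Census API returns a header row as the first row of its response.

```r
# Made-up parsed response: the first row is the header sent back by the API,
# and the columns arrive in a different order than they were requested
requested <- c("B01001H_022E", "B01001H_007E")
raw <- rbind(
  c("B01001H_007E", "B01001H_022E", "state", "zip code tabulation area"),
  c("15", "0", "34", "07980")
)

# Fragile: label the columns by the order of the request
by_position <- as.data.frame(raw[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(by_position) <- c(requested, "state", "zip code tabulation area")

# Safer: label the columns from the header row of the response itself
by_header <- as.data.frame(raw[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(by_header) <- raw[1, ]

by_position$B01001H_007E  # "0", which is really the B01001H_022E value
by_header$B01001H_007E    # "15"
```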

This issue should be fixed in https://github.com/hrecht/censusapi/commit/76bd30e30f038ee8f28d62a8600f91c0f7297226 slated for the new v0.8.0 release (#85). I'll be doing a bunch more testing before pushing this out, though. If you want to try it out, you can install the latest version using the devtools package with devtools::install_github("hrecht/censusapi"). If you do, let me know whether you run into any issues or whether it's working as expected.

hrecht commented 2 years ago

Here's a test that's working as expected, in addition to your example:

library(dplyr)
library(censusapi)
group_B01001 <- listCensusMetadata(
    name = "acs/acs5",
    vintage = 2017,
    type = "variables",
    group = "B01001")

acs_pop_group <- getCensus(
    name = "acs/acs5",
    vintage = 2017,
    vars = "group(B01001)",
    region = "tract:*",
    regionin = "state:02")

acs_pop_manual <- getCensus(
    name = "acs/acs5",
    vintage = 2017,
    vars = c("NAME", "GEO_ID", group_B01001$name),
    region = "tract:*",
    regionin = "state:02")

all_equal(acs_pop_manual, acs_pop_group)
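
dplyr::all_equal() has since been deprecated (as of dplyr 1.1.0), so a similar check could be sketched in base R as below. same_data is a hypothetical helper, not part of dplyr or censusapi. It assumes both data frames have unique column names and columns of sortable types, and it treats row order and column order as irrelevant.

```r
# Hypothetical helper: TRUE when two data frames hold the same values,
# regardless of row order or column order
same_data <- function(a, b) {
  if (!setequal(names(a), names(b)) || nrow(a) != nrow(b)) {
    return(FALSE)
  }

  # Put the columns in the same order
  cols <- sort(names(a))
  a <- a[, cols, drop = FALSE]
  b <- b[, cols, drop = FALSE]

  # Sort rows by every column; unname() keeps column names from being
  # mistaken for named arguments of order()
  a <- a[do.call(order, unname(as.list(a))), , drop = FALSE]
  b <- b[do.call(order, unname(as.list(b))), , drop = FALSE]
  rownames(a) <- NULL
  rownames(b) <- NULL

  isTRUE(all.equal(a, b))
}

# Usage with the objects from the test above:
# same_data(acs_pop_manual, acs_pop_group)
```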

hrecht commented 2 years ago

This appears to be working as expected in v0.8.0 so I'm going to close the issue - if you notice any further issues with data binding please comment back here. Will be on CRAN soon.