walkerke / tidycensus

Load US Census boundary and attribute data as 'tidyverse' and 'sf'-ready data frames in R
https://walker-data.com/tidycensus

Updating `get_estimates()` for 2020 and later #423

Closed walkerke closed 1 year ago

walkerke commented 2 years ago

@mfherman @szimmer or anyone else who has time to put eyes on it -

Census released the new 2021 Population Estimates yesterday for states (and larger geographies). The only "product" they've released is the Population product, which includes mid-year population and population density as in previous years as well as a bunch of new change-over-time variables.

The wrinkle is that, for a variety of reasons (the pandemic most significantly), Census did not release 2020 estimates last year. Instead, they've bundled those estimates in with the 2021 PEP and added a year suffix to distinguish 2020 and 2021 figures. The suffixed variable names will not work with the CRAN version of tidycensus because we check for valid variables (see https://github.com/walkerke/tidycensus/blob/master/R/load_data.R#L543-L550).

I'm working on a fix for this at https://github.com/walkerke/tidycensus/tree/estimates2021. It's pretty crude at the moment but works like:

library(tidycensus)

get_estimates(
  geography = "state",
  year = 2021,
  product = "population"
)
# A tibble: 104 × 4
   NAME                 GEOID variable   value
   <chr>                <chr> <chr>      <dbl>
 1 Oklahoma             40    POP_2021 3986639
 2 Nebraska             31    POP_2021 1963692
 3 Hawaii               15    POP_2021 1441553
 4 South Dakota         46    POP_2021  895376
 5 Tennessee            47    POP_2021 6975218
 6 Nevada               32    POP_2021 3143991
 7 New Mexico           35    POP_2021 2115877
 8 Iowa                 19    POP_2021 3193079
 9 Kansas               20    POP_2021 2934582
10 District of Columbia 11    POP_2021  670050
# … with 94 more rows

output = "wide" works as expected as well. In this approach, product = "population" only returns mid-year population and population density as it does in previous years. I've relaxed the variable restrictions for 2020 and 2021 to allow users to request any of the other change-rate variables individually.

get_estimates(
  geography = "state",
  variables = "PPOPCHG_2021",
  year = 2021
)
# A tibble: 52 × 4
   NAME                 GEOID variable       value
   <chr>                <chr> <chr>          <dbl>
 1 Oklahoma             40    PPOPCHG_2021  0.621 
 2 Nebraska             31    PPOPCHG_2021  0.114 
 3 Hawaii               15    PPOPCHG_2021 -0.713 
 4 South Dakota         46    PPOPCHG_2021  0.933 
 5 Tennessee            47    PPOPCHG_2021  0.796 
 6 Nevada               32    PPOPCHG_2021  0.961 
 7 New Mexico           35    PPOPCHG_2021 -0.0798
 8 Iowa                 19    PPOPCHG_2021  0.138 
 9 Kansas               20    PPOPCHG_2021 -0.0442
10 District of Columbia 11    PPOPCHG_2021 -2.90  
# … with 42 more rows

There are two issues with this approach:

  1. product = "population" typically returns only mid-year population and density. The new estimates have more variables available. Should product = "population" return all variables, or maintain consistency with previous years?
  2. Because Census included both 2020 and 2021 in the 2021 PEP, we now get a year suffix in the variable name. This breaks consistency with previous years, and Census will likely (I think?) revert to the old naming in the 2022 PEP.

My lean is to address the first point by bundling all the available variables into the product (which I don't yet do), and the second point by parsing the year argument when supplied and cleaning up the result to maintain consistency with previous years. It would look like this:

get_estimates(
  geography = "state",
  year = 2021,
  variables = "POP"
)
# A tibble: 52 × 4
   NAME                 GEOID variable   value
   <chr>                <chr> <chr>      <dbl>
 1 Oklahoma             40    POP      3986639
 2 Nebraska             31    POP      1963692
 3 Hawaii               15    POP      1441553
 4 South Dakota         46    POP       895376
 5 Tennessee            47    POP      6975218
 6 Nevada               32    POP      3143991
 7 New Mexico           35    POP      2115877
 8 Iowa                 19    POP      3193079
 9 Kansas               20    POP      2934582
10 District of Columbia 11    POP       670050
# … with 42 more rows

Internally, POP_2021 is inferred from the year argument and fetched from the API. Users who supply POP_2021 directly would not get an error, but we'd still clean up the returned result.
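To make the suffix handling concrete, here is a minimal sketch in base R. The helper names `add_year_suffix()` and `strip_year_suffix()` are hypothetical and are not the actual tidycensus internals; they only illustrate appending the year before the request and stripping it from the result.

```r
# Hypothetical helpers for illustration; not the actual tidycensus internals.

# Append the year suffix used by the 2021 PEP, unless the user
# already supplied a suffixed name such as "POP_2021".
add_year_suffix <- function(variables, year) {
  has_suffix <- grepl("_[0-9]{4}$", variables)
  ifelse(has_suffix, variables, paste0(variables, "_", year))
}

# Strip a trailing "_YYYY" so the returned variable names match
# the naming used for earlier vintages.
strip_year_suffix <- function(variables) {
  sub("_[0-9]{4}$", "", variables)
}

# Names that would be sent to the API
requested <- add_year_suffix(c("POP", "PPOPCHG_2021"), year = 2021)
print(requested)

# Names that would appear in the cleaned-up result
print(strip_year_suffix(requested))
```

One caveat with this approach: if a result contained both 2020 and 2021 figures, stripping the suffix would make them indistinguishable, so the year would probably need to be kept in a separate column in that case.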

Any comments are welcome - I'd rather take time to get this right than rush it out (though anyone is welcome to install from the estimates2021 branch and grab the estimates right away!).

RickPack commented 2 years ago

I was exploring to see if I can help. Running the example using the repo at https://github.com/walkerke/tidycensus/tree/estimates2021 yielded a "variable POP not found" error. I do see the variables referenced above, like POP_2021, listed at https://api.census.gov/data/2021/pep/population/variables.html.

state_pop <- 
  get_estimates(
    geography = "state",
    year = 2021,
    product = "population"
  )
Error: Your API call has errors.  The API message returned is error: error: unknown variable 'POP'.

walkerke commented 2 years ago

@RickPack thanks for taking a look! Did you install with remotes::install_github("walkerke/tidycensus@estimates2021")? That code is working on my end when using tidycensus from that branch.

You're correct about the variable names - right now this should work:

library(tidycensus)

get_estimates(
  geography = "state",
  year = 2021,
  variables = "POP_2021"
)
# A tibble: 52 × 4
   NAME                 GEOID variable   value
   <chr>                <chr> <chr>      <dbl>
 1 Oklahoma             40    POP_2021 3986639
 2 Nebraska             31    POP_2021 1963692
 3 Hawaii               15    POP_2021 1441553
 4 South Dakota         46    POP_2021  895376
 5 Tennessee            47    POP_2021 6975218
 6 Nevada               32    POP_2021 3143991
 7 New Mexico           35    POP_2021 2115877
 8 Iowa                 19    POP_2021 3193079
 9 Kansas               20    POP_2021 2934582
10 District of Columbia 11    POP_2021  670050
# … with 42 more rows

RickPack commented 2 years ago

Great, that worked. Thank you!

I like the way you are leaning. However, I do not know whether returning all the variables when product = "population" is used would have any negative consequences. I agree that stripping the suffix makes sense, given that the Census API may go back to omitting it.

tabauer23 commented 2 years ago

I would like to poke this. I am attempting to grab data for the 2021 vintage year with the following:

NH_population_2021 <- get_estimates(
  geography = "county",
  product = "characteristics",
  breakdown = c("AGEGROUP", "SEX", "HISP", "RACE"),
  breakdown_labels = TRUE,
  year = 2021,
  time_series = TRUE,
  state = 33,
  output = "tidy",
  show_call = TRUE
)

This returns an error; however, when I go to the FTP site I can find the file here:

https://www2.census.gov/programs-surveys/popest/datasets/2020-2021/counties/asrh/

cc-est2021-alldata-33.csv

Is there any information on when the API call might be updated to support this? The current error message is:

"Error: At this time, the only available geographies for 2020 and 2021 population estimates are 'us', 'region', 'division', and 'state'."

If you are looking for assistance with this, I can help with a pull request and edits to the code.

walkerke commented 2 years ago

I've contacted Census about this and they've told me that they will put the 2021 estimates on the API, but there is no timeline for completing that project. So I'm kind of in a holding pattern at the moment. I have mulled over pulling the flat files and cleaning them up/returning them but that may be overkill if the data will end up on the API eventually.

I'd absolutely look at a pull request to add some functionality here!

walkerke commented 1 year ago

Just got confirmation on the Census Slack that the PEP is unlikely to be added back to the API. As such I'm going to put some work in to parse the flat files and update get_estimates() so we can continue to use PEP data in tidycensus.

tabauer23 commented 1 year ago

I had created some custom code that builds the URL and downloads the flat file directly into R. I can try to dig it up and open a pull request to add some of this functionality.

walkerke commented 1 year ago

@tabauer23 that would be great! The key for implementation is that the output mirrors what get_estimates() was delivering for previous years so that the documentation is consistent. We can iterate on that in a branch.

walkerke commented 1 year ago

I have an implementation of this merged to master. The one issue right now: we don't yet have cartographic boundary shapefiles for 2022, which will have the new county definitions for Connecticut. I'm using 2021 geographies internally, but that doesn't work for Connecticut, so I need to fix that and keep an eye on it. I'll hold off on submitting to CRAN until that is resolved.

tabauer23 commented 1 year ago

I am working on a way for get_estimates() to detect when the year is later than 2019 and then pull in the flat file by state FIPS code. I did something like this before. I will push some work to a branch soon for you to check out and see if it is worth the extra effort.
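A rough sketch of the URL-building step, in base R. The helper name `pep_county_file_url()` is hypothetical, and the URL pattern is inferred from the single New Hampshire file linked earlier in this thread; whether it holds for other states or vintages is an assumption.

```r
# Hypothetical helper; the URL pattern is inferred from the one
# county file linked in this thread and may not hold in general.
pep_county_file_url <- function(state_fips, vintage = 2021) {
  # Zero-pad the state FIPS code to two digits (e.g. 6 -> "06")
  state_fips <- sprintf("%02d", as.integer(state_fips))
  base <- "https://www2.census.gov/programs-surveys/popest/datasets"
  sprintf(
    "%s/2020-%d/counties/asrh/cc-est%d-alldata-%s.csv",
    base, vintage, vintage, state_fips
  )
}

nh_url <- pep_county_file_url(33)
print(nh_url)

# Reading the file requires network access, so it is left commented out:
# nh_counties <- read.csv(nh_url)
```

The returned data would still need to be reshaped to mirror what get_estimates() delivered for previous years.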

walkerke commented 1 year ago

I needed this for a project so I went ahead and implemented it; I'm going to update the tidycensus documentation and then send it along to CRAN.

I'll build in full support as the new data files are released for 2022; right now the product argument is not available (nor are breakdowns) but when that data comes out later this year, I'll build in that support again.