hrecht / censusapi

R package to retrieve U.S. Census data and metadata via API
https://www.hrecht.com/censusapi/

Deprecate outdated/unnecessary named parameters from getCensus() to fix bug and be in line with package ethos #100

Open hrecht opened 4 months ago

hrecht commented 4 months ago

Several years ago, when there were far fewer Census Bureau API endpoints, I added some optional parameters to getCensus() as convenience options for some of the economic data endpoints. The package has supported arbitrary parameters (predicates, in Census-speak) since v0.6.0, released on CRAN in 2019, so these named parameters are both unnecessary and unwieldy. Also, catering specifically to certain endpoints in this function is outside the scope of the package.

The named parameters are: c("year", "date", "period", "monthly", "category_code", "data_type_code", "naics", "pscode", "naics2012", "naics2007", "naics2002", "naics1997", "sic")

year was added to the list in v0.7.2, when some of the endpoints were swinging back and forth between using time and year. They are now more consistent, and many of the timeseries APIs use lowercase time as a required predicate. YEAR is a variable name in hundreds of the non-timeseries APIs.

Most of the names on this list are actually UPPERCASE predicates in the APIs, not lowercase.

Users will still be able to use the full functionality of the APIs with arbitrary parameters without having these as named optional parameters.

For example, use uppercase YEAR here instead of lowercase year. But really, the preferred syntax now is time, which is the endpoint's true predicate for filtering the timeseries.

# old
saipe_schools <- getCensus(
    name = "timeseries/poverty/saipe/schdist",
    vars = c("SD_NAME", "SAEPOV5_17V_PT", "SAEPOVRAT5_17RV_PT"),
    region = "school district (unified):*",
    regionin = "state:25",
    year = 2022)

# new and valid but not preferred
saipe_schools <- getCensus(
    name = "timeseries/poverty/saipe/schdist",
    vars = c("SD_NAME", "SAEPOV5_17V_PT", "SAEPOVRAT5_17RV_PT"),
    region = "school district (unified):*",
    regionin = "state:25",
    YEAR = 2022)

# preferred
saipe_schools <- getCensus(
    name = "timeseries/poverty/saipe/schdist",
    vars = c("SD_NAME", "SAEPOV5_17V_PT", "SAEPOVRAT5_17RV_PT"),
    region = "school district (unified):*",
    regionin = "state:25",
    time = 2022)

hrecht commented 3 months ago

To test the usage of these named predicates, I retrieved variable metadata for all timeseries and aggregate endpoints with listCensusMetadata(). Here's how often each is used:

> param_vars %>% count(name, sort = T)
             name   n
1            YEAR 299
2       NAICS2012  99
3   category_code  20
4  data_type_code  20
5       NAICS2007  19
6       NAICS2002  17
7       NAICS1997  16
8             SIC  16
9           NAICS  12
10        MONTHLY   7
11           DATE   4
12         PERIOD   3
13           year   3
14  CATEGORY_CODE   1
15 DATA_TYPE_CODE   1
16         PSCODE   1
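
A count like the one above could be produced roughly as follows. This is a minimal base R sketch, not the code actually used for the table: count_param_vars() is a hypothetical helper, and it assumes only that the metadata is a data frame with a name column, as returned by listCensusMetadata(). Matching ignores letter case so that both YEAR and year are picked up, while the counts keep each exact spelling separate.

```r
# Named parameters proposed for deprecation (copied from the list in this issue).
named_params <- c("year", "date", "period", "monthly", "category_code",
                  "data_type_code", "naics", "pscode", "naics2012",
                  "naics2007", "naics2002", "naics1997", "sic")

# Hypothetical helper (not part of censusapi): keep the variable names that
# match a named parameter in any letter case, then count each exact spelling.
count_param_vars <- function(metadata, params = named_params) {
  hits <- metadata$name[tolower(metadata$name) %in% params]
  counts <- as.data.frame(table(name = hits), stringsAsFactors = FALSE)
  names(counts)[2] <- "n"
  # Most frequent spellings first, like count(name, sort = T).
  counts[order(-counts$n), , drop = FALSE]
}

# Usage against one live endpoint (needs the censusapi package and network
# access, so it is left commented out here):
# meta <- censusapi::listCensusMetadata(name = "timeseries/qwi/sa",
#                                       type = "variables")
# count_param_vars(meta)
```

To cover every endpoint, the metadata for each one would have to be retrieved and row-bound before counting; that loop is omitted here.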

Note that in almost all cases the predicates are actually uppercase, not lowercase. getCensus() coerces all but data_type_code and category_code to uppercase in the request construction. The timeseries/qwi/sa, timeseries/qwi/se, timeseries/qwi/rh endpoints use lowercase year as a predicate. (Documentation: https://www.census.gov/data/developers/data-sets/qwi.html)

This coercion to uppercase results in the following code failing because the required year predicate truly is lowercase here:

qwi <- getCensus(
    name = "timeseries/qwi/sa",
    vars = "Emp",
    region = "state:02",
    year = 2021,
    quarter = 1)

# Error in apiCheck(req) : 
#   The Census Bureau returned the following error message:
#  error: unknown predicate variable: 'YEAR' 
 # Your API call was:  https://api.census.gov/data/timeseries/qwi/sa?key=[KEY]&get=Emp&for=state%3A02&YEAR=2021&quarter=1

Deprecating these named parameters is now also a bug fix to avoid this error.
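
The failure mode can be illustrated with a small self-contained sketch. build_query() below is a hypothetical stand-in written for this comment, not censusapi's actual request-building code, and the coerce_upper argument is an assumption used only to mimic the current behavior. It shows how uppercasing a named parameter changes the predicate sent to the API, while passing names through untouched preserves the endpoint's true lowercase year.

```r
# Hypothetical stand-in for request construction (NOT censusapi internals).
# Names listed in coerce_upper are uppercased; all others pass through as typed.
build_query <- function(..., coerce_upper = character()) {
  preds <- list(...)
  nms <- names(preds)
  nms <- ifelse(nms %in% coerce_upper, toupper(nms), nms)
  paste0(nms, "=", unlist(preds), collapse = "&")
}

# Mimics the current behavior: year is a named parameter and gets uppercased,
# which the QWI endpoints reject as an unknown predicate.
coerced <- build_query(year = 2021, quarter = 1, coerce_upper = "year")
print(coerced)      # "YEAR=2021&quarter=1"

# Mimics the proposed behavior: year is an arbitrary predicate, sent as typed.
passthrough <- build_query(year = 2021, quarter = 1)
print(passthrough)  # "year=2021&quarter=1"
```

With no special-cased names, the user controls the case of every predicate, so both uppercase YEAR and lowercase year endpoints can be queried correctly.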