hrecht / censusapi

R package to retrieve U.S. Census data and metadata via API
https://www.hrecht.com/censusapi/

Bug report: NAICS code labels coerced to NA #68

Closed jwroycechoi closed 4 years ago

jwroycechoi commented 4 years ago

Describe the bug: Some NAICS2017 codes and NAICS2017_LABEL values are being coerced to NA.

To Reproduce

abscs_2017 <- getCensus(name = "abscs",
                        vintage = 2017,
                        vars = c("EMP","EMP_F","EMPSZFI","FIRMPDEMP","FIRMPDEMP_F","YIBSZFI","NAICS2017","NAICS2017_LABEL"),
                        region = "county:*",
                        regionin = "state:20,23,48")
Warning messages:
1: In responseFormat(raw) : NAs introduced by coercion
2: In responseFormat(raw) : NAs introduced by coercion

Expected behavior: NAICS2017 should keep its original code values, such as the total for all sectors (00 rather than 0) and ranges (e.g., 31-33, 44-45). The NAICS2017_LABEL column should also contain values.

Running the same request step by step, following the package's actual R source code, returns the expected results:

## Replicating the Process by getting the raw data ##
abs_api_test <- httr::GET(url = "https://api.census.gov/data/2017/abscs?get=EMP,EMP_F,EMPSZFI,FIRMPDEMP,FIRMPDEMP_F,YIBSZFI,NAICS2017,NAICS2017_LABEL&for=county:*&in=state:48,20,23&key=API_KEY")
abs_api_test <- jsonlite::fromJSON(httr::content(abs_api_test, as = "text"))
colnames(abs_api_test) <- abs_api_test[1,]
abs_api_test <- data.frame(abs_api_test)
abs_api_test <- abs_api_test[-1,]
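
For anyone hitting the same warning, here is a minimal sketch of why a blind numeric conversion drops these values, plus one possible guard. This is not the package's actual code: `safe_numeric` is a hypothetical helper, and the sample codes are just the illustrative values mentioned above.

```r
# Illustrative NAICS-style codes: a total ("00"), two ranges, and a plain code.
naics <- c("00", "31-33", "44-45", "54")

# Blind conversion: ranges cannot be parsed and "00" loses its leading zero.
blind <- suppressWarnings(as.numeric(naics))

# Hypothetical helper (not part of censusapi): convert a column only when
# every non-empty value parses as a number and none has a leading zero.
safe_numeric <- function(x) {
  x <- as.character(x)
  parsed <- suppressWarnings(as.numeric(x))
  introduces_na <- any(is.na(parsed) & !is.na(x) & x != "")
  has_leading_zero <- any(grepl("^0[0-9]", x))
  if (introduces_na || has_leading_zero) x else parsed
}

kept <- safe_numeric(naics)              # stays character
converted <- safe_numeric(c("12", "7"))  # becomes numeric
```

The leading-zero check is a deliberately conservative choice: it keeps identifier-like columns (codes, FIPS values) as text even when every value happens to parse as a number.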


natalieoshea commented 4 years ago

I'm also getting the "NAs introduced by coercion" warning for all string predicates in the Planning Database.

library(censusapi)
library(dplyr)  # provides %>%, arrange(), and desc()

# choose core pdb variables and MOE
pdb_vars <- listCensusMetadata(name = "pdb/tract", vintage = 2020) %>%
  arrange(match(name, c("State", "County", "Tract","Segmentation_Profile")), desc(predicateType), name)

# save pdb data for all NYC counties
pdb <- getCensus(name = "pdb/tract",
                 vintage = 2020,
                 vars = pdb_vars$name,
                 region = "tract:*",
                 regionin = "state:36+county:005,047,061,081,085")

hrecht commented 4 years ago

I'll look into these, thanks for flagging.

hrecht commented 4 years ago

@jwroycechoi This is fixed in the new development version 0.7.0. I'll be submitting to CRAN after a testing period - if you get a chance please try it out.

@natalieoshea Your issue is a bit more complicated. I've done some digging and the Census's underlying data in that API endpoint is poorly formatted. For example, the Med_HHD_Inc_ACS_14_18 variable comes with dollar signs and commas attached. I've sent an email to the Census team describing the issues and will let you know if I hear anything. I'll think about ways to parse those columns without breaking other things but it may be out of scope of this package; the real solution will be on the Census Bureau's end. Very frustrating, I'm sorry.

Right now, I'd recommend a different tactic - request just the variables you need, rather than every variable in the dataset. If you need every single data point, you might want to do a bulk file download rather than use the API. Some of those string predicates, for example Flag, are all nulls in the underlying data for this example anyway.
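
As a stopgap on the user side, columns that arrive with dollar signs and commas can be cleaned after download. The following is a rough sketch only: `parse_currency` is a hypothetical helper, not something censusapi provides, and the input values are made up to mimic the format described above.

```r
# Hypothetical helper (not part of censusapi): strip dollar signs and
# thousands separators, then convert to numeric.
parse_currency <- function(x) {
  cleaned <- gsub("[$,]", "", as.character(x))
  # Treat blank strings as missing rather than leaving them as text.
  cleaned[!is.na(cleaned) & trimws(cleaned) == ""] <- NA
  as.numeric(cleaned)
}

# Made-up values in the described format, not real Census figures.
income_raw <- c("$51,234", "$102,500", "", NA)
income <- parse_currency(income_raw)
```

Applied to a column such as Med_HHD_Inc_ACS_14_18, this would only make sense after confirming that the column really holds currency-formatted text.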

natalieoshea commented 4 years ago

Thanks for following up on that! That makes sense... As I was doing some data wrangling I noticed that many numeric variables were randomly being read as character vectors which was strange. Sounds like there are quite a few issues with the underlying API data. Thanks again for looking into this and creating this wonderful package!

samiaab1990 commented 3 years ago

I'm having the same issue with just one table, the languages table (B16001), in the 2019 acs/acs5 dataset:

languages <- getCensus(key = Sys.getenv("CENSUS_API_KEY"),
                       name = "acs/acs5",
                       vintage = 2019,
                       vars = c("NAME", "group(B16001)"),
                       region = "tract:*",
                       regionin = "state:36")

I get the following warning: In responseFormat(raw) : NAs introduced by coercion

The estimates come back with an unknown class when I assume they should be numeric. Is this outside the package's control and an issue with the API itself?

hrecht commented 3 years ago

Hi @samiaab1990, this is not from the package, it's from the API. This data doesn't appear to be available at the tract level.

The NA message can be ignored, it's saying that blank data is being turned into R NAs. See the original call: https://api.census.gov/data/2019/acs/acs5?key=[yourkey]&get=NAME%2Cgroup(B16001)&for=tract:*&in=state%3A36

You can see that if you run it on the state level you do get some data. The annotation variables are still empty, since there aren't annotations here. https://api.census.gov/data/2019/acs/acs5?key=[yourkey]&get=NAME%2Cgroup%28B16001%29&for=state%3A36

You can always use the option show_call = TRUE to see the raw API call, paste it into the browser, and check whether it matches the R output. Also, please open new issues for new problems. Thanks!

fangzhou-xie commented 2 years ago

I also get the same warning with the following code (although working with international trade api):

censusapi::getCensus(
    "timeseries/intltrade/imports/hs",
    vars = c("CTY_NAME", "I_COMMODITY", "YEAR", "MONTH", "CON_QY1_MO", "CON_QY1_MO_FLAG"),
    time = "2013-01", CTY_CODE = 1220
  )

It seems that the "FLAG" variable has been coerced to numeric, which it is not supposed to be. This flag tells me whether a 0 in the numeric "CON_QY1_MO" is a real zero or a missing value (with 0 used as a placeholder).

Personally, I would rather get the raw data and clean it myself, so that I can be sure no information is lost in the process. Is it possible to get the raw text for each column when calling the getCensus function?
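
To illustrate the kind of manual cleanup meant here, the sketch below keeps every column as text and converts only the ones named explicitly. `convert_columns` is a hypothetical helper, and the rows are made-up placeholders shaped like the request above, not actual API output.

```r
# Made-up rows shaped like the request above; all values are placeholders.
raw <- data.frame(
  CTY_NAME = c("COUNTRY A", "COUNTRY A"),
  CON_QY1_MO = c("0", "1500"),
  CON_QY1_MO_FLAG = c("A", ""),
  stringsAsFactors = FALSE
)

# Hypothetical helper: convert only the named columns, leave the rest as text.
convert_columns <- function(df, numeric_cols) {
  df[numeric_cols] <- lapply(df[numeric_cols], as.numeric)
  df
}

cleaned <- convert_columns(raw, "CON_QY1_MO")
```

With this approach the flag column is never touched, so whatever codes it carries survive intact next to the converted quantity.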

Thank you very much for this package and it makes my life easier to deal with the Census API.

hrecht commented 2 years ago

Hi @fangzhou-xie, this was already fixed in the development version of censusapi, it just isn't on CRAN yet. You can install the latest version using devtools::install_github("hrecht/censusapi"). Then restart R and it should work, using censusapi v0.7.3. I see the correct values in those columns when I run your code on the dev version. Hope that helps.

hrecht commented 2 years ago

Having no column cleanup at all is an interesting idea - I'll look into that for the next version. I don't recall anyone else ever suggesting that since most people do want numeric types, but I'll see if it could work.

fangzhou-xie commented 2 years ago

Thank you very much! The development version did fix this! (I naively thought that the new version had been published on CRAN, given that this thread has been around for some time ;)

Of course, having cleaned-up data would be nice, but given that the API has lots of messy entries, forcing columns into numeric might introduce errors. That is what I was worried about. A "raw output" option might be very helpful for anyone working with a messy API endpoint who wants to get the raw data and clean it up themselves.
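
One way to check for this kind of loss before converting anything is to list the values that would not survive. This is a small sketch under my own assumptions: `coercion_losses` is a hypothetical helper and the input vector is invented for illustration.

```r
# Hypothetical helper: list the distinct non-empty values that numeric
# conversion would silently turn into NA.
coercion_losses <- function(x) {
  x <- as.character(x)
  parsed <- suppressWarnings(as.numeric(x))
  unique(x[is.na(parsed) & !is.na(x) & trimws(x) != ""])
}

# Invented sample column mixing numbers, a range, a letter code, and blanks.
lost <- coercion_losses(c("12", "31-33", "N", "", NA, "31-33"))
```

An empty result suggests the column is safe to convert. Anything else is information that a forced numeric conversion would discard.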

hrecht commented 2 years ago

This is a volunteer project, I don't get paid for this and can only contribute on nights and weekends. CRAN submission is a huge amount of work and I do it when I have the time.

fangzhou-xie commented 2 years ago

I have several packages on CRAN as well, so I totally understand!

I didn't mean to put pressure or anything and I am very grateful to have such a wonderful package around. All I was saying was that it would be nice (and potentially helpful to others later) to see this option in the future version.

Thanks a lot!