transportfoundry / censusr

Get data through the US Census API

Error using allgeos for blocks #14

Open dkyleward opened 8 years ago

dkyleward commented 8 years ago

The following code correctly returns a list of household population at the block group level.

blocks_with_population <- call_census_api(
  "H0100001", names = c("pop_hh"), geoids = "51775", allgeos = "bg",
  data_source = "sf1")

The following code should return a list of household population at the block level, but instead it returns an error.

blocks_with_population <- call_census_api(
  "H0100001", names = c("pop_hh"), geoids = "51775", allgeos = "bl",
  data_source = "sf1")

## Error in names(df) <- response[[1]] :
##  'names' attribute [1] must be the same length as the vector [0]

gregmacfarlane commented 8 years ago

Maybe this line needs reconstruction for calling from blocks?

josiekre commented 8 years ago

I ran this on the dev version and get a different error. Tracking this down still.

call_census_api(
  "H0100001", names = c("pop_hh"), geoids = "51775", allgeos = "bl",
  data_source = "sf1")

## Error in call_api_once(variables_to_get, geoid, allgeos, data_source,  :
##     client error: (400) Bad Request

josiekre commented 8 years ago

This seems to be an error on the Census API side. The URL the package builds should work, but the API returns an error.

http://api.census.gov/data/2010/sf1?get=H0100001&for=block:*&in=state:51+county:775

The block group version works:

http://api.census.gov/data/2010/sf1?get=H0100001&for=block+group:*&in=state:51+county:775

Based on the examples here, this should work.

josiekre commented 8 years ago

This brings up another point, though. We should improve how error messages from the Census API are passed through to R.
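A minimal sketch of what that could look like, assuming the package already has the HTTP status code and the response body as text. `check_census_response()` and its arguments are hypothetical names, not part of censusr:

```r
# Hypothetical helper: turn a failed Census API response into an R error
# that carries the server's own message instead of a bare "Bad Request".
# `status`, `body`, and `url` are assumed to come from the HTTP client.
check_census_response <- function(status, body, url) {
  if (status >= 400) {
    # Trim whitespace so the message reads cleanly in the console
    detail <- trimws(body)
    if (!nzchar(detail)) detail <- "(no message returned)"
    stop(
      sprintf("Census API request failed (HTTP %d): %s\nURL: %s",
              as.integer(status), detail, url),
      call. = FALSE
    )
  }
  invisible(TRUE)
}
```

If the request is made with httr, the inputs could come from `httr::status_code(resp)` and `httr::content(resp, as = "text", encoding = "UTF-8")`.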

josiekre commented 8 years ago

The Developer forum seems to be wigging out right now. I've sent an email to the Census API contact I have. She is working to answer my question via email as there's a licensing issue with the Q&A site.

josiekre commented 8 years ago

This is the message from my Census contact:

We restrict the use of wildcards to prevent very large data pools. In the case of blocks, there are over 100,000 blocks in some counties, and since we allow people to pull 50 variables at a time, we're talking about 5 million cells of data to be pulled in a single API call.

The best way to tell what kind of wildcards can be used right now is to look at the examples.html page in discovery for a given dataset (http://api.census.gov/data/2010/sf1/examples.html). The first example shows the broadest use of wildcards allowed for each hierarchy. Since you see tract in the example, you can know you must use tract.

With that, we need to decide how we'd like to proceed. Do we also want to restrict this behavior and produce an error, or do we want to query all the tracts in the county, submit each tract query separately, and compile/return?
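A rough sketch of the second option, building one block query per tract. `build_block_urls()` is a hypothetical helper, the `+tract:` clause is an assumption modeled on the URLs above, and the warning threshold of 100 tracts is arbitrary:

```r
# Hypothetical helper: build one block-level query URL per tract.
# The base URL and the `for=` / `in=` layout are copied from the URLs
# earlier in this thread; the `+tract:` clause is an assumption.
build_block_urls <- function(variable, state, county, tracts,
                             base = "http://api.census.gov/data/2010/sf1") {
  # Arbitrary threshold: many tracts means many separate requests
  if (length(tracts) > 100) {
    warning("Requesting blocks for ", length(tracts),
            " tracts; this may take a long time.", call. = FALSE)
  }
  # sprintf() is vectorized, so this returns one URL per tract
  sprintf("%s?get=%s&for=block:*&in=state:%s+county:%s+tract:%s",
          base, variable, state, county, tracts)
}

# The tract codes here are made up for illustration.
urls <- build_block_urls("H0100001", "51", "775", c("010100", "010200"))
```

Each URL would then be fetched separately and the results row-bound into one table. The list of tracts in the county would have to come from a tract-level query first.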

dkyleward commented 8 years ago

I can see reasons for either approach. Playing nicely with the Census API would suggest that you trap it instead of working around it. If you decide to work around it, it would be nice to include a warning that downloading that many records takes a long time.

Originally, I worked around it by submitting all the block IDs, but I took Greg's advice and just sampled 1500 blocks instead. I didn't really need that many blocks, and the time required was too long.

gregmacfarlane commented 8 years ago

Does this mean the API never returns * blocks when at a county? If we know what their rules are then we can write good error handling.

josiekre commented 8 years ago

You have to specify the tract to get * blocks to return. So yes to your question. If you request * blocks and only the county, you'll get an error.
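With that rule known, the error-trapping option could be sketched as a check that runs before any request is sent. `check_block_request()` is a hypothetical name, and the check assumes that `"bl"` is the `allgeos` code for blocks (as used above) and that tract geoids are 11 characters (2 state + 3 county + 6 tract):

```r
# Hypothetical guard: reject a block wildcard unless every geoid is at
# least tract-level (11 characters).
check_block_request <- function(geoids, allgeos = NULL) {
  too_coarse <- nchar(geoids) < 11
  if (identical(allgeos, "bl") && any(too_coarse)) {
    stop("The Census API only returns all blocks within a tract. ",
         "Supply 11-digit tract geoids instead of: ",
         paste(geoids[too_coarse], collapse = ", "),
         call. = FALSE)
  }
  invisible(TRUE)
}
```

Called with `geoids = "51775"` and `allgeos = "bl"`, this stops with a readable message rather than letting the API return a 400.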

dkyleward commented 7 years ago

I remain immensely impressed with this package! However, I'm running into a problem. It may be related to this issue, so I'm putting it here. I can always move it if desired. Anyway:

Here is a code block that works. I'm pulling size data for three counties in Florida.

library(tidyverse)
library(censusr)

county_fips <- c(
  "12099", # Palm Beach
  "12011", # Broward
  "12086"  # Miami-Dade
)

size_vars <- paste0(
  "B19019_", sprintf("%03d", 1:8), "E"
)
size_names <- c(
  "total",
  paste0("size_", sprintf("%d", 1:7))
)
size_data <- call_census_api(
  variables = size_vars,
  names = size_names,
  geoids = county_fips,
  data_source = "acs",
  year = 2015,
  period = 5
)

This one doesn't work. The only difference is the inclusion of the allgeos option. It's at a tract level, so we're not talking about 100k geographies.

library(tidyverse)
library(censusr)

county_fips <- c(
  "12099", # Palm Beach
  "12011", # Broward
  "12086"  # Miami-Dade
)

size_vars <- paste0(
  "B19019_", sprintf("%03d", 1:8), "E"
)
size_names <- c(
  "total",
  paste0("size_", sprintf("%d", 1:7))
)
size_data <- call_census_api(
  variables = size_vars,
  names = size_names,
  geoids = county_fips,
  allgeos = "tr",
  data_source = "acs",
  year = 2015,
  period = 5
)

The error message is:

Error in `[<-.data.frame`(`*tmp*`, , -1, value = numeric(0)) : 
  replacement has 0 items, need 7088

What makes this potentially different is that this code block also works. It has the allgeos option specified, but it is pulling a different table. This one is income, but pulling workers is also successful.

# get households by income
income_vars <- paste0(
  "B19001_", sprintf("%03d", 1:17), "E"
)
inc_breaks <- c(10, 15, 20, 25, 30, 35, 40, 45, 50, 60, 75, 100, 125, 150, 200)
income_names <- c(
  "total",
  paste0("under_", sprintf("%d", inc_breaks)),
  "over_200"
)
income_data_orig <- call_census_api(
  variables = income_vars,
  names = income_names,
  geoids = county_fips,
  allgeos = "tr",
  data_source = "acs",
  year = 2015,
  period = 5
)

Pretty much every ACS table should be available at the tract level, so I don't think that's the problem. There are only 1219 tracts in the three counties, so the size of the request isn't it either.

dkyleward commented 7 years ago

Never mind my last comment. It looks as though I was simply trying to pull data from a table that was too stratified. It took me a while to figure it out. I guess that's a plug for #17

josiekre commented 7 years ago

Glad you figured it out. I skimmed your message and said “hmmmm” out loud. I was thinking we’d really have to dig to figure it out.

I cannot think of a programmatic way to error check this kind of thing. Can you, given that you spent the full day thinking about this?

dkyleward commented 7 years ago

Something that might help would be to check the data returned by the API. If all null/NA, throw a user-friendly error:

Error: The Census did not return any data for this combination of variables and geography.

This is probably easy to do. I'd have sent you a pull request if I understood the package/api at all.
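For what it's worth, the check described above might look something like this. It is only a sketch: `check_census_data()` is a hypothetical name, and it assumes the returned data frame has the geoid in its first column and the requested variables in the rest:

```r
# Hypothetical check: stop with a friendly error if every requested value
# came back missing. Assumes column 1 is the geoid and the remaining
# columns hold the variables.
check_census_data <- function(df) {
  values <- df[, -1, drop = FALSE]
  all_missing <- nrow(values) == 0 ||
    all(vapply(values, function(x) all(is.na(x)), logical(1)))
  if (all_missing) {
    stop("The Census did not return any data for this combination ",
         "of variables and geography.", call. = FALSE)
  }
  df
}
```

Because it returns the data frame unchanged when at least one value is present, it could wrap the result just before it is handed back to the user.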