hrecht / censusapi

R package to retrieve U.S. Census data and metadata via API
https://www.hrecht.com/censusapi/
169 stars 30 forks source link

Issues when pulling multiple vars from ACS API at ZCTA level #57

Closed zebryjiang closed 4 years ago

zebryjiang commented 4 years ago

Hi I found some errors when pulling multiple variables from ACS 5 yr data (11-16), and I believe this cause is that the census does not return the estimates in the same order at ZCTA level. And since the package just does a simple colbind for the variables, some of the estimates are mis-matched with the geoid's.

I tried to pull each indicator separately and merge them together by their geoid, everything looks fine then.

hrecht commented 4 years ago

Please provide reproducible code so I can recreate this issue, thanks.

hrecht commented 4 years ago

I've been looking into this issue and trying to recreate it without luck. I would like to fix it if it's still occurring, so if anyone else has run into data joining mismatches please share your code.

I've tried this in several ways, here's an example:

library(censusapi)
library(dplyr)

acs_groups <- listCensusMetadata(
    name = "acs/acs5",
    vintage = 2018,
    type = "groups")

group_vars <- listCensusMetadata(
    name = "acs/acs5",
    vintage = 2018,
    group = "B01001")

acs_group <-  getCensus(
    name = "acs/acs5",
    vintage = 2018,
    vars = "group(B01001)",
    region = "zip code tabulation area:*"
)

acs_manual <-  getCensus(
    name = "acs/acs5",
    vintage = 2018,
    vars = group_vars$name,
    region = "zip code tabulation area:*"
)
acs_group <- acs_group %>% arrange(zip_code_tabulation_area)
acs_manual <- acs_manual %>% arrange(zip_code_tabulation_area)
acs_group <- acs_group[,order(colnames(acs_group))]
acs_manual <- acs_manual[,order(colnames(acs_manual))]

# Difference in columns
setdiff(colnames(acs_group), colnames(acs_manual))
acs_group <- acs_group %>%  select(-GEO_ID, -NAME)

# Are they the same
all_equal(acs_group, acs_manual)
hrecht commented 4 years ago

I'm closing since there hasn't been followup and I haven't been able to replicate this issue, but if anyone runs into it again please post reproducible code so I can test.

ottothecow commented 3 years ago

I saw this issue report and it worried me so I tried to do more testing. So far, I have not been able to replicate either.

I saw that @hrecht's attempt to replicate relied on pulling an entire group at once--I wondered if maybe the issue could only appear if grabbing variables from different groups. Perhaps some groups return in a different order, or are missing certain values in which case a column bind wouldn't work.

I tried running this code with a different choices of variable, but it still worked out OK.

combo <- getCensus(name="acs/acs5", vintage = "2016",
                   vars=c("NAME", "B15003_001E","B07001_044E", "C27018_026E" ), 
                   region="zip code tabulation area:*")
one   <- getCensus(name="acs/acs5", vintage = "2016",
                   vars=c("NAME"), 
                   region="zip code tabulation area:*")
two   <- getCensus(name="acs/acs5", vintage = "2016",
                   vars=c("B15003_001E"), 
                   region="zip code tabulation area:*")
three <- getCensus(name="acs/acs5", vintage = "2016",
                   vars=c("B07001_044E"), 
                   region="zip code tabulation area:*")
four  <- getCensus(name="acs/acs5", vintage = "2016",
                   vars=c("C27018_026E" ), 
                   region="zip code tabulation area:*")

all = one %>%
  left_join(two) %>%
  left_join(three) %>%
  left_join(four)

all_equal(combo,all)

I couldn't find any combination of variables where the separate pulls merged by zcta didn't match the combined pull. Maybe there is some way to come up with a request that returns in a different order, but I couldn't find it.

However, I did notice that if you change the year, the API does not return in the same order. Is it possible that @zebryjiang was pulling from different years without realizing it? That would at least explain the issue with zctas being out of order. Otherwise more information is needed to replicate.

> test16 <- getCensus(name="acs/acs5", vintage = "2016",
+                    vars=c("C27018_026E" ), 
+                    region="zip code tabulation area:*")
> test14 <- getCensus(name="acs/acs5", vintage = "2014",
+                     vars=c("C27018_026E" ), 
+                     region="zip code tabulation area:*")
> head(test14)
  zip_code_tabulation_area C27018_026E
1                    01062        1405
2                    01063           0
3                    01066          NA
4                    01068         319
5                    01069        1741
6                    01070         152
> head(test16)
  zip_code_tabulation_area C27018_026E
1                    37064        7315
2                    37066        9506
3                    37067        2813
4                    37069        1986
5                    37072        6681
6                    37073        3499
ottothecow commented 3 years ago

I take that back--I found an example of this issue in action. I don't think I was pulling enough variables at once.

B01001_001E and B02001_001E should both be total population. However when I was pulling a lot of different variables, I noticed that they didn't match.

library(censusapi)
library(dplyr)

getvars = c('NAME',
#Age/Pop variables
'B01001_001E',
'B01001_002E','B01001_003E','B01001_004E','B01001_005E','B01001_006E',
'B01001_007E','B01001_008E','B01001_009E','B01001_010E','B01001_011E',
'B01001_012E','B01001_013E','B01001_014E','B01001_015E','B01001_016E',
'B01001_017E','B01001_018E','B01001_019E','B01001_020E','B01001_021E',
'B01001_022E','B01001_023E','B01001_024E','B01001_025E','B01001_026E',
'B01001_027E','B01001_028E','B01001_029E','B01001_030E','B01001_031E',
'B01001_032E','B01001_033E','B01001_034E','B01001_035E','B01001_036E',
'B01001_037E','B01001_038E','B01001_039E','B01001_040E','B01001_041E',
'B01001_042E','B01001_043E','B01001_044E','B01001_045E','B01001_046E',
'B01001_047E','B01001_048E','B01001_049E',
#Race
'B02001_001E','B02001_002E','B02001_003E','B02001_004E','B02001_005E',
'B02001_006E','B02001_007E','B02001_008E'
)

allvars <- getCensus(name="acs/acs5", vintage = "2015",
                   vars=getvars, 
                   region="zip code tabulation area:*")

one   <- getCensus(name="acs/acs5", vintage = "2015",
                   vars=c("NAME"), 
                   region="zip code tabulation area:*")
two   <- getCensus(name="acs/acs5", vintage = "2015",
                   vars=c('B01001_001E'), 
                   region="zip code tabulation area:*")
three <- getCensus(name="acs/acs5", vintage = "2015",
                   vars=c('B02001_001E'), 
                   region="zip code tabulation area:*")

selectvars = allvars %>%
  select(zip_code_tabulation_area, NAME, B01001_001E, B02001_001E)

all = one %>%
  left_join(two) %>%
  left_join(three)

all_equal(selectvars,all)

These results are completely wrong. Looking at them for a single ZCTA, it is clear something is up:

> selectvars[selectvars$zip_code_tabulation_area=='60613',]
      zip_code_tabulation_area        NAME B01001_001E B02001_001E
28396                    60613 ZCTA5 60613       49389        8859
> all[all$zip_code_tabulation_area=='60613',]
      zip_code_tabulation_area        NAME B01001_001E B02001_001E
28396                    60613 ZCTA5 60613       49389       49389

The combined pull is incorporating results from a different ZCTA. Looking for that population in the data suggests that instead of reporting the data for 60613, it is actually reporting the data for 55810:

> all[all$B01001_001E==8859,]
     zip_code_tabulation_area        NAME B01001_001E B02001_001E
1926                    55810 ZCTA5 55810        8859        8859

@zebryjiang is probably right that the census is returning things in a different order, and they are just getting bound back together when they should be joined by geoid.

I note that in this example, I am pulling more than 50 variables, and they come from different tables. I think the issue arises here: https://github.com/hrecht/censusapi/blob/b1d720e378750830b8b49e6f582ac83be195fdf5/R/getcensus_functions.R#L227 but I don't immediately know how to resolve it. Instead of cbind it should be a merge/join, but I am not sure what the most efficient way to do this is in Base R.