walkerke / pygris

Use US Census shapefiles in Python (port of the R tigris package)
https://walker-data.com/pygris
MIT License
113 stars 16 forks

data.get_census 50vars fixes #1 #8

Closed jbousquin closed 1 year ago

jbousquin commented 1 year ago

Breaks the variables into manageable chunks, loops over those chunks, and merges the results at the end. The merge defaults to the intersection of the columns in both DataFrames, which should be either the geography columns (state, county, etc.) or GEOID.
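A minimal sketch of that chunk-and-merge idea, not the code in this PR: it uses plain lists of dicts instead of DataFrames so it runs without pandas or network access. The names `chunk`, `merge_rows`, `fetch_in_chunks`, and `fake_fetch` are hypothetical, and the chunk size of 50 is an assumption based on the limit named in the PR title.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def chunk(items: List[str], size: int = 50) -> List[List[str]]:
    """Split items into consecutive lists of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def merge_rows(left: List[Row], right: List[Row]) -> List[Row]:
    """Inner-join two row lists on the keys they have in common."""
    if not left or not right:
        return []
    # Shared keys play the role of the shared DataFrame columns (e.g. GEOID).
    shared = sorted(set(left[0]) & set(right[0]))
    lookup = {tuple(r[k] for k in shared): r for r in right}
    merged = []
    for row in left:
        match = lookup.get(tuple(row[k] for k in shared))
        if match is not None:
            merged.append({**row, **match})
    return merged


def fetch_in_chunks(variables: List[str],
                    fetch: Callable[[List[str]], List[Row]],
                    size: int = 50) -> List[Row]:
    """Call `fetch` once per chunk and join the partial results."""
    result = None
    for part in chunk(variables, size):
        rows = fetch(part)
        result = rows if result is None else merge_rows(result, rows)
    return result or []


def fake_fetch(part: List[str]) -> List[Row]:
    """Stand-in for a real API request (assumption: one row per GEOID)."""
    geoids = ("36001000100", "36001000200")
    return [{"GEOID": g, **{v: f"{g}-{v}" for v in part}} for g in geoids]


# 51 variables, so the request is split into two chunks.
variables = [f"B01001_{i:03d}E" for i in range(1, 50)] + ["B25002_002E", "B03003_003E"]
rows = fetch_in_chunks(variables, fake_fetch)
```

Joining on the shared identifier keys rather than on row position is what keeps rows aligned if two requests come back in a different order.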

jbousquin commented 1 year ago

I wasn't sure how you would want the test cases structured. It makes sense to test with a request that returns index differences depending on the number of variables, as discussed in the issue. However, I had trouble reproducing index differences across requests with the two examples below, and neither of them uses 50+ variables (I suggest just adding some variables to the second). I'm sharing my Python interpretation of both in case I'm missing something.

From: walkerke/tidycensus#165

# get_census lives in pygris.data
from pygris.data import get_census

variables = ["B01001_003E", "B01001_004E", "B01001_005E", "B01001_006E", "B01001_007E",
             "B01001_008E", "B01001_009E", "B01001_010E", "B01001_011E", "B01001_012E",
             "B01001_013E", "B01001_014E", "B01001_015E", "B01001_016E", "B01001_017E",
             "B01001_018E", "B01001_019E", "B01001_020E", "B01001_021E", "B01001_022E",
             "B01001_023E", "B01001_024E", "B01001_025E", "B01001_026E", "B25002_002E",
             "B03003_003E"]

test1 = get_census(dataset = "acs/acs5",
                   variables = "B03003_003E",
                   year = 2017,
                   params = {
                             "for": "tract:*",
                             "in": "state:36;county:*",
                            },
                   return_geoid = True)

test2 = get_census(dataset = "acs/acs5",
                   variables = variables,
                   year = 2017,
                   params = {
                             "for": "tract:*",
                             "in": "state:36;county:*",
                            },
                   return_geoid = True)

test3 = test1.merge(test2, left_index=True, right_index=True, how='left', indicator=True)
assert len(test3[test3['_merge'] == 'both']) == len(test3), 'Batch index mis-match'

From: hrecht/censusapi#82

# Group B01001 (001-049E): zero-pad the variable number to three digits
group_B01001 = [f'B01001_{i:03d}E' for i in range(1, 50)]

acs_pop_group = get_census(dataset = "acs/acs5",
                           variables = group_B01001,
                           year = 2017,
                           params = {
                               "for": "tract:*",
                               "in": "state:02;county:*",
                           },
                           return_geoid = True)

acs_pop_manual = get_census(dataset = "acs/acs5",
                            variables = 'B01001_001E',
                            year = 2017,
                            params = {
                                "for": "tract:*",
                                "in": "state:02;county:*",
                            },
                            return_geoid = True)

# Check they are all equal
comp = acs_pop_group['B01001_001E']== acs_pop_manual['B01001_001E']
comp.value_counts()

# Or assert they are all equal
test_acs_pop_group = acs_pop_group.merge(acs_pop_manual, left_index=True, right_index=True, how='left', indicator=True)
assert len(test_acs_pop_group[test_acs_pop_group['_merge'] == 'both']) == len(test_acs_pop_group), 'Batch index mis-match'

walkerke commented 1 year ago

Thanks! I'll spend some time going through this and doing some checks. Appreciate the PR!