jtleider / censusdata

Download data from Census API
MIT License

Discrepancy between API results and downloaded table from Census data.gov #44

Open mdiep-cese opened 2 years ago

mdiep-cese commented 2 years ago

I am noticing a discrepancy between the data obtained through this API and the data obtained directly from the Census data website (via its table download feature).

The datafield I am looking at specifically is B18108 (disability-related data), but from the look of it, this affects other datafields as well. I am using county-level data from the 2019 ACS 1-year estimates.

The following code is used:

result = censusdata.download('acs1', 2019, censusdata.censusgeo([('county', '*')]), datafields)
censusdata.export.exportcsv('censusdata-api.csv', result)

(The datafields are all the variables/values related to B18108)

The data itself is not incorrect, but the values seem to correspond to the wrong location/county. The first couple of rows are correct, but the subsequent rows are not. One example: the data for Los Angeles County obtained from the API matches Napa County in the downloaded table. The Rock County, WI data from the API matches Scioto County, OH.
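
One way to check whether a mismatch like this is only a row-order difference is to compare the two tables twice: once by position and once after aligning on a shared key. The sketch below uses made-up county names and values (not real Census figures), and the frame names `api` and `site` are hypothetical stand-ins for the two CSV files.

```python
import pandas as pd

# Toy values, not real Census figures: the same three counties in a different row order.
api = pd.DataFrame({
    "NAME": ["County A", "County B", "County C"],
    "B18108_001E": [100, 200, 300],
})
site = pd.DataFrame({
    "NAME": ["County C", "County A", "County B"],
    "B18108_001E": [300, 100, 200],
})

# Comparing row by row pairs up unrelated counties, so it reports a mismatch.
positional_match = api["B18108_001E"].equals(site["B18108_001E"])

# Aligning on the county name first compares like with like.
merged = api.merge(site, on="NAME", suffixes=("_api", "_site"), validate="one_to_one")
keyed_match = merged["B18108_001E_api"].equals(merged["B18108_001E_site"])

print(positional_match, keyed_match)
```

If the positional comparison fails while the keyed one succeeds, the values agree and only the row order differs between the two sources.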

steventrev commented 2 years ago

Can you post a full example? Here's my own, which shows that B18108_001E is identical to a CSV export from data.census.gov at the county level. The two tables are not sorted in the same row order (and you should not assume they are), but they are not inaccurate.

import pandas as pd
import censusdata as cd

datafields = ['B18108_001E']
result = cd.download('acs1', 2019, cd.censusgeo([('county', '*')]), datafields)
cd.export.exportcsv('censusdata-api.csv', result)
dfcd = pd.read_csv('censusdata-api.csv')
#dfcd.shape #(840, 4)

#Downloaded B18108 table from 2019 ACS1 via https://data.census.gov/cedsci/table?q=B18108%3A%20AGE%20BY%20NUMBER%20OF%20DISABILITIES&g=0100000US%240500000&tid=ACSDT1Y2019.B18108
dfacs = pd.read_csv('ACSDT1Y2019.B18108.csv', skiprows=[1])
dfacs = dfacs[['B18108_001E', 'NAME']]
#dfacs.shape #(840, 2)

df = dfacs.merge(dfcd, on='NAME', suffixes=['_acs', '_cd'])
df['B18108_001E_acs'].equals(df['B18108_001E_cd']) #True
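
County names can differ in spelling or punctuation between sources, so a geographic identifier may be a sturdier join key than NAME. The sketch below assumes the exported API CSV has `state` and `county` columns holding FIPS codes (which the (840, 4) shape above suggests) and that the data.census.gov download has a `GEO_ID` column of the form `0500000US` plus the five-digit county FIPS code. The helper `county_geo_id` is hypothetical and the data values are placeholders.

```python
import pandas as pd

def county_geo_id(state, county):
    """Build a county GEO_ID such as '0500000US06037' from state and county FIPS codes."""
    # read_csv parses FIPS codes as integers, so the leading zeros must be restored.
    return "0500000US" + str(state).zfill(2) + str(county).zfill(3)

# Stand-in for the exported API CSV after read_csv (leading zeros already lost).
# FIPS codes are Los Angeles (06037) and Napa (06055); the values are placeholders.
dfcd = pd.DataFrame({"state": [6, 6], "county": [37, 55], "B18108_001E": [1, 2]})
dfcd["GEO_ID"] = [county_geo_id(s, c) for s, c in zip(dfcd["state"], dfcd["county"])]

# Stand-in for the data.census.gov download, deliberately in a different row order.
dfacs = pd.DataFrame({
    "GEO_ID": ["0500000US06055", "0500000US06037"],
    "B18108_001E": [2, 1],
})

# indicator=True adds a _merge column that flags rows found in only one source.
df = dfcd.merge(dfacs, on="GEO_ID", how="outer", suffixes=("_cd", "_acs"), indicator=True)
unmatched = df[df["_merge"] != "both"]
print(len(unmatched))
```

An empty `unmatched` frame means every county was paired across the two files, after which the `_cd` and `_acs` columns can be compared directly.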

datatalking commented 2 years ago

@steventrev thanks for replying. Now that @jtleider has left, are we keeping the data current, or is there a list of bugs, features, documentation, or other tasks that need doing?

steventrev commented 2 years ago

@datatalking - this package and its documentation continue to work presently. The package can support 2020 data by adding new tables to the /censusdata/variables/ path, which many forks (including my own) have done. I'm a greenhorn in this space, but will support where I can.

datatalking commented 2 years ago

@steventrev are you supporting the censusdata package going forward from your repo? If so, I'd like to help collaborate. I'm green to the censusdata package but have used the data within it for years. Hopefully my Python and other skills can be of use; I see this package as worth (some) maintaining.

steventrev commented 2 years ago

@datatalking I doubt my capability beyond my refresh of the input files. Would a better course of action be to request the reins from @jtleider?