socialfoundations / folktables

Datasets derived from US census data
MIT License

Basic Example with 2019 gives no key "RELP" found #22

Closed by eddiebergman 2 years ago

eddiebergman commented 2 years ago

Hello,

Thank you for the very useful dataset generator, it's a great resource! I accidentally changed the year from 2018 to 2019 in the basic example and ran into an issue. Does this mean the available columns from year to year are not consistent?

from folktables import ACSDataSource, ACSEmployment

# Change year from 2018 to 2019
data_source = ACSDataSource(survey_year='2019', horizon='1-Year', survey='person')
acs_data = data_source.get_data(states=["CA"], download=True)
features, label, group = ACSEmployment.df_to_numpy(acs_data)

Traceback (most recent call last):
  File "test_script1.py", line 9, in <module>
    features, label, group = ACSEmployment.df_to_numpy(acs_data)
  File "/home/.../.venv/lib/python3.8/site-packages/folktables/folktables.py", line 88, in df_to_numpy
    res.append(df[feature].to_numpy())
  File "/home/.../.venv/lib/python3.8/site-packages/pandas/core/frame.py", line 3505, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/.../.venv/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3623, in get_loc
    raise KeyError(key) from err
KeyError: 'RELP'

mrtzh commented 2 years ago

Yes. The Census Bureau does that. See the 2019 ACS documentation, p. 16, Section H:

In previous years, the variable for relationship was RELP. RELP has been removed and replaced with the variable RELSHIPP.

This is why folktables makes it easy to update prediction task definitions. In the definition of ACSEmployment, simply change the line containing RELP to RELSHIPP.
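
A minimal sketch of that change in plain Python, with no folktables import. The feature list is an illustrative subset rather than the full ACSEmployment definition, and the names FEATURES_2018, RENAMES_BY_YEAR, features_for_year and missing_columns are hypothetical helpers introduced here. Only the RELP to RELSHIPP rename for 2019 comes from this thread.

```python
# Illustrative subset of a task's feature list; the real ACSEmployment
# definition in folktables contains more columns than shown here.
FEATURES_2018 = ["AGEP", "SCHL", "MAR", "RELP", "SEX", "RAC1P"]

# Column renames keyed by the first survey year in which they apply.
# Only the RELP -> RELSHIPP change (2019) is taken from the thread.
RENAMES_BY_YEAR = {2019: {"RELP": "RELSHIPP"}}


def features_for_year(features, survey_year, renames_by_year=RENAMES_BY_YEAR):
    """Return a copy of `features` with every rename up to `survey_year` applied."""
    updated = list(features)
    for year in sorted(renames_by_year):
        if year <= survey_year:
            mapping = renames_by_year[year]
            updated = [mapping.get(name, name) for name in updated]
    return updated


def missing_columns(features, available_columns):
    """List the features that are absent from the downloaded data's columns."""
    available = set(available_columns)
    return [name for name in features if name not in available]


features_2019 = features_for_year(FEATURES_2018, 2019)
print(features_2019)
```

The resulting list would take the place of the original feature list in a copy of the task definition. Calling missing_columns(features, acs_data.columns) before df_to_numpy is one way to catch a renamed variable before it surfaces as a KeyError.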

eddiebergman commented 2 years ago

Okay, thanks for clarifying!