socialfoundations / folktables

Datasets derived from US census data
MIT License

Dataset size #12

Closed kspieks closed 2 years ago

kspieks commented 2 years ago

I have a minor question about the dataset. After pip installing the folktables package, I followed the example code on the readme to try loading the ACSIncome dataset. Page 6 of the paper indicates that ACSIncome should have 1,599,229 rows. However, I'm observing 1,655,429 rows. The filtering steps in the code match what is described on page 21 of the paper. Are there additional filtering steps I'm missing? Please let me know if I've made an error. Code snippet:

from folktables import ACSDataSource, ACSIncome

# download 2018 data for 50 states
state_list = ['AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'FL', 'GA', 'HI',
              'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD', 'MA', 'MI',
              'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY', 'NC',
              'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT',
              'VT', 'VA', 'WA', 'WV', 'WI', 'WY']
data_source = ACSDataSource(survey_year='2018', horizon='1-Year', survey='person')
# data is 3207990 rows x 286 columns
data = data_source.get_data(states=state_list, download=True)

features, labels, groups = ACSIncome.df_to_numpy(data)
features.shape  # (1655429, 10)

Thanks in advance for your help.

Tagging people that are also interested: @romanlutz @britneyting

LequnWang commented 2 years ago

I have similar issues.

from folktables import ACSDataSource, ACSEmployment

data_source = ACSDataSource(survey_year='2018', horizon='1-Year', survey='person')
acs_data = data_source.get_data(download=True)
X, y, group = ACSEmployment.df_to_numpy(acs_data)
print(group.shape)  # (3236107,)

This differs from the 2,320,013 rows listed in the table in the paper.

Thanks in advance for your help.

millerjohnp commented 2 years ago

Hi @kspieks and @LequnWang, thanks for carefully checking the dataset sizes. The dataset sizes reported in the paper are incorrect. For our US-wide experiments, we used a maximum of 100,000 rows per state (randomly subsampling any state that exceeded this limit). The numbers in Table 1 incorrectly reference the size of this subsampled dataset rather than the full underlying dataset. I'll update the numbers in the paper and add a note about this today.

@kspieks in our experiments we also included Puerto Rico (PR) since it's present in the ACS data even though it's not technically a US state.
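For anyone who wants to approximate the per-state cap described above, here is a minimal sketch. It is not the code used for the paper: the helper name `cap_rows_per_state`, the seed, and the toy data are made up for illustration, and it assumes the state code lives in a column named `ST`, as in the ACS person files. The exact row counts from the paper will not be reproduced unless the same random sample is drawn.

```python
import pandas as pd


def cap_rows_per_state(df, state_col="ST", max_rows=100_000, seed=0):
    """Keep at most `max_rows` rows per state.

    States at or below the cap are kept whole; larger states are
    randomly subsampled down to `max_rows` rows.
    (Hypothetical helper, not part of folktables.)
    """
    parts = []
    for _, group in df.groupby(state_col, sort=False):
        if len(group) > max_rows:
            group = group.sample(n=max_rows, random_state=seed)
        parts.append(group)
    return pd.concat(parts)


# Toy stand-in for the ACS frame: three "states" with 5, 2, and 8 rows.
toy = pd.DataFrame({"ST": [1] * 5 + [2] * 2 + [72] * 8, "AGEP": range(15)})
capped = cap_rows_per_state(toy, max_rows=3)
print(capped["ST"].value_counts().sort_index().to_dict())  # {1: 3, 2: 2, 72: 3}
```

On real data, the same helper would presumably be applied to the frame returned by `get_data` (with `'PR'` added to the state list) before calling `df_to_numpy`.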

The correct sizes are updated in the datasheet here.

Thanks for catching this! Let me know if you have any more questions.