Closed kspieks closed 2 years ago
I have similar issues.
    from folktables import ACSDataSource, ACSEmployment

    data_source = ACSDataSource(survey_year='2018', horizon='1-Year', survey='person')
    acs_data = data_source.get_data(download=True)
    X, y, group = ACSEmployment.df_to_numpy(acs_data)
    print(group.shape)  # (3236107,)

This is different from the 2,320,013 rows listed in the table in the paper.
Thanks in advance for your help.
Hi @kspieks and @LequnWang, thanks for carefully checking the dataset sizes. The sizes reported in the paper are incorrect. For our US-wide experiments, we used a maximum of 100,000 rows per state (randomly subsampling any state that exceeded this limit). The numbers in Table 1 incorrectly reference the size of this subsampled dataset rather than the full underlying dataset. I'll update the numbers in the paper and add a note about this today.
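For anyone trying to reproduce a dataset of roughly that size, here is a minimal sketch of a per-state cap using pandas. It is not the authors' code: the helper name `cap_rows_per_state`, the `seed` value, and the assumption that the state is stored in a column named `ST` are illustrative, and the exact rows drawn will not match the ones used in the paper.

```python
import pandas as pd


def cap_rows_per_state(df, state_col="ST", max_rows=100_000, seed=0):
    """Keep at most `max_rows` rows per state, sampling at random when over the cap.

    Hypothetical helper: column name, seed, and sampling order are assumptions.
    """
    parts = []
    for _, state_df in df.groupby(state_col, sort=False):
        if len(state_df) > max_rows:
            # Only states above the cap are subsampled; smaller ones are kept whole.
            state_df = state_df.sample(n=max_rows, random_state=seed)
        parts.append(state_df)
    return pd.concat(parts)


# Toy data standing in for the ACS frame: 5, 3, and 2 rows for three state codes.
toy = pd.DataFrame({"ST": [6] * 5 + [36] * 3 + [72] * 2, "AGEP": range(10)})
capped = cap_rows_per_state(toy, max_rows=3)
print(capped["ST"].value_counts())
```

On the real data one would call `cap_rows_per_state(acs_data)` before passing the frame to `df_to_numpy`; the resulting row count should then be closer to the figure in Table 1 than the full, uncapped count.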
@kspieks in our experiments we also included Puerto Rico (PR) since it's present in the ACS data even though it's not technically a US state.
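If you want to compare counts with and without Puerto Rico, a small pandas filter is enough. This is only a sketch: it assumes the frame has an `ST` column holding FIPS state codes (Puerto Rico's FIPS code is 72), and `drop_puerto_rico` is a hypothetical helper, not part of folktables.

```python
import pandas as pd

PUERTO_RICO_FIPS = 72  # FIPS state code for Puerto Rico


def drop_puerto_rico(df, state_col="ST"):
    """Return a copy of `df` without Puerto Rico rows (assumes FIPS codes in `state_col`)."""
    # astype(int) also handles codes stored as strings such as "72".
    keep = df[state_col].astype(int) != PUERTO_RICO_FIPS
    return df[keep].copy()


# Toy frame: two California rows (6), one New York row (36), two Puerto Rico rows (72).
toy_states = pd.DataFrame({"ST": [6, 6, 36, 72, 72], "AGEP": [30, 41, 25, 52, 19]})
without_pr = drop_puerto_rico(toy_states)
print(len(toy_states), len(without_pr))
```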
The correct sizes are updated in the datasheet here.
Thanks for catching this! Let me know if you have any more questions.
I have a minor question about the dataset. After pip installing the folktables package, I followed the example code on the readme to try loading the ACSIncome dataset. Page 6 of the paper indicates that ACSIncome should have 1,599,229 rows. However, I'm observing 1,655,429 rows. The filtering steps in the code match what is described on page 21 of the paper. Are there additional filtering steps I'm missing? Please let me know if I've made an error. Code snippet:
Thanks in advance for your help.
Tagging people that are also interested: @romanlutz @britneyting