socialfoundations / folktables

Datasets derived from US census data
MIT License

Why restricted to 2014 onwards? #37

Open sehoffmann opened 1 year ago

sehoffmann commented 1 year ago

Dear Authors,

Thanks for curating this amazing dataset. We plan to use it to research continuous distribution shifts across time. For that, long time horizons are very beneficial, both to highlight the shift and to have enough time points for extrapolation.

2014 is currently enforced as a hard threshold in the code, and I was wondering what the reason for that is. A quick test showed that older years are still accessible at the same API endpoint. Are there any major differences in format or data quality, for instance missing variables?

If so, I would be willing to submit a PR to make this older historical data available as well, provided it can be adapted to the current formats. I would be very glad if you could point me in the right direction.

Best Regards from Tübingen

sehoffmann commented 1 year ago

By disabling the check, I was able to download data going back to 2007 without any further modification. For 2006 and earlier, the API endpoint seems to change.

sehoffmann commented 1 year ago

Older PUMS data is available under this endpoint: https://www2.census.gov/programs-surveys/acs/data/pums/2003/

sehoffmann commented 1 year ago

OK, my understanding is that 2014 was excluded (the issue only really affects 2014) because the PINCP column contains empty strings, which cause the string -> float conversion to fail. I will submit a fix soon.
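To illustrate the failure mode described above, here is a minimal sketch of one way to tolerate empty cells during the string -> float conversion. It is not the fix from the PR. The helper name `parse_optional_float` and the tiny sample extract are hypothetical, and mapping empty cells to NaN is an assumption about the desired behavior.

```python
import csv
import io
import math


def parse_optional_float(raw: str) -> float:
    """Convert a CSV cell to float, mapping empty or whitespace-only cells to NaN."""
    text = raw.strip()
    if not text:
        # A bare float("") would raise ValueError here.
        return math.nan
    return float(text)


# Hypothetical miniature extract; real PUMS files have many more columns.
sample_csv = "SERIALNO,PINCP\n1,52000\n2,\n3,-300\n"

rows = list(csv.DictReader(io.StringIO(sample_csv)))
pincp = [parse_optional_float(row["PINCP"]) for row in rows]
```

With pandas, `pd.to_numeric(column, errors="coerce")` gives comparable behavior by turning unparseable values into NaN, though it also hides any other malformed entries.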

tombewley commented 1 year ago

Hi @sehoffmann, I've just come across this repo and I'm also interested in looking at longer time horizons. I just thought I'd quickly check whether you've been able to use this modified code successfully in your own work? I can see that your PR hasn't yet been merged, but if it's working for you then I may just adopt it in my local copy of the code.

mrtzh commented 6 months ago

Sorry for the slow response. Similar requests have come up in the past. The reason we didn't implement this at first is that some of the attribute encodings change between years. So while nothing may break loudly, you'd still have to worry about harmonizing feature encodings across different years. This was a task we didn't have sufficient resources to take on.

See, for example, the discussion in https://github.com/socialfoundations/folktables/issues/22
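As a rough sketch of what harmonizing encodings could look like, one option is a per-year lookup table that maps each year's raw codes to a shared set of labels. Everything below is hypothetical: the years, codes, labels, and the `harmonize` helper are placeholders for illustration and are not taken from the actual ACS data dictionaries.

```python
from typing import Dict, List, Optional

# HYPOTHETICAL code tables: the same raw code can mean different things in
# different years, so each year gets its own mapping to a common label.
YEAR_TO_COMMON: Dict[int, Dict[int, str]] = {
    2010: {1: "category_a", 2: "category_b"},
    2018: {1: "category_a", 2: "category_c", 3: "category_b"},
}


def harmonize(year: int, codes: List[int]) -> List[Optional[str]]:
    """Map raw per-year codes to common labels; unknown codes become None."""
    table = YEAR_TO_COMMON[year]
    return [table.get(code) for code in codes]
```

Returning `None` for unmapped codes makes gaps visible rather than silently carrying a code whose meaning may have changed. The hard part remains building and validating the tables for every attribute and year.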

Please let me know if you believe you have a general solution to this problem. I think it would certainly be nice to have in the package.