socialfoundations / folktables

Datasets derived from US census data
MIT License
234 stars 20 forks

Automatic categories #24

Closed jenno-verdonck closed 2 years ago

jenno-verdonck commented 2 years ago

I noticed the addition of a df_to_pandas method that takes a categories dictionary as an argument. I added some methods that generate this dictionary automatically from the definitions available here. This functionality is only available for year >= 2017, as definition .csv files aren't available for earlier years. The code also does not translate PUMA codes, as these are only unique in combination with the state code and require an additional file.
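To illustrate the idea of deriving a categories dictionary from a definitions file, here is a minimal sketch using only the standard library. It is not the code from this pull request: the function name `build_categories`, the three-column layout (`feature`, `code`, `label`), and the sample rows are all assumptions made for illustration, and the real definition files have a different, richer layout. Codes are stored as floats on the assumption that the data columns hold floats.

```python
import csv
import io

# Hypothetical, simplified definitions file (feature, code, label).
# Both the layout and the rows are illustrative assumptions.
SAMPLE_DEFINITIONS = """feature,code,label
SEX,1,Male
SEX,2,Female
MAR,1,Married
MAR,5,Never married
PUMA,100,Some area
"""


def build_categories(definitions_csv):
    """Return {feature: {code: label}} parsed from a definitions CSV string."""
    categories = {}
    for row in csv.DictReader(io.StringIO(definitions_csv)):
        feature = row["feature"]
        # PUMA codes are only unique in combination with the state code,
        # so they are left untranslated here.
        if feature == "PUMA":
            continue
        # Assumption: codes are keyed as floats to match float-typed columns.
        categories.setdefault(feature, {})[float(row["code"])] = row["label"]
    return categories


categories = build_categories(SAMPLE_DEFINITIONS)
print(categories["SEX"])
```

The resulting nested dictionary maps each feature name to its code-to-label mapping, which is the general shape a categories argument of this kind would be expected to take.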

jenno-verdonck commented 2 years ago

@AndresAlgaba Thank you for the suggestions. I applied all the changes you suggested to the code.

jenno-verdonck commented 2 years ago

Hi! I really appreciate the contribution. I'd love to see a few changes, if possible, so that this code meets the requirements of the main package codebase. Please see my comments for suggestions. Thanks!

Thanks for the review. I pushed some changes addressing your comments. The generate_categories method has been moved to the load_acs.py file. As a result, the user now has to specify which features to include in the categories dictionary. I also added generate_categories to the __init__.py file so that users can import the function more easily.

The -99999999999999.0 trick has been simplified, but it is still needed if a placeholder definition for non-numeric values is desired (some attributes provide their own definition for these values, others don't). I added some comments to explain the reason behind this trick.
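One possible reading of the placeholder trick, sketched with the standard library only. This is an assumption about the intent, not the code from this pull request: the helper names `to_code` and `build_mapping`, the default label "N/A", and the sample codes are hypothetical. The idea shown is that any code that cannot be parsed as a number is mapped to a single sentinel key, and a default label is added only when the attribute does not define one itself.

```python
PLACEHOLDER = -99999999999999.0  # sentinel value mentioned in this thread


def to_code(raw):
    """Parse a raw code string; non-numeric codes become the placeholder."""
    try:
        return float(raw)
    except ValueError:
        return PLACEHOLDER


def build_mapping(definitions, default_label="N/A"):
    """Build {code: label} for one attribute from (raw_code, label) pairs."""
    mapping = {to_code(raw): label for raw, label in definitions}
    # Keep the attribute's own label for non-numeric codes if it has one;
    # otherwise fall back to the (hypothetical) default label.
    mapping.setdefault(PLACEHOLDER, default_label)
    return mapping


# Attribute that defines its own label for a non-numeric code ("b").
with_own = build_mapping([("b", "Not applicable (custom)"), ("1", "Employee")])
# Attribute that only defines numeric codes.
without_own = build_mapping([("1", "Male"), ("2", "Female")])

print(with_own[PLACEHOLDER])
print(without_own[PLACEHOLDER])
```

A fixed sentinel may be more convenient than NaN as a dictionary key, since NaN does not compare equal to itself, though whether that was the motivation here is not stated in the thread.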

Finally, I improved the docstring and added some basic tests.