socialfoundations / folktables

Datasets derived from US census data
MIT License
234 stars 20 forks source link

add df_to_pandas #21

Closed AndresAlgaba closed 2 years ago

AndresAlgaba commented 2 years ago

Issue #17 mentions the usefulness of a df_to_pandas method in the BasicProblem object. Based on the df_to_numpy code, I have added this method with some tests to check whether it results in identical arrays.

AndresAlgaba commented 2 years ago

Continuing on the suggestions in Issue #17, I have added a categories and dummies argument to the df_to_pandas method.

The categories argument accepts a "nested" dict with the categorical features as keys and an additional dict with the transformations from numerical values to strings as values. I've added a "folktables/acs_categories.py" file where such a preliminary example dict for the ACSIncome data set is available. It is based on Appendix B.1 from the original paper. However, some variables, such as "OCCP," contain too many categories. I wonder whether you have any suggestions on how to deal with this and whether it would be useful to create such a "standardized" dict for each task?

Extending on this, the dummies argument is a bool to indicate whether dummies should be created for the categorical variables (if categories=None then nothing happens with dummies=True). Overall, I feel that both arguments are especially useful for a pandas DataFrame, and I am not sure whether it would be helpful to add these arguments also for the df_to_numpy method? If so, I could look into adding these to the BasicProblem object.

I haven't added any testing for these, as it is still experimental. If preferred, I could also work on this in a separate PR, but for now, I leave it here as it builds on the df_to_pandas method.

AndresAlgaba commented 2 years ago

I've also added a small example to the README.

mrtzh commented 2 years ago

Hi Andres, I appreciate the contribution. I just had a chance to look closer into it and I like what you propose.

There's one change I'd suggest. The feature encodings and descriptions in ACSIncome_categories are from the the US Census documentation and are subject to change. I'd prefer not to hardwire these into the folktables codebase, since that would mean we'd have to react more strongly to future changes in the US Census data and documentation.

I think ACSIncome_categories is very valuable, but that it should live in third party space. There are two options I see:

In either case, please add a note that these features are subject to change.

Does this sound good?

CC @francesding @millerjohnp @ludwigschmidt - This ties into a somewhat broader design choice. So, please chime in if you have thoughts.

AndresAlgaba commented 2 years ago

Dear Moritz, thanks to you and the entire team for the fantastic effort on this library and the general contributions to the fairness literature!

I agree with your suggestion. I have placed such an example in the README. Could we merge this example with the content of the ACSIncome_categories.py in a Jupyter notebook and put it in an example folder referencing the US Census documentation? I would maybe leave the pandas and CSV-saving part in the README as this may be of interest to a general audience.

Do you have any suggestions on grouping some variables, such as "OCCP," which contain too many categories?

As it is indeed a broader design choice, I will wait a few days before continuing to work on this such that the entire team has some time to consider the proposed changes.

mrtzh commented 2 years ago

It looks like folks had a chance to chime in and @millerjohnp is fine with your suggestion. So please update the pull request accordingly and I will go ahead and merge it. Thanks again for your contribution!

AndresAlgaba commented 2 years ago

Great! I have added the examples folder and refer to the categories and dummies arguments in the df_to_pandas method as "encoding categorical features". I have also slightly updated the README, most importantly the header by removing one # as I feel it may be a subsection on its own? Please let me know your thoughts about the changes and whether any updates are required.

mrtzh commented 2 years ago

LGTM - Thanks!