Closed AndresAlgaba closed 2 years ago
Continuing on the suggestions in Issue #17, I have added a categories
and dummies
argument to the df_to_pandas
method.
The categories
argument accepts a "nested" dict with the categorical features as keys and an additional dict with the transformations from numerical values to strings as values. I've added a "folktables/acs_categories.py" file where such a preliminary example dict for the ACSIncome data set is available. It is based on Appendix B.1 from the original paper. However, some variables, such as "OCCP," contain too many categories. I wonder whether you have any suggestions on how to deal with this and whether it would be useful to create such a "standardized" dict for each task?
Extending on this, the dummies
argument is a bool to indicate whether dummies should be created for the categorical variables (if categories=None
then nothing happens with dummies=True
). Overall, I feel that both arguments are especially useful for a pandas DataFrame, and I am not sure whether it would be helpful to add these arguments also for the df_to_numpy
method? If so, I could look into adding these to the BasicProblem
object.
I haven't added any testing for these, as it is still experimental. If preferred, I could also work on this in a separate PR, but for now, I leave it here as it builds on the df_to_pandas
method.
I've also added a small example to the README.
Hi Andres, I appreciate the contribution. I just had a chance to look closer into it and I like what you propose.
There's one change I'd suggest. The feature encodings and descriptions in ACSIncome_categories
are from the the US Census documentation and are subject to change. I'd prefer not to hardwire these into the folktables
codebase, since that would mean we'd have to react more strongly to future changes in the US Census data and documentation.
I think ACSIncome_categories
is very valuable, but that it should live in third party space. There are two options I see:
df_to_pandas
examples
and put it in there.In either case, please add a note that these features are subject to change.
Does this sound good?
CC @francesding @millerjohnp @ludwigschmidt - This ties into a somewhat broader design choice. So, please chime in if you have thoughts.
Dear Moritz, thanks to you and the entire team for the fantastic effort on this library and the general contributions to the fairness literature!
I agree with your suggestion. I have placed such an example in the README. Could we merge this example with the content of the ACSIncome_categories.py
in a Jupyter notebook and put it in an example folder referencing the US Census documentation? I would maybe leave the pandas and CSV-saving part in the README as this may be of interest to a general audience.
Do you have any suggestions on grouping some variables, such as "OCCP," which contain too many categories?
As it is indeed a broader design choice, I will wait a few days before continuing to work on this such that the entire team has some time to consider the proposed changes.
It looks like folks had a chance to chime in and @millerjohnp is fine with your suggestion. So please update the pull request accordingly and I will go ahead and merge it. Thanks again for your contribution!
Great! I have added the examples folder and refer to the categories
and dummies
arguments in the df_to_pandas
method as "encoding categorical features". I have also slightly updated the README, most importantly the header by removing one # as I feel it may be a subsection on its own? Please let me know your thoughts about the changes and whether any updates are required.
LGTM - Thanks!
Issue #17 mentions the usefulness of a
df_to_pandas
method in theBasicProblem
object. Based on thedf_to_numpy
code, I have added this method with some tests to check whether it results in identical arrays.