socialfoundations / folktables

Datasets derived from US census data

Postprocessing step np.nan_to_num #39

Open cruyffturn opened 8 months ago

cruyffturn commented 8 months ago

Hello, thank you for providing this package. In the postprocessing step of `acs.ACSIncome`, `np.nan_to_num` is applied with the second positional argument `-1`. The second argument of `np.nan_to_num` is [copy](https://numpy.org/doc/stable/reference/generated/numpy.nan_to_num.html), which is a boolean. Does `-1` imply `copy=False`, or was the goal to replace missing values with `-1`?

N-Masi commented 8 months ago

`-1` implies `copy=True` (you can test this with `assert bool(-1)`). `copy=True` is the intended behavior for the postprocessing, as you can see on folktables.py:101. This leads me to believe it is coincidental that `-1` as the second argument happens to produce an intended behavior (copying), and that the authors actually meant for NaNs to be replaced by `-1`, which would be achieved with `lambda x: np.nan_to_num(x, nan=-1)` (this also copies by default).
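For concreteness, a quick NumPy-only check of the two call forms shows the difference:

```python
import numpy as np

x = np.array([1.0, np.nan, 3.0])

# The second positional argument of nan_to_num is `copy`; bool(-1) is True,
# so this call is copy=True and NaN gets the default fill value 0.0.
print(np.nan_to_num(x, -1))       # [1. 0. 3.]

# Passing the fill value by keyword replaces NaN with -1 (copy=True stays the default).
print(np.nan_to_num(x, nan=-1))   # [ 1. -1.  3.]
```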

The same postprocess (`lambda x: np.nan_to_num(x, -1)`) is applied to all `BasicProblem`s defined in acs.py, which changes the results of using the package. When you change the postprocessing of `ACSEmployment` to `nan=-1`, the equality-of-opportunity violation reported in the first README example for Alabama changes from 0.0871 to 0.0789.
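A rough sketch of how one could rerun the README comparison with that change (this assumes `BasicProblem` exposes its constructor arguments as same-named attributes, so the names below may need adjusting):

```python
import numpy as np
from folktables import ACSDataSource, ACSEmployment, BasicProblem

# Variant of ACSEmployment whose postprocess fills NaN with -1 instead of 0.
# The attribute accesses below are assumptions about BasicProblem's API.
ACSEmploymentNanMinus1 = BasicProblem(
    features=ACSEmployment.features,
    target=ACSEmployment.target,
    target_transform=ACSEmployment.target_transform,
    group=ACSEmployment.group,
    postprocess=lambda x: np.nan_to_num(x, nan=-1),
)

data_source = ACSDataSource(survey_year='2018', horizon='1-Year', survey='person')
acs_data = data_source.get_data(states=["AL"], download=True)
features, label, group = ACSEmploymentNanMinus1.df_to_numpy(acs_data)
# ...then train and evaluate exactly as in the README to compare the
# equality-of-opportunity violation against the reported 0.0871.
```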

Because the values of ACS variables are largely categorical, I think there should be more discussion around what to do with NaN values rather than just replacing them with `-1`. Perhaps entries with a NaN value for some variable should instead be thrown out (see the sketch below)? If the answer is to just let the user decide, then a discussion of this question and of alternative postprocessing functions, with their implications, is definitely warranted in the README.
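For instance, dropping incomplete rows before conversion would look roughly like this (again assuming the task object exposes its feature column names as `.features`):

```python
from folktables import ACSDataSource, ACSEmployment

acs_data = ACSDataSource(survey_year='2018', horizon='1-Year',
                         survey='person').get_data(states=["AL"], download=True)

# Drop rows with a missing value in any feature column instead of imputing;
# `.features` is assumed to hold the task's feature column names.
features, label, group = ACSEmployment.df_to_numpy(
    acs_data.dropna(subset=ACSEmployment.features))
```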

mrtzh commented 4 months ago

Right, I just stumbled over this while reviewing https://github.com/socialfoundations/folktables/pull/41

If I had to guess what happened, I'd say that the way we called `nan_to_num` was unintended. It has the effect that NaNs are replaced with 0, not with -1.

I'll have to think more about what the best solution for dealing with NaNs is. Given that this package has been used for quite some time, my inclination might be to continue replacing NaNs with 0 to ensure replicability of the various results that might depend on it.

AndreFCruz commented 4 months ago

Personally I think it'd make sense to keep the previous unintended behavior (replace with 0s instead of -1s) to maintain reproducibility of results 😅

One potential issue with replacing NaNs with 0s is that 0 is a valid value that is present in multiple columns.

I looked into it, and only `ACSHealthInsurance` seems to have this problem in a significant part of the dataset (17% NaNs and 10% genuine 0s get mapped to the same 0 representation). Other tasks have plenty of NaNs but no 0s.

See the last cell of this notebook for a more detailed investigation of which tasks and columns are affected.
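Roughly, the per-task check boils down to something like the following (assuming each task exposes `.features` and the raw dataframe comes straight from `ACSDataSource`):

```python
from folktables import ACSDataSource, ACSHealthInsurance

acs_data = ACSDataSource(survey_year='2018', horizon='1-Year',
                         survey='person').get_data(states=["AL"], download=True)

# For each feature column, compare the share of NaNs with the share of genuine
# zeros; after the current postprocess both end up encoded as 0.
for col in ACSHealthInsurance.features:
    frac_nan = acs_data[col].isna().mean()
    frac_zero = (acs_data[col] == 0).mean()
    if frac_nan > 0 and frac_zero > 0:
        print(f"{col}: {frac_nan:.1%} NaN and {frac_zero:.1%} zeros collapse to 0")
```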

Since only one task seems to be affected by this, I suggest keeping the previous behavior and adding a warning in the package repository. Also, the implicit NaN-to-0 mapping should perhaps be made explicit with `nan_to_num(x, copy=True, nan=0.0)` instead of hiding behind `nan_to_num`'s default value.

ericvd-ucb commented 4 months ago

FWIW the teachability of this is awesome. If we could find some results that changed due to the treatment of the NaNs, that would be great for a teaching lesson, e.g. on how data processing decisions affect results.