pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

DOC: get_dummies behaviour with dummy_na = True is counter-intuitive / incorrect when no NaN values present #59968

Open DM-Berger opened 3 days ago

DM-Berger commented 3 days ago

Location of the documentation

https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html

Documentation problem

The docs for this function currently read:

Add a column to indicate NaNs, if False NaNs are ignored.

However, when no NaN values are present, a useless constant NaN indicator column is still added:

>>> df = pd.DataFrame(pd.Series([0, 1], dtype="int64"))
>>> df
   0
0  0
1  1
>>> pd.get_dummies(df, columns=[0], dummy_na=True, drop_first=True)
   0_1.0  0_nan
0  False  False
1   True  False
>>> pd.get_dummies(df, columns=[0], dummy_na=True)
   0_0.0  0_1.0  0_nan
0   True  False  False
1  False   True  False

This is arguably quite unexpected behaviour, as constant columns do not contain information except in some very rare cases and for specific custom models.

I.e. for almost any kind of model this column will simply be ignored, yet it will clutter e.g. .feature_importances_ outputs and may needlessly increase compute time for algorithms that scale significantly with the number of features but have no mechanism for ignoring constant inputs. For data with many binary features, and pipelines or models that also convert to floating-point dtypes, all these useless extra constant features can also significantly increase memory requirements (especially in multiprocessing contexts).

I imagine the intended design decision is that if you do e.g. df1, df2 = train_test_split(df), and it turns out that df1 has no NaN values for some feature "f" but df2 does, then at least with the current implementation the user is guaranteed that the following does not raise an AssertionError:

dummies1 = pd.get_dummies(df1, columns=["f"], dummy_na=True)
dummies2 = pd.get_dummies(df2, columns=["f"], dummy_na=True)
assert dummies1.columns.tolist() == dummies2.columns.tolist()

But still, that is a pretty strange use-case, as dummification should generally happen right at the beginning on the full data.

In my opinion, the default behaviour should be to only add a NaN indicator column if NaN values are actually present. I would actually consider this an implementation or design bug for something like 99% of use-cases. But at a bare minimum, this undesirable and unexpected behaviour should be documented, with some reasoning.

Suggested fix for documentation

Just change the docs to:

Add a column to indicate NaNs. If True, a NaN indicator column will be added even if no NaN values are present [insert reasoning here]. If False, NaNs are ignored [though "ignored" is also unclear, as per this issue].

saldanhad commented 2 days ago

In an ideal modeling scenario one would only proceed to the encoding step after completing the data cleaning process, when no null values are present; in that case the extra column generated is harmless, since it has no variance to be picked up as a feature, irrespective of the model. Having said that, I would not want to remove this feature, because it could potentially be used as a final data prep step to detect whether there are any pending nulls left to be treated.
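
For instance, a rough sketch of that kind of check (the DataFrame and column names here are purely illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [0, 1, 1], "b": [0.0, np.nan, 1.0]})  # "b" still has a null
encoded = pd.get_dummies(df, columns=list(df.columns), dummy_na=True)

# any _nan indicator column containing a True flags a feature with untreated nulls
nan_cols = [c for c in encoded.columns if c.endswith("_nan")]
print([c for c in nan_cols if encoded[c].any()])  # ['b_nan']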

However, I do agree with adding this to the docs. Maybe a brief comment on the dummy_na=True example would suffice.

DM-Berger commented 2 days ago

In an ideal modeling scenario one would only proceed to the encoding step after completing the data cleaning process, when no null values are present;

This is not quite right IMO, as there are many ways to deal with NaN values, one of them being to convert them to indicators rather than to interpolate them. I.e. this function can just as much be regarded as a cleaning step.

Another example of where this behaviour is extremely counter-intuitive / unproductive: if a DataFrame X is such that all n features are binary indicator values, with NaN values possible but only present in k of the n features, then pd.get_dummies(X, columns=X.columns, dummy_na=True, drop_first=True) will produce a new DataFrame with n new _nan columns, n - k of which are all constant and identical to each other.
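
A small illustration, assuming n = 3 binary features with NaN present in only one of them (the data is made up):

import numpy as np
import pandas as pd

X = pd.DataFrame({
    "a": [0, 1, 1, 0],
    "b": [0.0, 1.0, np.nan, 1.0],  # only "b" contains a NaN
    "c": [1, 0, 1, 1],
})
out = pd.get_dummies(X, columns=X.columns, dummy_na=True, drop_first=True)
print(out.columns.tolist())
# ['a_1.0', 'a_nan', 'b_1.0', 'b_nan', 'c_1.0', 'c_nan']
# a_nan and c_nan are constant (all False) and identical; only b_nan carries information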

This is not only extremely counter-intuitive, but if, as you said, encoding is supposed to be done "after cleaning", then this behaviour basically forces the user to clean the data again after encoding, meaning it breaks your ideal modeling scenario data flow as well (necessitating clean -> encode -> clean again).

Of course, it is trivial to do this step yourself using DataFrame.isnull and a pd.concat, but pd.get_dummies would seem to be the function one would want to reach for to do this automatically.
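
Something along these lines, for example (a sketch only; the helper name is made up):

import pandas as pd

def encode_with_minimal_na_indicators(df, columns):  # hypothetical helper
    # add a <col>_nan indicator only for the columns that actually contain NaNs
    has_na = [c for c in columns if df[c].isnull().any()]
    indicators = df[has_na].isnull().add_suffix("_nan")
    encoded = pd.get_dummies(df, columns=columns, dummy_na=False, drop_first=True)
    return pd.concat([encoded, indicators], axis=1)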

the extra column generated is harmless since it has no variance to be picked up as a feature irrespective of the model

This is unfortunately just not true. There is nothing harmless about e.g. n - k extra constant and identical columns if the downstream model is, say, a deep learner (in which case these constant columns will often be converted to some floating-point dtype, and extra gradient computations will be wasted on a bunch of useless constants), or if the data is passed to e.g. a stepwise feature selection model. As I also said, in a multiprocessing context a pile of unexpected, useless new features (especially if converted to float) can amount to a significant quantity of extra memory, especially on a cluster with e.g. 80 cores. This can happen in automated ML pipelines and the like. Whether extra constant and identical columns are harmless depends on the use-case and the downstream task.

But anyway, I agree this is all fixed by just making the documentation clearer. It would also be nice to know the reasoning though, because I don't think "constant columns are harmless" is really generally true. If this is an assumption baked into pd.get_dummies behaviour, the user should be informed.

rhshadrach commented 2 days ago

In my opinion, the default behaviour should be to only add a NaN indicator column if NaN values are actually present.

In my opinion, it would be surprising to pass dummy_na=True and not get the corresponding columns. In addition, the proposed behavior would make the schema of the result dependent on the values of the input. This is very undesirable behavior.

I'm strongly opposed here.

DM-Berger commented 2 days ago

@rhshadrach I can definitely understand why people might have different expectations here about the behaviour, but that is why I ultimately filed this as a documentation issue. I wouldn't want behaviour to change for people who expected the current behaviour.

Are you saying you oppose changing the documentation as well, though?

EDIT: And if the reasoning above is that we want pd.get_dummies outputs to not depend on the values of the inputs, that is already not the case, since the number of columns produced by get_dummies usually does depend on the input:

>>> df = pd.DataFrame({"a": [0, 1, 1], "b": [0, 0, 1]})
>>> pd.get_dummies(df, columns=["a", "b"], dummy_na=True, drop_first=True)
   a_1.0  a_nan  b_1.0  b_nan
0  False  False  False  False
1   True  False  False  False
2   True  False   True  False

>>> df.iloc[2, 0] = 2
>>> pd.get_dummies(df, columns=["a", "b"], dummy_na=True, drop_first=True)
   a_1.0  a_2.0  a_nan  b_1.0  b_nan
0  False  False  False  False  False
1   True  False  False  False  False
2  False   True  False   True  False

DM-Berger commented 2 days ago

Perhaps just to support the idea that people might have very different intuitions here: e.g. scikit-learn's MissingIndicator has the following argument:

features {‘missing-only’, ‘all’}, default=’missing-only’

Whether the imputer mask should represent all or a subset of features.

  • If 'missing-only' (default), the imputer mask will only represent features containing missing values during fit time.
  • If 'all', the imputer mask will represent all features.

I.e. the default behaviour is arguably closer to what I expected. I think a simple documentation change would make this clearer to users of pd.get_dummies.
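
For example (a quick sketch against scikit-learn, with made-up data):

import numpy as np
from sklearn.impute import MissingIndicator

X = np.array([[0.0, 1.0],
              [np.nan, 0.0],
              [1.0, 1.0]])

# default features='missing-only': the mask covers only the first column,
# since it is the only one with missing values at fit time
print(MissingIndicator().fit_transform(X).shape)                # (3, 1)

# features='all': one indicator column per input feature
print(MissingIndicator(features="all").fit_transform(X).shape)  # (3, 2)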

rhshadrach commented 1 day ago

The addition of

If True, a NaN indicator column will be added even if no NaN values are present [insert reasoning here].

seems unnecessary to me, but I have no real opposition to its inclusion. I do agree that "if False NaNs are ignored." could use better clarification; I'd suggest "If False, NA values are encoded as all zero."
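
For illustration, this is roughly what "encoded as all zero" means (the row with NA gets False in every dummy column):

>>> pd.get_dummies(pd.Series([0.0, 1.0, None]), dummy_na=False)
     0.0    1.0
0   True  False
1  False   True
2  False  False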

the number of columns produced by get_dummies usually does depend on the input

This is in the definition of the operation, so it is unavoidable. The point is to avoid it wherever possible.

rhshadrach commented 1 day ago

PRs for improving the documentation are welcome!

saldanhad commented 1 day ago

take