Open DM-Berger opened 3 days ago
In an ideal modeling scenario one would only proceed to the encoding step after completing data cleaning, with no null values present, in which case the extra column generated is harmless since it has no variance to be picked up as a feature, irrespective of the model. Having said that, I would not want to remove this feature, because it could potentially be used as a final data-prep step to detect whether any pending nulls are left to be treated.
However, I do agree with adding it to the docs. Maybe adding a brief comment to the `dummy_na=True` example would suffice.
> In an ideal modeling scenario one would only proceed to encoding step if they have completed the data cleaning process and no null values are present,
This is not quite right IMO, as there are many ways to deal with NaN values, one of them being to convert them to indicators rather than to interpolate them. I.e. this function can just as much be regarded as a cleaning step.
Another example of where this behaviour is extremely counter-intuitive / unproductive: if a DataFrame `X` is such that all n features are binary indicator values, but NaN values are possible and present in only k of the n features, then using `pd.get_dummies(X, columns=list(X.columns), dummy_na=True, drop_first=True)` will produce a new DataFrame with n new NaN-indicator columns, n - k of which are constant (all `False`) and identical to each other.
This is not only extremely counter-intuitive, but if, as you said, encoding is supposed to be done "after cleaning", then this behaviour basically forces the user to clean the data again after encoding, meaning it breaks your ideal modeling scenario data flow as well (necessitating clean -> encode -> clean again).
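To make that concrete, a sketch under current pandas behaviour (the feature names `f1`–`f3` and the data are my own, purely illustrative):

```python
# Hypothetical illustration: n = 3 binary features, NaN present in only
# k = 1 of them, yet dummy_na=True emits a NaN indicator for every feature.
import numpy as np
import pandas as pd

X = pd.DataFrame({
    "f1": [0.0, 1.0, np.nan],  # the only feature that actually has a NaN
    "f2": [0.0, 1.0, 1.0],
    "f3": [1.0, 0.0, 1.0],
})
out = pd.get_dummies(X, columns=list(X.columns), dummy_na=True, drop_first=True)

# The NaN indicators for f2 and f3 are constant (all False) and identical.
constant = [c for c in out.columns if c.endswith("_nan") and out[c].nunique() == 1]
```

Only `f1_nan` carries any information; the other two indicator columns are exactly the constant clutter described above.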
Of course, it is trivial to do this step yourself using `DataFrame.isnull` and a `pd.concat`, but `pd.get_dummies` would seem to be the function one would want to reach for to do this automatically.
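A manual version might look like this (a sketch; the `_nan` suffix and the `fillna(0)` imputation are my choices, not anything pandas prescribes):

```python
# Build NaN indicators only for the columns that actually contain NaNs,
# then concatenate them onto the (separately imputed) frame.
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 0.0], "b": [1.0, 0.0, 1.0]})

has_na = df.columns[df.isnull().any()]             # only column "a" here
indicators = df[has_na].isnull().add_suffix("_nan")

# impute however is appropriate; fillna(0) is purely illustrative
result = pd.concat([df.fillna(0), indicators], axis=1)
```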
> the extra column generated is harmless since it has no variance to be picked up as a feature irrespective of the model
This is unfortunately just not true at all. There is nothing harmless about a pile of extra constant and identical columns if the downstream model is, say, a deep learner (in which case these constant columns will often be converted to some floating-point type, and there will then be extra gradient computations over a bunch of useless constants), or if that data is being passed to e.g. a stepwise feature-selection model. As I also said, in a multiprocessing context, unexpected useless new features (especially after conversion to float) can cost a significant amount of extra memory, especially on a cluster with e.g. 80 cores. This can happen in automated ML pipelines and the like. Whether extra constant and identical columns are harmless or not depends on the use-case and downstream task.
But anyway, I agree this is all fixed by just making the documentation clearer. It would also be nice to know the reasoning, though, because I don't think "constant columns are harmless" is generally true. If this is an assumption baked into the `pd.get_dummies` behaviour, the user should be informed.
> In my opinion the default behaviour should be to only add a NaN indicator column if NaN values are actually present...
In my opinion, it would be surprising to pass `dummy_na=True` and not get the corresponding columns. In addition, the proposed behavior would make the schema of the result depend on the values of the input. This is very undesirable behavior.
I'm strongly opposed here.
@rhshadrach I can definitely understand why people might have different expectations here about the behaviour, but that is why I ultimately filed this as a documentation issue. I wouldn't want behaviour to change for people who expected the current behaviour.
Are you saying you oppose changing the documentation as well, though?
EDIT: And if the above reasoning is that we want `pd.get_dummies` outputs to not depend on the values of the inputs, that is also currently not the case, since the number of columns produced by `get_dummies` obviously does usually depend on the input:
```python
>>> df = pd.DataFrame({"a": [0, 1, 1], "b": [0, 0, 1]})
>>> pd.get_dummies(df, columns=["a", "b"], dummy_na=True, drop_first=True)
   a_1.0  a_nan  b_1.0  b_nan
0  False  False  False  False
1   True  False  False  False
2   True  False   True  False
>>> df.iloc[2, 0] = 2
>>> pd.get_dummies(df, columns=["a", "b"], dummy_na=True, drop_first=True)
   a_1.0  a_2.0  a_nan  b_1.0  b_nan
0  False  False  False  False  False
1   True  False  False  False  False
2  False   True  False   True  False
```
Perhaps just to support the idea that people might have very different intuitions here: e.g. scikit-learn's `MissingIndicator` has the following argument:
> features : {'missing-only', 'all'}, default='missing-only'
>
> Whether the imputer mask should represent all or a subset of features.
>
> - If 'missing-only' (default), the imputer mask will only represent features containing missing values during fit time.
> - If 'all', the imputer mask will represent all features.
I.e. the default behaviour there is arguably closer to what I expected. I think a simple documentation change would make this clearer to users of `pd.get_dummies`.
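For comparison, a minimal sketch of the scikit-learn behaviour quoted above (`MissingIndicator` with its default `features='missing-only'`):

```python
import numpy as np
from sklearn.impute import MissingIndicator

X = np.array([[1.0, np.nan],
              [2.0, 3.0]])

# default features='missing-only': an indicator column is produced only
# for the one feature that actually contains a missing value
mask = MissingIndicator().fit_transform(X)
```

`mask` has shape `(2, 1)`, not `(2, 2)`: the NaN-free first feature gets no indicator, unlike `pd.get_dummies(..., dummy_na=True)`.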
The addition of

> If True, a NaN indicator column will be added even if no NaN values are present [insert reasoning here].

seems unnecessary to me; however, I have no real opposition to its inclusion. I do agree that

> if False NaNs are ignored.

could use better clarification; I'd suggest

> If False, NA values are encoded as all zero.
> the amount of columns produced by get_dummies does indeed usually depend on the input
This is inherent in the definition of the operation, so it is unavoidable. The point is to avoid it wherever possible.
PRs for improving the documentation are welcome!
take
Pandas version checks

- [x] I have checked that the issue still exists on the latest versions of the docs on `main` here

Location of the documentation
https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html
Documentation problem
The docs for this function currently read:
However, when no NaN values are present, a useless constant NaN indicator column is still added:
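For example (a minimal illustration of my own, not the original snippet):

```python
import pandas as pd

s = pd.Series(["a", "b", "a"])  # no missing values anywhere
out = pd.get_dummies(s, dummy_na=True)

# out has three columns: 'a', 'b', and a NaN indicator that is constant False
```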
This is arguably quite unexpected behaviour, as constant columns do not contain information except in some very rare cases and for specific custom models.
I.e. for almost any kind of model this column will be ignored (but then annoyingly clutter e.g. `.feature_importances_` attributes, and perhaps also needlessly increase compute times for algorithms that scale significantly with the number of features but have no way of ignoring constant input features). For data with a lot of binary features, and pipelines or models which might also do e.g. conversion to floating-point dtypes, all these useless extra constant features could significantly increase memory requirements (especially in multiprocessing contexts).

I imagine the intended design decision is that if you do e.g.
`df1, df2 = train_test_split(df)`, and it ends up such that `df1` doesn't have any NaN values for some feature `"f"` but `df2` does, then at least with the current implementation the user can be assured that the following does not raise an `AssertionError`:

But still, that is a pretty strange use-case, as dummification should generally happen right at the beginning, on the full data.
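Presumably the check alluded to above was a schema-consistency assertion along these lines (a hypothetical reconstruction, not the original snippet):

```python
# With dummy_na=True both splits get the same dummy columns, even though
# only one of them actually contains a NaN in feature "f".
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"f": [0.0, 1.0]})           # no NaN in this split
df2 = pd.DataFrame({"f": [0.0, 1.0, np.nan]})   # NaN present here

d1 = pd.get_dummies(df1, columns=["f"], dummy_na=True)
d2 = pd.get_dummies(df2, columns=["f"], dummy_na=True)

assert list(d1.columns) == list(d2.columns)     # does not raise
```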
In my opinion the default behaviour should be to only add a NaN indicator column if NaN values are actually present. I would actually consider this an implementation or design bug for like 99% of use-cases. But at bare minimum, this undesirable and unexpected behaviour should be documented, with some reasoning.
Suggested fix for documentation
Just change the docs to: