scikit-learn-contrib / category_encoders

A library of sklearn compatible categorical variable encoders
http://contrib.scikit-learn.org/category_encoders/
BSD 3-Clause "New" or "Revised" License

PolynomialWrapper retains old values from previous calls to fit #313

Closed: stepthom closed this issue 1 year ago

stepthom commented 3 years ago

I think I've found a subtle bug in PolynomialWrapper.

The expected behaviour is that if there are n classes in the target column, then PolynomialWrapper will return n-1 new features for a given categorical feature.

And it works most of the time.

But if, for example, fit is called more than once on the same PolynomialWrapper instance (say, on different samples or folds of the data), then PolynomialWrapper might sometimes return n new features (instead of n-1), as sketched in the example below.
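For concreteness, here is a minimal sketch of the kind of refit that can trigger this. The data and column names are made up, and because the label ordering is nondeterministic, the extra column may or may not appear on any given run:

```python
import pandas as pd
import category_encoders as ce
from category_encoders.wrapper import PolynomialWrapper

# One categorical feature, a 3-class target (n = 3).
X1 = pd.DataFrame({"cat": ["a", "b", "c", "a", "b", "c"]})
y1 = pd.Series(["x", "y", "z", "x", "y", "z"])

wrapper = PolynomialWrapper(ce.TargetEncoder())
wrapper.fit(X1, y1)
print(wrapper.transform(X1).shape[1])  # n - 1 = 2 columns, as expected

# Refit the *same* wrapper on data where the classes first appear in a
# different order; a stray per-class encoder can survive the refit.
X2 = pd.DataFrame({"cat": ["a", "c", "b", "c", "a", "b"]})
y2 = pd.Series(["z", "x", "y", "y", "z", "x"])
wrapper.fit(X2, y2)
print(wrapper.transform(X2).shape[1])  # may print 3 instead of 2
```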

Why? I think what is going on is that the PolynomialWrapper object's data structures are not being "reset" at all between calls to fit. In particular, feature_encoders is a dict:

https://github.com/scikit-learn-contrib/category_encoders/blob/a810a4b7abfce9fc4eb7fc401e3d37f2c1c6e402/category_encoders/wrapper.py#L49

but it never gets "cleared out" between calls to fit. Normally, that's not an issue, because the feature encoders stored in feature_encoders will be overwritten in a new call to fit:

https://github.com/scikit-learn-contrib/category_encoders/blob/a810a4b7abfce9fc4eb7fc401e3d37f2c1c6e402/category_encoders/wrapper.py#L67

But during some intense debugging, I noticed that the OneHotEncoder (OHE) a few lines above:

https://github.com/scikit-learn-contrib/category_encoders/blob/a810a4b7abfce9fc4eb7fc401e3d37f2c1c6e402/category_encoders/wrapper.py#L59-L60

will sometimes return labels in a different order, so when the last label is dropped:

https://github.com/scikit-learn-contrib/category_encoders/blob/a810a4b7abfce9fc4eb7fc401e3d37f2c1c6e402/category_encoders/wrapper.py#L63

a different label might be dropped than in a previous call to fit. The effect is that the feature_encoders dict is left with a "stray" feature encoder from the earlier fit that was never removed or overwritten, so transform will add n features instead of n-1. Worse, that stray feature encoder was fitted on different data, so its output is effectively meaningless/incorrect.
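The failure mode can be seen in miniature with a plain dict, independent of the encoders themselves (the label names here are made up): re-assigning into feature_encoders only overwrites keys that reappear, so a key that was kept in the first fit but dropped in the second survives as a stray entry.

```python
# First fit: the OHE happened to order the labels so that "x" was
# dropped, leaving per-class encoders for "y" and "z".
feature_encoders = {}
for label in ["y", "z"]:
    feature_encoders[label] = f"encoder_for_{label}"

# Second fit: the labels come back in a different order, "z" is dropped,
# and only "x" and "y" get (re)written. "z" is never cleared.
for label in ["x", "y"]:
    feature_encoders[label] = f"encoder_for_{label}"

print(sorted(feature_encoders))  # ['x', 'y', 'z'] -- 3 encoders, 'z' is stale
```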

In my particular use case, this bug was causing target leakage that was very difficult to detect.

stepthom commented 3 years ago

Note: a workaround is to add the following line just before line 63 in fit (and before the similar line in fit_transform):

labels = labels.reindex(sorted(labels.columns), axis=1)

This ensures that the label columns are always in the same order, so the same one is always dropped.
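A quick illustration of what the reindex does (with made-up label columns): sorting the columns makes the order, and hence the dropped label, deterministic across fits.

```python
import pandas as pd

# Two fits that produce the same label columns in different orders...
labels_fit1 = pd.DataFrame({"y": [1, 0], "z": [0, 1], "x": [0, 0]})
labels_fit2 = pd.DataFrame({"z": [1, 0], "x": [0, 1], "y": [0, 0]})

# ...are normalized to the same order by the workaround, so the same
# column would be dropped in both calls to fit.
for labels in (labels_fit1, labels_fit2):
    labels = labels.reindex(sorted(labels.columns), axis=1)
    print(list(labels.columns))  # ['x', 'y', 'z'] both times
```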

bmreiniger commented 1 year ago

@PaulWestenthanner @stepthom I haven't looked carefully, but the attributes feature_encoders and label_encoder should probably just be initialized at the top of fit instead of in __init__. The label_encoder gets entirely overridden anyway, but such a change would reset feature_encoders fully, as needed? (It would also be better convention to name them with trailing underscores, but that's probably not worth breaking backwards compatibility.)
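A minimal sketch of the suggested pattern (illustrative only, not the library's actual code; DummyEncoder and ResettingWrapper are made-up names): initializing the per-class state at the top of fit guarantees that nothing from a previous call survives a refit.

```python
import copy

class DummyEncoder:
    """Stand-in for a binary-target feature encoder."""
    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

class ResettingWrapper:
    def __init__(self, feature_encoder):
        self.feature_encoder = feature_encoder

    def fit(self, X, labels_by_class):
        # Re-initialized on every fit, so encoders from a previous call
        # cannot leak into this one. The trailing underscore follows the
        # sklearn convention for fitted attributes.
        self.feature_encoders_ = {}
        for class_name, label in labels_by_class.items():
            self.feature_encoders_[class_name] = copy.deepcopy(
                self.feature_encoder).fit(X, label)
        return self

w = ResettingWrapper(DummyEncoder())
w.fit(None, {"y": [1, 0], "z": [0, 1]})  # first fit: encoders for y, z
w.fit(None, {"x": [1, 0], "y": [0, 1]})  # refit: only x, y remain
print(sorted(w.feature_encoders_))       # ['x', 'y'] -- no stray 'z'
```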