Closed stepthom closed 1 year ago
Note: a workaround is to add the following line just before line 63 in `fit` (and to the similar line in `fit_transform`):

```python
labels = labels.reindex(sorted(labels.columns), axis=1)
```

This will ensure that the labels are always in the same order, and thus the same one will always be dropped.
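A minimal standalone sketch of why the reindex helps (plain pandas, not the wrapper's actual code; the `drop_last_label` helper and the categorical inputs are made up for illustration). One-hot column order can depend on the incoming data, so which label gets dropped varies; sorting the columns first makes the drop deterministic:

```python
import pandas as pd

def drop_last_label(y, sort_first):
    # One-hot encode the target; column order follows the data's dtype
    labels = pd.get_dummies(y)
    if sort_first:
        # The proposed workaround: fix the column order before dropping
        labels = labels.reindex(sorted(labels.columns), axis=1)
    # Drop the last label column, as wrapper.py does, and report what's kept
    return labels.iloc[:, :-1].columns.tolist()

# Simulate order instability with categorical dtypes whose category
# order differs between two "fits" on equivalent data
y1 = pd.Series(pd.Categorical(["a", "b", "c"], categories=["a", "b", "c"]))
y2 = pd.Series(pd.Categorical(["a", "b", "c"], categories=["c", "b", "a"]))

print(drop_last_label(y1, sort_first=False))  # ['a', 'b']
print(drop_last_label(y2, sort_first=False))  # ['c', 'b'] -- different label dropped
print(drop_last_label(y1, sort_first=True))   # ['a', 'b']
print(drop_last_label(y2, sort_first=True))   # ['a', 'b'] -- deterministic
```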
@PaulWestenthanner @stepthom I haven't looked carefully, but the attributes `feature_encoders` and `label_encoder` should probably just be initialized at the top of `fit` instead of in `__init__`. The `label_encoder` gets entirely overridden anyway, but such a change would fully reset `feature_encoders`, as needed. (It would also be better convention to name them with trailing underscores, but that's probably not worth breaking backwards compatibility.)
I think I've found a subtle bug in PolynomialWrapper.
The expected behaviour is that if there are n classes in the target column, then PolynomialWrapper will return n-1 new features for a given categorical feature, and it works that way most of the time. But if, for example, the same wrapper object is reused (i.e., `fit` and `transform` get called more than once), then PolynomialWrapper might sometimes return n new features (instead of n-1).
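A hand-rolled sketch of the n-1 scheme as I understand it from the code (plain pandas, assumptions: one-hot encode the target, drop the last label column, fit one encoder per remaining label):

```python
import pandas as pd

y = pd.Series(["red", "green", "blue", "red"])  # n = 3 target classes
labels = pd.get_dummies(y)   # 3 one-hot label columns: blue, green, red
labels = labels.iloc[:, :-1] # drop the last label -> n - 1 columns remain
print(list(labels.columns))  # the labels that get their own encoders
print(labels.shape[1])       # 2, i.e. n - 1 new features per input column
```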
Why? I think what is going on is that the PolynomialWrapper object's data structures are not being "reset" at all between calls to `fit`. In particular, `feature_encoders` is a dict:

https://github.com/scikit-learn-contrib/category_encoders/blob/a810a4b7abfce9fc4eb7fc401e3d37f2c1c6e402/category_encoders/wrapper.py#L49

but it never gets "cleared out" between calls to `fit`. Normally, that's not an issue, because the feature encoders stored in `feature_encoders` will be overwritten in a new call to `fit`:

https://github.com/scikit-learn-contrib/category_encoders/blob/a810a4b7abfce9fc4eb7fc401e3d37f2c1c6e402/category_encoders/wrapper.py#L67
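A toy sketch of why refitting normally hides the missing reset (the `fake_fit` helper is invented): `feature_encoders` is a plain dict keyed by label, so a second fit that keeps the same labels overwrites every entry.

```python
feature_encoders = {}

def fake_fit(kept_labels, fit_id):
    # Hypothetical stand-in for training one encoder per kept label
    for label in kept_labels:
        feature_encoders[label] = f"encoder from fit #{fit_id}"

fake_fit(["a", "b"], fit_id=1)
fake_fit(["a", "b"], fit_id=2)  # same labels kept -> clean overwrite
print(feature_encoders)         # every value now comes from fit #2
```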
But during some intense debugging I noticed that the OHE a few lines above:

https://github.com/scikit-learn-contrib/category_encoders/blob/a810a4b7abfce9fc4eb7fc401e3d37f2c1c6e402/category_encoders/wrapper.py#L59-L60

will sometimes return the labels in a different order, so when the last label is dropped:

https://github.com/scikit-learn-contrib/category_encoders/blob/a810a4b7abfce9fc4eb7fc401e3d37f2c1c6e402/category_encoders/wrapper.py#L63

a different label might be dropped than in a previous call to `fit`. The effect is that the `feature_encoders` dict will have a "stray" feature encoder left over from before that was never removed or overwritten, and `transform` will then add n features. Worse, that stray feature encoder was created from some other data and is effectively meaningless/incorrect.

In my particular use case, this bug was causing target leakage that was very difficult to detect.
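An end-to-end toy sketch of the failure mode described above (`fake_fit`/`fake_transform` are invented stand-ins, not the wrapper's code): when the two fits drop a different label, the dict keeps a stray entry and transform emits n columns instead of n-1.

```python
feature_encoders = {}

def fake_fit(kept_labels):
    # Train one (pretend) encoder per label that survived the drop
    for label in kept_labels:
        feature_encoders[label] = f"encoder for {label}"

def fake_transform(col):
    # One output column per stored encoder, mirroring the wrapper's loop
    return [f"{col}_{label}" for label in sorted(feature_encoders)]

fake_fit(["a", "b"])            # first fit: label "c" was dropped
fake_fit(["b", "c"])            # second fit: label "a" was dropped instead
print(fake_transform("color"))  # 3 columns -- the stray 'a' encoder leaks in
```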