scikit-learn-contrib / category_encoders

A library of sklearn compatible categorical variable encoders
http://contrib.scikit-learn.org/category_encoders/
BSD 3-Clause "New" or "Revised" License

Combining with set_output can produce errors #437

Closed JulienRoussel77 closed 3 months ago

JulienRoussel77 commented 3 months ago

Expected Behavior

Calling set_output(transform="pandas") should have no impact on a OneHotEncoder, since its outputs are already dataframes.

Actual Behavior

The OneHotEncoder behaves inconsistently: a second fit_transform call on the same instance can raise an error.

Steps to Reproduce the Problem

import pandas as pd
from category_encoders import OneHotEncoder

df = pd.DataFrame({"C1": ["a", "a"], "C2": ["c", "d"]})
ohe = OneHotEncoder().set_output(transform="pandas")
ohe.fit_transform(df.iloc[:1])  # commenting this line out avoids the error
ohe.fit_transform(df.iloc[:2])  # this line raises the ValueError below

ValueError: Length mismatch: Expected axis has 3 elements, new values have 2 elements

Specifications

pandas: 2.0.1
sklearn: 1.3.2
category_encoders: 2.6.3

bmreiniger commented 3 months ago

I think I see what's happening. In BaseEncoder.fit, https://github.com/scikit-learn-contrib/category_encoders/blob/06e46db07ff15c362d7de5ad225bb0bffc245369/category_encoders/utils.py#L323-L324 we set feature_names_out_ from the columns of the encoder's transformed output (L324). But when sklearn is set to produce pandas output, at L323 it uses that attribute for the column names if one is already set. On a second fit, that means it picks up the stale attribute from the previous fit. Depending on how sklearn picks its column names, we may be able to just delete the attribute somewhere earlier in fit. This probably affects all the encoders, not just OHE.
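To make the suspected mechanism concrete, here is a minimal pure-Python sketch. ToyOneHot is a hypothetical stand-in, not category_encoders or sklearn code, and it assumes the pandas-output wrapper simply prefers a stored feature_names_out_ over freshly computed names. Under that assumption, a second fit that yields more columns than the first fails with the same kind of length mismatch as in the report.

```python
class ToyOneHot:
    """Toy stand-in for a one-hot encoder; NOT category_encoders code."""

    def __init__(self):
        self.feature_names_out_ = None

    def fit(self, data):
        # Learn the sorted categories of every column.
        self.categories_ = {col: sorted(set(vals)) for col, vals in data.items()}
        labelled = self.transform(data)           # plays the role of L323
        self.feature_names_out_ = list(labelled)  # plays the role of L324
        return self

    def transform(self, data):
        pairs = [(col, v) for col, cats in self.categories_.items() for v in cats]
        columns = [[int(x == v) for x in data[col]] for col, v in pairs]
        fresh = [f"{col}_{v}" for col, v in pairs]
        # Assumed wrapper behaviour: reuse stored names whenever they exist.
        stored = self.feature_names_out_
        labels = stored if stored is not None else fresh
        if len(labels) != len(columns):
            raise ValueError(
                f"Length mismatch: Expected axis has {len(columns)} elements, "
                f"new values have {len(labels)} elements"
            )
        return dict(zip(labels, columns))


enc = ToyOneHot()
enc.fit({"C1": ["a"], "C2": ["c"]})  # first fit stores two names
try:
    enc.fit({"C1": ["a", "a"], "C2": ["c", "d"]})  # now three columns
    message = None
except ValueError as err:
    message = str(err)
```

The first fit stores two names. The second fit produces three columns but still tries to label them with the two stale names, so it raises.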

Update: I tried it out. The error doesn't get thrown for most other encoders, because they don't change the number of columns; BinaryEncoder fails in the same way as OHE. But even for the other encoders, a second fit_transform on a dataframe with different column names ends up using the old column names. Just adding self.feature_names_out_ = None before L323 above fixes the problem in all cases; PR to come.
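The quieter variant of the bug, and the effect of the reset, can be sketched with another toy. ToyOrdinal and its reset_names_in_fit flag are hypothetical names invented for illustration, not library code, and the same assumption applies: the output wrapper reuses a stored feature_names_out_ when one exists. Because the column count never changes here, the stale names are applied without any error unless fit clears the attribute first.

```python
class ToyOrdinal:
    """Toy encoder that keeps the column count; NOT category_encoders code."""

    def __init__(self, reset_names_in_fit):
        self.reset_names_in_fit = reset_names_in_fit  # hypothetical switch
        self.feature_names_out_ = None

    def fit(self, data):
        # Map each distinct value of a column to an integer code.
        self.codes_ = {
            col: {v: i + 1 for i, v in enumerate(sorted(set(vals)))}
            for col, vals in data.items()
        }
        if self.reset_names_in_fit:
            self.feature_names_out_ = None  # the one-line reset described above
        labelled = self.transform(data)
        self.feature_names_out_ = list(labelled)
        return self

    def transform(self, data):
        columns = [[self.codes_[col][x] for x in vals] for col, vals in data.items()]
        # Assumed wrapper behaviour: reuse stored names whenever they exist.
        stored = self.feature_names_out_
        labels = stored if stored is not None else list(data)
        return dict(zip(labels, columns))


stale = ToyOrdinal(reset_names_in_fit=False)
stale.fit({"C1": ["a", "b"]})
stale.fit({"D1": ["a", "b"]})  # refit on a renamed column

fixed = ToyOrdinal(reset_names_in_fit=True)
fixed.fit({"C1": ["a", "b"]})
fixed.fit({"D1": ["a", "b"]})
```

After both refits, stale.feature_names_out_ still holds the name from the first fit, while fixed.feature_names_out_ reflects the renamed column.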