JulienRoussel77 closed this issue 7 months ago
I think I see what's happening. In BaseEncoder.fit (https://github.com/scikit-learn-contrib/category_encoders/blob/06e46db07ff15c362d7de5ad225bb0bffc245369/category_encoders/utils.py#L323-L324) we set feature_names_out_ by looking at the encoder's transformed output (L324). But when sklearn is set to produce pandas output, it uses that attribute, if available, to name the columns of the transform result at L323. On a second fit the attribute still holds the names from the previous fit, so the old names get used. Depending on how sklearn picks its column names, we may be able to just delete the attribute somewhere earlier in fit. This probably affects all the encoders, too, not just OHE.
Update: I tried it out, and the error doesn't get thrown for most other encoders, because they don't change the number of columns; BinaryEncoder similarly fails. But even for other encoders, making the second fit_transform on a dataframe with new column names ends up using the old column names. Adding just self.feature_names_out_ = None before L323 above fixes the problem in all cases; PR to come.
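
To make the mechanism concrete, here is a minimal pure-Python sketch of the stale-attribute problem and the proposed reset. It is a toy model, not the category_encoders or sklearn code: Frame, relabel, ToyOneHot and the reset_names_in_fit flag are all hypothetical names invented for illustration, and relabel only stands in for the column renaming that sklearn's pandas output wrapper performs.

```python
class Frame:
    """Tiny stand-in for a DataFrame: column labels plus rows (hypothetical)."""

    def __init__(self, columns, rows):
        self.columns = list(columns)
        self.rows = [list(r) for r in rows]


def relabel(frame, names):
    """Stand-in for the output wrapper: overwrite labels, checking the length."""
    if len(names) != len(frame.columns):
        raise ValueError(
            f"Length mismatch: Expected axis has {len(frame.columns)} elements, "
            f"new values have {len(names)} elements"
        )
    return Frame(names, frame.rows)


class ToyOneHot:
    """Toy one-hot encoder for a single column (not the real OneHotEncoder)."""

    def __init__(self, reset_names_in_fit):
        # Hypothetical switch: True applies the proposed one-line reset.
        self.reset_names_in_fit = reset_names_in_fit

    def _encode(self, frame):
        col = frame.columns[0]
        names = [f"{col}_{i + 1}" for i in range(len(self.categories_))]
        rows = [[int(r[0] == c) for c in self.categories_] for r in frame.rows]
        return Frame(names, rows)

    def transform(self, frame):
        out = self._encode(frame)
        # The wrapper reuses feature_names_out_ whenever it is available.
        names = getattr(self, "feature_names_out_", None)
        if names is not None:
            out = relabel(out, names)
        return out

    def fit(self, frame):
        self.categories_ = sorted({r[0] for r in frame.rows})
        if self.reset_names_in_fit:
            # Proposed fix: forget the names left over from a previous fit.
            self.feature_names_out_ = None
        # Without the reset, this transform call sees the stale names.
        self.feature_names_out_ = self.transform(frame).columns
        return self

    def fit_transform(self, frame):
        return self.fit(frame).transform(frame)


two = Frame(["color"], [["red"], ["blue"]])
three = Frame(["shade"], [["red"], ["blue"], ["green"]])

buggy = ToyOneHot(reset_names_in_fit=False)
buggy.fit_transform(two)
try:
    buggy.fit_transform(three)
    buggy_error = None
except ValueError as exc:
    buggy_error = str(exc)

fixed = ToyOneHot(reset_names_in_fit=True)
fixed.fit_transform(two)
fixed_columns = fixed.fit_transform(three).columns

print(buggy_error)
print(fixed_columns)
```

In this toy, the second fit_transform without the reset fails with a length mismatch because two stale names are applied to three new columns, while the variant that clears the attribute first derives fresh names from the new data.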
Expected Behavior
Calling set_output(transform="pandas") should have no impact on a OneHotEncoder, since the outputs are already dataframes.
Actual Behavior
The OneHotEncoder behaves inconsistently, raising errors on subsequent fit_transform calls.
Steps to Reproduce the Problem
ValueError: Length mismatch: Expected axis has 3 elements, new values have 2 elements
Specifications
pandas : 2.0.1
sklearn : 1.3.2
category_encoders : 2.6.3