Combining with set_output can produce errors

scikit-learn-contrib / category_encoders

A library of sklearn compatible categorical variable encoders

BSD 3-Clause "New" or "Revised" License

Steps to Reproduce the Problem

    import pandas as pd
    from category_encoders import OneHotEncoder

    df = pd.DataFrame({"C1": ["a", "a"], "C2": ["c", "d"]})
    ohe = OneHotEncoder().set_output(transform="pandas")
    ohe.fit_transform(df.iloc[:1])  # commenting this line avoids the error
    ohe.fit_transform(df.iloc[:2])  # this line produces an error

ValueError: Length mismatch: Expected axis has 3 elements, new values have 2 elements

I think I see what's happening. In BaseEncoder.fit, https://github.com/scikit-learn-contrib/category_encoders/blob/06e46db07ff15c362d7de5ad225bb0bffc245369/category_encoders/utils.py#L323-L324 we set feature_names_out_ from the columns of the encoder's transformed output (L324). But when sklearn is set to produce pandas output, the transform call at L323 uses that attribute for the column names if it is already available. On a second fit it therefore picks up the stale attribute left over from the first fit. Depending on how sklearn picks its column names, we may be able to just delete the attribute somewhere earlier in fit. This probably affects all the encoders, too, not just OHE.

Update: I tried it out, and the error doesn't get thrown for most other encoders, because they don't change the number of columns; BinaryEncoder fails the same way as OneHotEncoder. But even for the other encoders, a second fit_transform on a dataframe with new column names silently ends up using the old column names. Adding just self.feature_names_out_ = None before L323 above fixes the problem in all cases. PR to come.
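To make the mechanism concrete, here is a minimal pure-Python sketch of the stale-attribute problem and of the proposed reset. It is not category_encoders or sklearn code: ToyOneHot, its reset_before_transform flag, and the dict-of-lists "dataframe" are hypothetical stand-ins, and the relabelling inside transform only imitates (under my assumption about how it behaves) a pandas-output wrapper that reuses feature_names_out_ when it exists.

```python
class ToyOneHot:
    """Hypothetical stand-in for an encoder; NOT the library's implementation."""

    def __init__(self, reset_before_transform=False):
        # reset_before_transform is an illustrative flag, not a real parameter.
        self.reset_before_transform = reset_before_transform
        self.feature_names_out_ = None
        self.categories_ = {}

    def _encode(self, rows):
        # rows maps column name -> list of values (a toy "dataframe").
        pairs = [(col, cat) for col, cats in self.categories_.items() for cat in cats]
        names = [f"{col}_{cat}" for col, cat in pairs]
        n_rows = len(next(iter(rows.values())))
        table = [[int(rows[col][i] == cat) for col, cat in pairs] for i in range(n_rows)]
        return names, table

    def transform(self, rows):
        names, table = self._encode(rows)
        # Imitates an output wrapper that relabels columns with the stored
        # attribute whenever it is available (assumed behaviour).
        if self.feature_names_out_ is not None:
            if len(self.feature_names_out_) != len(names):
                raise ValueError(
                    f"Length mismatch: Expected axis has {len(names)} elements, "
                    f"new values have {len(self.feature_names_out_)} elements"
                )
            names = list(self.feature_names_out_)
        return names, table

    def fit(self, rows):
        self.categories_ = {col: sorted(set(vals)) for col, vals in rows.items()}
        if self.reset_before_transform:
            # The proposed fix: drop the stale names before transforming.
            self.feature_names_out_ = None
        names, _ = self.transform(rows)
        self.feature_names_out_ = names
        return self


first = {"C1": ["a"], "C2": ["c"]}
full = {"C1": ["a", "a"], "C2": ["c", "d"]}

buggy = ToyOneHot().fit(first)
try:
    buggy.fit(full)  # second fit sees the two names stored by the first fit
except ValueError as exc:
    print("without reset:", exc)

fixed = ToyOneHot(reset_before_transform=True).fit(first).fit(full)
print("with reset:", fixed.feature_names_out_)
```

The same toy also shows the quieter failure: fitting ToyOneHot() on {"X": ["a"]} and then on {"Y": ["a"]} raises nothing, but feature_names_out_ stays ["X_a"] unless the reset is enabled.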

Combining with set_output can produce errors #437
