lvgig / tubular

Python package implementing transformers for pre-processing steps in machine learning.
https://tubular.readthedocs.io/en/latest/index.html
BSD 3-Clause "New" or "Revised" License

[Bug]: OneHotEncodingTransformer broken with sklearn>1.2.1 #308

Open davidhopkinson26 opened 6 days ago

davidhopkinson26 commented 6 days ago

What happened?

When trying to transform using this transformer we get an error:

AttributeError: 'OneHotEncodingTransformer' object has no attribute '_drop_idx_after_grouping'

This attribute was added in sklearn > 1.2.1 and is set in `__init__` for OneHotEncoder. This should be fine, but in OneHotEncodingTransformer it looks like we do not actually persist the initialised object, instead calling:

In `__init__`:

# Set attributes for scikit-learn's OneHotEncoder
OneHotEncoder.__init__(
    self,
    sparse_output=False,
    handle_unknown="ignore",
    dtype=dtype,
    **kwargs,
)

In `fit`:

OneHotEncoder.fit(self, X=X[self.columns], y=y)

And in `transform`:

X_transformed = OneHotEncoder.transform(self, X[self.columns])

So the initialisation in this class's `__init__` method does not actually persist to the object called in transform. Up to now this has run, as just calling the transform method without any initialisation worked OK (however it may not have been functioning as expected!). However, with newer versions of sklearn the method looks for an attribute which does not exist, (a) because it is called within the wrong class, so `self` directs it to OneHotEncodingTransformer, and (b) because the OneHotEncoder has not been initialised.
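The failure mode described here can be sketched with a generic, hypothetical example (these are not tubular's actual classes): borrowing a base class's methods via unbound calls like `Base.method(self, ...)` means `self` only carries whatever attributes happened to be set on it, so any attribute a newer release of the base class starts relying on is simply absent.

```python
class Base:
    def __init__(self):
        # imagine this attribute was added in a newer release of the library
        self._new_attr = 0

    def use(self):
        return self._new_attr


class Child:
    def __init__(self):
        # Base.__init__ is never run, so _new_attr is never set on instances
        pass


c = Child()
try:
    # unbound call: Base's method runs with `self` pointing at a Child
    Base.use(c)
except AttributeError as e:
    print(e)  # 'Child' object has no attribute '_new_attr'
```

The call only works if `Child` happens to have run the exact initialisation `Base.use` depends on, which is why this pattern silently breaks when the base class changes.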

To fix this we should assign an instance of OneHotEncoder to an attribute of OneHotEncodingTransformer, say `_encoder`, and persist this through `__init__`, `fit` and `transform`.

Environment

sklearn 1.5.2
tubular 1.3.1

Minimum reproducible code

x = OneHotEncodingTransformer()

x.transform(X)

Relevant error output

`AttributeError: 'OneHotEncodingTransformer' object has no attribute '_drop_idx_after_grouping'`


davidhopkinson26 commented 6 days ago

The way it is now relies on the calls to the `__init__` and `fit` methods of OneHotEncoder setting attributes on the OneHotEncodingTransformer. This is a very unsafe way to do things (as we have just found out!) but has probably been working sort of OK up to now. I am not quite sure why the call to `__init__` is not setting the `_drop_idx_after_grouping` attribute, but I still expect that switching to initialising and persisting a OneHotEncoder object will be a safer way for this to work.