scikit-learn-contrib / category_encoders

A library of sklearn compatible categorical variable encoders
http://contrib.scikit-learn.org/category_encoders/
BSD 3-Clause "New" or "Revised" License

migrate from get_feature_names to get_feature_names_out #348

Closed iuiu34 closed 1 year ago

iuiu34 commented 2 years ago

scikit-learn 1.0.0 moved from get_feature_names to get_feature_names_out (link)

Is category_encoders planning to support this migration?

kylegilde commented 2 years ago

+1, I hope so!

PaulWestenthanner commented 2 years ago

That makes sense. I'll deal with it.

nilslacroix commented 2 years ago

Is this being addressed, or do you know a workaround for the problem?

iuiu34 commented 2 years ago

The proper solution is to define get_feature_names_out natively. @PaulWestenthanner is working on this (if not, I'm happy to do a PR).

Meanwhile, as a temporary solution, you can use a wrapper:

import numpy as np
from category_encoders.count import CountEncoder as CountEncoder_
from sklearn.utils.validation import _check_feature_names_in


class CountEncoder(CountEncoder_):
    def get_feature_names_out(self, input_features=None):
        """Get feature names out."""
        # check_is_fitted(self)
        # Validate/resolve the input feature names via sklearn's helper.
        input_features = _check_feature_names_in(self, input_features)
        # One output column per input column, renamed to mark the encoded value.
        output_features = [f"{input_feature}_as_float".upper()
                           for input_feature in input_features]
        return np.asarray(output_features, dtype=object)
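
A quick usage sketch of the wrapper (the DataFrame and the city column are made up for illustration; passing input_features explicitly avoids relying on feature_names_in_ being set by the encoder):

import pandas as pd

df = pd.DataFrame({"city": ["a", "b", "a", "c"]})
enc = CountEncoder(cols=["city"]).fit(df)
# Pass the input columns explicitly; _check_feature_names_in validates them.
print(enc.get_feature_names_out(input_features=["city"]))
# -> array(['CITY_AS_FLOAT'], dtype=object)
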
nilslacroix commented 2 years ago

Thanks for the effort. This works for my target encoder, but it fails for the One Hot and Binary encoders. For example, when I encode one column with LeaveOneOutEncoder() and use get_feature_names_out() I get:

target__MEANPRICEPOSTCODE_AS_FLOAT

But when I use the same code for the Binary encoder I get:

'enc_plz__POSTCODE_AS_FLOAT'

But there should actually be as many of these as there are output columns, right? Like enc_plz__1, enc_plz__2, etc. I guess this is related to the fact that the target encoder only returns one column, whereas the binary and one-hot encoders create multiple new columns from one feature. Unfortunately I don't know how this works internally in category_encoders.

nilslacroix commented 2 years ago

Okay, I just took a look at the code of category_encoders and changed your class to:

import numpy as np
from category_encoders import BinaryEncoder as BE


class BinaryEncoder(BE):
    def get_feature_names_out(self, input_features=None):
        """Get feature names out."""
        # The encoder already tracks its output columns, so reuse them.
        features = self.get_feature_names()
        return np.asarray(features, dtype=object)

Do you think this is right? Sorry, I am no Python pro; I'm even wondering why this works and I do not have to use super() to get the function.
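
For what it's worth, a rough check of the wrapper above (the postcode data is invented, and the exact output names depend on how the encoder labels its binary columns):

import pandas as pd

df = pd.DataFrame({"postcode": ["10115", "20095", "80331", "10115"]})
enc = BinaryEncoder(cols=["postcode"]).fit(df)
# get_feature_names() returns every output column, so multi-column encoders work too.
print(enc.get_feature_names_out())
# -> e.g. array(['postcode_0', 'postcode_1'], dtype=object)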

PaulWestenthanner commented 2 years ago

@iuiu34 I haven't had time to look at it yet (probably not before next week). If you want to create a PR for this please go for it.
What exactly needs to be changed? Should we introduce a feature_names_in_ attribute as well as rename get_feature_names to get_feature_names_out? I'm also not quite sure why you would convert the feature to upper case and append the _AS_FLOAT suffix.

iuiu34 commented 2 years ago

for the PR

  1. Shall we deprecate get_feature_names entirely, or put a deprecation warning on it instead?
  2. Shall we move from feature_names to feature_names_in_? If yes, can we declare feature_names_in_ only in fit and deprecate the None declaration in __init__? As in sklearn, all attributes that end with _ have to be declared in fit.

PaulWestenthanner commented 2 years ago

For 1., I think it would make most sense to keep the get_feature_names function but flag it with a deprecation warning and have it just call the new get_feature_names_out rather than implement the logic.
For 2., the thing that is called feature_names at the moment is the output features, not the input features. I do agree, however, that we should follow the sklearn way here and also save feature_names_in_ somewhere. Regarding the trailing underscore for attributes declared in fit, I agree.
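
Something along these lines, as a sketch (BaseEncoder here is just a stand-in for the shared encoder base class, and the message wording is illustrative):

import warnings

import numpy as np


class BaseEncoder:  # stand-in for the shared base class, not the real one
    def get_feature_names_out(self, input_features=None):
        """New canonical method: return the output column names determined during fit."""
        return np.asarray(self.feature_names, dtype=object)

    def get_feature_names(self):
        """Deprecated alias that delegates to get_feature_names_out."""
        warnings.warn(
            "get_feature_names is deprecated, use get_feature_names_out instead.",
            DeprecationWarning,
        )
        return self.get_feature_names_out()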

iuiu34 commented 1 year ago

I haven't had time to work on this. Has anyone else done any work on it?

PaulWestenthanner commented 1 year ago

I just created something; please review it, whoever wants to. My draft does not add feature_names_in_ though.

iuiu34 commented 1 year ago

Yep, as you mention above, that's a good quick fix for get_feature_names_out.

But in the future, in another release, it would be nice to have feature_names_in_ too.

PaulWestenthanner commented 1 year ago

I just added the get_feature_names_in function.

There is, however, a rather nasty problem:
The self.cols attribute is the same as feature_names_in_, but we cannot just name it like this, because a sklearn estimator requires that you have an attribute xyz if xyz is a parameter in __init__. I guess this is used e.g. for cross validation. Renaming cols in __init__ is a major breaking change that I wouldn't want to make now.
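
A small illustration of that constraint (DemoEncoder is a toy class, not part of the library): get_params reads attributes with the same names as the __init__ parameters, and clone rebuilds the estimator from them, e.g. during cross-validation.

from sklearn.base import BaseEstimator, clone


class DemoEncoder(BaseEstimator):
    def __init__(self, cols=None):
        # The attribute must keep the parameter's name, otherwise
        # get_params/clone (and hence cross-validation) break.
        self.cols = cols


enc = DemoEncoder(cols=["postcode"])
print(enc.get_params())  # {'cols': ['postcode']}
print(clone(enc).cols)   # ['postcode']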

jawsem commented 1 year ago

Is the self.cols attribute actually the same as feature_names_in_?

When I used the old feature_names previously:

self.cols = the features you are encoding
self.feature_names = all the features fed to the encoder, regardless of whether you are doing encoding or not

For compatibility with scikit-learn pipelines, I think it would be useful to have a feature_names_in_ attribute that gives you all the features fed to the encoder, like feature_names did before.

PaulWestenthanner commented 1 year ago

self.cols = the features you are encoding
self.feature_names = all the features fed to the encoder, regardless of whether you are doing encoding or not

You're right, I missed this. However, self.feature_names is all the columns of the data frame after transform, so it is the feature names out. But then we need to introduce feature_names_in_ as a separate attribute anyway and fill it in the fit step. This should be rather straightforward then. Thanks for pointing out the difference!
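
To summarise the distinction with a toy example (the data is invented, and feature_names_in_ is the attribute proposed here, not yet in the library):

import pandas as pd
from category_encoders import BinaryEncoder

df = pd.DataFrame({"postcode": ["10115", "20095"], "price": [1.0, 2.0]})
enc = BinaryEncoder(cols=["postcode"]).fit(df)

# enc.cols          -> ['postcode']                                 columns being encoded
# enc.feature_names -> e.g. ['postcode_0', 'postcode_1', 'price']   columns after transform (feature names out)
# feature_names_in_ -> would be ['postcode', 'price']               all columns fed to fit (to be added)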