scikit-learn-contrib / category_encoders

A library of sklearn compatible categorical variable encoders
http://contrib.scikit-learn.org/category_encoders/
BSD 3-Clause "New" or "Revised" License

sklearn compatibility with feature_names_out #395

Closed PaulWestenthanner closed 1 year ago

PaulWestenthanner commented 1 year ago

Expected Behavior

The behaviour of get_feature_names_out should be the same as in sklearn.

Actual Behavior

At the moment feature_names_out_ attribute is set in fit and returned by get_feature_names_out function. In sklearn the get_feature_names_out function does not just return the values set in fit, but determines the output feature names from given input feature names if possible.

Examples

I'd like to discuss this idea a little with the community. Please add some pros and cons of this approach; I'm sure I'm missing something.

Pros

Cons

RNarayan73 commented 1 year ago

I agree that aligning with the updated sklearn version 1.2 would be very helpful.

It goes without saying that when using pipelines with multiple transformers, it is useful to end up with a df whose features are traceable to the input features.

The main benefit occurs downstream when the transformed dfs are passed on to predictors. For example, I currently have a situation where both XGBoost and LightGBM can take categorical features directly as input without having to encode them. They rely on the categorical dtype of the column in the df to identify such features. Without a df output from the transformers in the pipeline with properly labelled columns, it is unnecessarily complicated, if not virtually impossible, to change the dtype to categorical within the pipeline before it goes on to the predictor.
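For context, the dtype-based detection mentioned above relies on pandas' categorical dtype; a minimal sketch (no LightGBM/XGBoost needed to run it):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"]})
# Mark the column as categorical so downstream predictors can detect it:
df["color"] = df["color"].astype("category")
print(df["color"].dtype)  # category
```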

I was very excited by this new feature in sklearn 1.2, hoping it would deal with this, not realising that third-party transformers also have to play ball and support the API.

I therefore look forward to this being implemented soon!

Narayan

JaimeArboleda commented 1 year ago

Yes, I agree with all the comments. In fact, I am playing with the TargetEncoder and found that it causes problems when integrated in a sklearn Pipeline or ColumnTransformer, because get_feature_names_out will be called with a parameter and you get this kind of error:

BaseEncoder.get_feature_names_out() takes 1 positional argument but 2 were given
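The mismatch can be reproduced without category_encoders at all; the sketch below uses a hypothetical stand-in class (not the real BaseEncoder) to show why passing input_features fails against a no-argument signature:

```python
# Hypothetical stand-in, not the real category_encoders BaseEncoder:
class BaseEncoderLike:
    def get_feature_names_out(self):  # accepts no input_features argument
        return ["col_0"]

enc = BaseEncoderLike()
try:
    # sklearn's Pipeline/ColumnTransformer calls it with input_features:
    enc.get_feature_names_out(["col_0"])
except TypeError as err:
    print(err)  # ...takes 1 positional argument but 2 were given
```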

I will try to submit a PR if I find a simple solution that does not break anything.

JaimeArboleda commented 1 year ago

After a more careful review, I think this is more work than expected. The reason is that, if you include a category encoder in a pipeline anywhere but the initial position, the input it receives is a numpy ndarray, not a pandas dataframe. The column names therefore won't exist (they default to integers from 0 to n-1), and that is not good behaviour.

That's why you cannot rely on the names of the inputs you see during fitting, because they might not exist!
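A minimal sketch of that failure mode, using only sklearn and pandas: an upstream step emits a bare ndarray, so by the time a downstream encoder sees the data the original column names are gone.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"color": ["red", np.nan], "size": ["S", "M"]})

# A typical upstream step returns a bare numpy array by default:
step_out = SimpleImputer(strategy="constant", fill_value="missing").fit_transform(df)
print(type(step_out).__name__)  # ndarray -- "color"/"size" are lost

# Rebuilding a DataFrame from it falls back to integer column labels:
print(pd.DataFrame(step_out).columns.tolist())  # [0, 1]
```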

I think that, unfortunately, each category encoder should define its own method for this, because depending on the EncodingRelation they will have different behaviour. So that's much more work than expected...

Well, in the case of ONE_TO_ONE relationships it will be straightforward.

I have made a PR with the partial fix, to see if the main contributors of the project like the approach and we can continue the work.

PaulWestenthanner commented 1 year ago

Hi Jaime,

thanks for your work. When exactly does the pipeline example fail? I think we have tests for fit and transform methods using pipelines. But you call pipeline.get_feature_names_out to get the error, right?

I agree that it is a lot of work. Basically, at the moment output feature names are calculated on the fly when fitting, whereas in sklearn the logic to determine output feature names lives in the get_feature_names_out function. Hence we need to look at every single encoder and understand what it does to obtain the output column names.

I think that, unfortunately, each category encoder should define its own method for this

This is exactly the point, and also the way sklearn does it. But still I think it is worth the effort and as a maintainer I'd love to see it implemented. Would you be willing to invest more time? I tagged this issue as a good-first-issue not because it is easy (yet do-able) but working on it will give a good overview of the project and will enable a lot of future commits.

JaimeArboleda commented 1 year ago

Thanks a lot!! Yes, I will be willing to invest more time, but I'd like to share my ideas with you so that we agree on how to tackle this first...

I will share with you a google colab notebook to explain more carefully the issue.

JaimeArboleda commented 1 year ago

Ok, here is the colab notebook explaining the issue and a path to a solution (I hope it makes sense!).

https://colab.research.google.com/drive/1lJJvXyAOoG9MBtNiH2qSTrqbzZtEPTeB?usp=sharing

As a side comment, I think that in general the fact that inputs and outputs of sklearn objects can be either dataframes or np.arrays is very problematic and leads to lots of edge cases and workarounds... For me, keeping track of the columns and how they transform during preprocessing is very valuable, and it's a pity that we have this complex situation, which is why the "naive solution" cannot work.

PaulWestenthanner commented 1 year ago

Hi Jaime,

I think this is going into the right direction:
Basically every encoder needs to implement the logic to get output features, there is no way around this.
It makes sense to look at how sklearn is handling this:

Here's a list of encoders and how the output features will look:

Does this make it clear? Does it make sense to you? Did I miss anything?

JaimeArboleda commented 1 year ago

Hello Paul!

Thanks a lot. I agree it makes a lot of sense to follow the same path as sklearn. I've been taking a look and there are new things that I did not know. For example, they have created an API called set_output that allows developers of estimators and transformers to output pandas dataframes based on some configuration. I guess it would be great to follow this API, do you agree? See here

I will start by creating more tests for feature names, at least for the following cases (for all encoders):

And then I will try to implement all needed changes following how sklearn is doing it.

I have some questions:

  1. Do you think it makes sense to maintain the OneHotEncoder class? I mean, since sklearn has one too, do you see added value in the implementation provided by category_encoders? In case you want to keep it: do you think column names should be sorted by value? I discovered that they are not sorted by default. For example, if you have a dataframe with values ['A', 'B', 'C'] in a column named 'col', then the OneHotEncoder may create the dummy columns in an unsorted order, for example 'col_B', 'col_A', 'col_C'. I believe this is confusing, and sklearn sorts them.
  2. Do you think this code that is in the BaseEncoder.fit(X, y) should be kept?

    # for finding invariant columns transform without y (as is done on the test set)
    X_transformed = self.transform(X, override_return_df=True)
    self.feature_names_out_ = X_transformed.columns.tolist()

Maybe this was a workaround that we should try to fix?

PaulWestenthanner commented 1 year ago

Hi,

thanks for sharing the set_output docs. I think this is exactly what we need and should go for. Also the other Mixin mentioned there could be useful.
Regarding tests I agree that this is a good testing setup. I think we're missing tests for pipelines altogether, so we should really introduce them.
As for your questions:

  1. Good question. I think I'd like to keep it. Our implementation provides some features that the sklearn version does not have (i.e. handle_missing and handle_unknown = indicator, handle_unknown = return_nan, and also drop_invariant). The handle_unknown=indicator option might actually cause problems in get_feature_names_out since it adds columns at predict time. On the other hand, the sklearn version has the sparse option, which is a nice feature that we miss. So it might make sense to reconsider this in the future, but for the time being I'd keep it.
  2. No, this should probably be changed to use the get_feature_names_out function. But I think we cannot avoid the transform call (although it is a little ugly), since finding invariant columns requires a transform.
JaimeArboleda commented 1 year ago

Ok, good. Yes, I will try to follow as much as possible the sklearn path. So there are two things to adapt, that I will try to include in the PR:

Anyway, the second question is a technical point, I think I can start and we can adjust later. It's more important for me to be clear if we want to change the API (regarding the return_df input).

PaulWestenthanner commented 1 year ago

Perfect!
I agree that we should remove the return_df argument in the long run and replace it with set_output. However, since this is a breaking change and it is actually the default option, we can't replace it without a transition period. So for the time being I think we should introduce a future warning. A lot of users rely on getting a dataframe as output and they'd need to adjust on their end.
Currently we're on version 2.6.0. If we only release a future warning but keep return_df I'd release your PR as version 2.6.1, and then maybe in 2.7 or even 2.8 (depending on how frequent we'll release) we could drop the return_df. I'd like to give the users at least a couple of months to adopt.
Regarding tests, we're actually not using the tox file in our pipeline. Our pipeline just tests different versions of Python against the latest versions of pandas/numpy/sklearn, not all versions specified in the requirements file. So if we rely on set_output we'd basically need to bump the sklearn requirement to 1.2, which is rather new. This is a trade-off between being fast-paced and ensuring stability/compatibility, where as a maintainer I favour the latter.
Considering this, we should maybe just introduce a future warning for the return_df parameter and wait to replace the parameter.
Alternatively, we could change what the return_df parameter does: Instead of returning a df in the base encoder we could call self.set_output(transform="pandas") in the __init__ function of the encoder. This would be a compromise, keeping backwards compatibility (and the argument) but already removing any custom logic and hence aligning more with sklearn.

What do you think?

JaimeArboleda commented 1 year ago

Thank you. Yes, this could be a good idea...

However, to be honest, I am feeling less and less comfortable and confident with this PR. I started early this morning preparing the new test suites, following the order of the encoders, and I panicked when I discovered how entangled this library is with having pandas dataframes as inputs and outputs. I am not saying that I don't understand this rationale, but it's a big (very big) change from how sklearn treats inputs and outputs. In sklearn, inputs and outputs are expected to be either pandas dataframes or numpy ndarray. Both options are allowed. And this is reflected (among other things) in the parameters of the transformers. For example, for the OrdinalEncoder of sklearn, if you wish to specify the order in which you want to encode the categories, you can use this attribute:

categories: ‘auto’ or a list of array-like, default=’auto’
        Categories (unique values) per feature:
        ‘auto’ : Determine categories automatically from the training data.
        list : categories[i] holds the categories expected in the ith column. The passed categories should not mix strings and numeric values, and should be sorted in case of numeric values.

However, to achieve similar functionality with the OrdinalEncoder of category_encoders, you have to use the mapping parameter:

mapping: list of dicts
        a mapping of class to label to use for the encoding, optional.
        the dict contains the keys 'col' and 'mapping'.
        the value of 'col' should be the feature name.
        the value of 'mapping' should be a dictionary or pd.Series of 'original_label' to 'encoded_label'.
        example mapping: [
            {'col': 'col1', 'mapping': {None: 0, 'a': 1, 'b': 2}},
            {'col': 'col2', 'mapping': {None: 0, 'x': 1, 'y': 2}}
        ]

The big difference is that category_encoders uses the names of the columns instead of the order in which they appear, hence it implicitly expects a pandas dataframe as input. I know that if the input of fit is a numpy ndarray it's converted into a dataframe, but this is a hack that plays (IMHO) against the composability of the transformer. As shown in the Colab example, when the transformer is part of a Pipeline or a ColumnTransformer, if a previous transformer outputs a numpy ndarray, then this is what the fit method will receive. And the hack of converting X to a pandas dataframe will make the mapping incorrect and create subtle bugs when the encoder is part of a complex object. I have created an example showing that problem.
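A condensed sketch of that subtle bug (column name and values illustrative): once an upstream ndarray erases the names, a name-keyed mapping silently finds nothing to encode.

```python
import numpy as np
import pandas as pd

# A name-keyed mapping, in the style category_encoders' OrdinalEncoder expects:
mapping = [{"col": "col1", "mapping": {"a": 1, "b": 2}}]

# What a mid-pipeline encoder may actually receive from the previous step:
X = np.array([["a"], ["b"]])
df = pd.DataFrame(X)  # columns default to [0], not ["col1"]

# The name-based lookup silently finds nothing to encode:
print(mapping[0]["col"] in df.columns)  # False
```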

And it's not so easy to get around this problem because the mapping is not only a parameter that can be used by the user to specify how to map the categories to the new values, but it's also used quite a lot internally. Honestly I don't know how to get around this problem... I think this asks for a big redesign of the whole library and I don't feel capable of doing that.

The assumption of the input being a dataframe is everywhere!

So I think there are two possible paths:

  1. Redesign the whole library, removing the implicit assumption of X being a dataframe. This would be a big breaking change. In the long term it's maybe the best solution (for achieving the greatest compatibility with sklearn and the way it works) and it would give more clarity, but it would be a hard path. I don't feel capable of doing that.
  2. Use the new set_output API of sklearn to ensure that, even if the category encoder is part of a Pipeline, all estimators will output pandas dataframes. I am talking about doing this:

    In __init__.py:

    from sklearn import set_config
    set_config(transform_output="pandas")

    Well, before setting the config, we could retrieve the current config value and send a warning in case we are modifying it, with a message like: "To ensure full composability of category_encoders with other sklearn estimators and transformers, and to get predictable column names when using get_feature_names_out, the configuration of sklearn has been set to transform_output='pandas'." This might be ugly, but it would be the simplest solution given the current design.

    This solution has another side benefit for the current issue: I believe the naive solution I initially thought of will be enough if we set the config like that.

What do you think?

JaimeArboleda commented 1 year ago

I have tested the second solution and it works ok. All new tests pass without issues and only small modifications of the codebase were needed. If setting this configuration of sklearn is not that ugly, I think this solution could be very practical.

Please, check the PR

However, the old tests need plenty of adjustments, because sometimes they use plain lists as inputs (which causes a problem when building the pandas dataframe) or because they test return_df=False (which, if we follow this path, has no effect). I am trying to repair them, but it's taking more time than expected... So I believe it's better if I leave it like this and wait for your confirmation as a maintainer.

PaulWestenthanner commented 1 year ago

You're right, this library internally works with dataframes.
However I'd like to keep the behaviour that you can input arrays and lists and also not necessarily get a df back (return_df=False).
I've just been thinking about column transformers: you need them if you only want to modify some columns and leave others untouched. This is basically what all encoders do! So I do not see a use case where it makes sense to use a ColumnTransformer over integrating the encoder in a pipeline.
In case of using Pipelines you can manually set the previous step to output dataframes

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from category_encoders import OrdinalEncoder

categorical_preprocessor = Pipeline(
    steps=[
        ("imputation_constant", SimpleImputer(fill_value="missing", strategy="constant").set_output(transform='pandas')),
        ("ordinal", OrdinalEncoder(mapping=[{'col': 'col_a', 'mapping': {'a': 2, 'b': 1, 'c': 3}}])),
    ]
)

This is not very pretty but works. However, if we can make the existing approach work, I think we should do that (and maybe add some docs on how to use the encoders with pipelines) rather than refactor all of the code, because we don't have (or at least I don't have) the resources for that.

Nevertheless, we can still add the get_feature_names_out functionality, right? Even if we do not achieve full compatibility with sklearn, it will be a step in the right direction.

Thanks for pointing all of this out. Great work so far! I'm also not super happy with the way it is ^^

JaimeArboleda commented 1 year ago

I see your points!

I think I would prefer a huge refactoring, but I don't have the time, knowledge, or energy to do that.

But you are right: if there is a solution that, in order to work 100% smoothly, only involves setting the output in sklearn, we can leave category_encoders with minimal changes so that no exceptions are raised, and point out in the documentation how to ensure the feature names are right.

This is what I have done in the last commit that you'll see in the PR. I think it's a good tradeoff, and the changes are truly minimal. I like it much more than my previous work with forcing the configuration of sklearn in the __init__.py file (in fact, I was finding a lot of problems with the existing tests...).

This is a bit off topic, but I am very, very fond of Pipelines and ColumnTransformers. When I want to optimize hyperparameters, including hyperparameters of the preprocessing steps (and this library, which is fairly new to me, contains lots of hyperparameters you might want to play with), I like defining the whole preprocessing + model in a single object, and then using either RandomizedSearchCV or some method from skopt for finding the best hyperparameters. I even developed a small library for doing this procedure in a nested cross-validation scheme. If you are interested, this notebook from the library might give you an idea of why I find it useful to wrap the whole pipeline in a single complex object.

Going back to our work, please take a look at the PR and tell me if you are happy with it or if you want to modify something.

Thanks!

JaimeArboleda commented 1 year ago

Please check the last commit I just uploaded, 673c876: I fixed some tests that were failing (the main reason was that get_feature_names_out outputs a np.array, as in sklearn, instead of a list). Not a big deal, but now all tests pass.

PaulWestenthanner commented 1 year ago

Great stuff you've been doing. I'll review your code probably on Friday or the weekend, since I'll be a little busy for the rest of the week. Your support (and thorough knowledge of pipelines) is greatly appreciated.

JaimeArboleda commented 1 year ago

Thanks a lot for your kind words. It's been a pleasure. :)

bmreiniger commented 1 year ago

Did #398 leave anything from this issue to do? (If so, can you summarize it, maybe spinning off a new issue; and if not, close the issue?)

PaulWestenthanner commented 1 year ago

Hi Ben, I've summarized the open point in #406 and will close this one