scikit-learn-contrib / category_encoders

A library of sklearn compatible categorical variable encoders
http://contrib.scikit-learn.org/category_encoders/
BSD 3-Clause "New" or "Revised" License
2.4k stars 393 forks

[FEATURE] Categorical Variable Concatenation #345

Closed Pacman1984 closed 1 year ago

Pacman1984 commented 2 years ago

Expected Behavior

Also posted as a scikit-lego issue, but maybe it's better implemented here.

Concatenating categorical variables is a powerful feature engineering technique, often used in competitions. For background, you can watch the 9 minutes of this video: Winning Solution --> RecSys 2020 Tutorial: Feature Engineering for Recommender Systems.

Categorical Variable Concatenation is not implemented in scikit-learn or scikit-learn-contrib packages.

I have coded this feature in a separate repo catcomb and would implement this solution in category_encoders, if you agree.

Basically, it concatenates all categorical columns with each other, based on some parameters you can choose.


It's a ColumnsTransformer where you can choose

Example:

    pipe = Pipeline([("catcomb", ColumnsConcatenation(columns='auto', level=2, max_cardinality=500))])
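To make the idea concrete, here is a minimal pandas sketch of the technique itself. It is not the catcomb implementation: the function name `concat_categoricals` is hypothetical, and the meanings given to `level` (how many columns go into each combination) and `max_cardinality` (columns with more unique values are skipped) are assumptions made for illustration only.

```python
from itertools import combinations

import pandas as pd


def concat_categoricals(df, columns=None, level=2, max_cardinality=500, sep="_"):
    """Add one string column per combination of `level` categorical columns.

    Hypothetical helper for illustration; parameter semantics are assumptions.
    """
    if columns is None:
        # Assumption: treat every non-numeric column as categorical.
        columns = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]
    # Assumption: columns with too many distinct values are left out.
    columns = [c for c in columns if df[c].nunique() <= max_cardinality]

    out = df.copy()
    for combo in combinations(columns, level):
        # Join the row values of the chosen columns into a single string.
        out[sep.join(combo)] = df[list(combo)].astype(str).agg(sep.join, axis=1)
    return out


df = pd.DataFrame(
    {
        "city": ["Berlin", "Paris", "Berlin"],
        "device": ["mobile", "desktop", "desktop"],
        "plan": ["free", "pro", "pro"],
        "clicks": [3, 7, 1],
    }
)
result = concat_categoricals(df, level=2)
print(result.columns.tolist())
```

The new columns are still strings, so they would have to be passed through an encoder afterwards before a model can use them.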

PaulWestenthanner commented 2 years ago

Hi @Pacman1984

I understand that this technique can improve model quality and that you want to contribute it to a bigger framework. Both of these points make absolute sense.
I'm not quite sure, though, whether category_encoders is the correct library for this, since the method does not encode a categorical feature to a numeric one. It would be better placed in a dedicated feature engineering library, or am I getting this wrong? Unfortunately, sklearn-contrib does not have a feature engineering library. Let's keep this issue open for the time being and get more input and opinions from the community.

PaulWestenthanner commented 1 year ago

A similar discussion has already taken place in #227, where the proposal was considered off topic and too broad, which I agree with. So I'm closing this issue.