Adding a frequency encoder

coconutattitude commented 10 months ago

Problem Description

Sometimes I have a feature in my dataset that has 20 or more unique values. I noticed that some of those values could be grouped together with their frequencies, so I would end with 4-6 values instead of 20. This would help later while doing one-hot encoding and clustering. There are the cases when similiar number of frequencies happens because the categories share some kind of similiarity.

Feature Description

The chosen column will be processed in the following manner:

we will take the value counts of the column
we will run pandas.cut() method on the value counts of the column with the bins provided by the user ( in the future those might be quantiles)
then we will map the result to the original column which will result with a column with values grouped together basing on their frequency.

Alternative Solutions

No response

Additional Context

At work and during kaggle competitions I often end up working with highly cardinal data. Some of the variables have high number of unique values and high variance. I realized that in order to perform further operations and not increase too much the dimensionality of my data, I could group the data based on the frequency of its values before I perform one-hot encoding and clustering. In my context doing syntactic comparison doesn't make sense because I use machine addresses.

Vincent-Maladiere commented 10 months ago

Hey @coconutattitude, thanks for this issue and PR! Could you link the Kaggle competitions you mentioned? It's always great to see new use cases!

coconutattitude commented 10 months ago

Sure, it was this Kaggle Playground Series competition: https://www.kaggle.com/competitions/playground-series-s3e19/overview

skrub-data / skrub