Open coconutattitude opened 10 months ago
Hey @coconutattitude, thanks for this issue and PR! Could you link the Kaggle competitions you mentioned? It's always great to see new use cases!
Sure, it was this Kaggle Playground Series competition: https://www.kaggle.com/competitions/playground-series-s3e19/overview
Problem Description
Sometimes I have a feature in my dataset that has 20 or more unique values. I noticed that some of those values could be grouped together with their frequencies, so I would end with 4-6 values instead of 20. This would help later while doing one-hot encoding and clustering. There are the cases when similiar number of frequencies happens because the categories share some kind of similiarity.
Feature Description
The chosen column will be processed in the following manner:
Alternative Solutions
No response
Additional Context
At work and during kaggle competitions I often end up working with highly cardinal data. Some of the variables have high number of unique values and high variance. I realized that in order to perform further operations and not increase too much the dimensionality of my data, I could group the data based on the frequency of its values before I perform one-hot encoding and clustering. In my context doing syntactic comparison doesn't make sense because I use machine addresses.