There is a decent article on Towards Data Science and another good one on KDNuggets dealing with imbalanced data.
The main problem with resampling is that it moves the training data away from the real-world distribution the model will actually face, so the model may not generalize well.
One option is to use a more appropriate performance measure for imbalanced data, such as F1.
I've dropped several more highly correlated variables, and with a simple tree classifier I'm now seeing an F1 score of around 0.865. I'm much happier with that than with the misleadingly high "accuracy" of >99.9%.
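Roughly what I mean, as a minimal sketch (it uses a synthetic stand-in dataset since the real features aren't shown in this thread):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score, accuracy_score

# Synthetic stand-in for the real dataset: ~1% positive class.
X, y = make_classification(n_samples=20_000, weights=[0.99], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pred = clf.predict(X_test)

# Accuracy looks near-perfect on imbalanced data; F1 tells the real story.
print("accuracy:", accuracy_score(y_test, pred))
print("f1:", f1_score(y_test, pred))
```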
I read both of the articles. I'd like to try oversampling and see whether it works well.
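Something like this sketch is what I have in mind (it assumes the imbalanced-learn package, which isn't mentioned elsewhere in this thread, and reuses the train/test split from the sketch above; the key point is to resample only the training split so the test set keeps the real class balance):

```python
from imblearn.over_sampling import RandomOverSampler  # SMOTE is a drop-in alternative
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

# Oversample only the training split; never touch the test split.
ros = RandomOverSampler(random_state=0)
X_train_res, y_train_res = ros.fit_resample(X_train, y_train)

clf = DecisionTreeClassifier(random_state=0).fit(X_train_res, y_train_res)
print("f1 after oversampling:", f1_score(y_test, clf.predict(X_test)))
```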
I am using scale_pos_weight in XGBClassifier, which compensates for heavy class imbalance during training.
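For reference, a minimal sketch of that setup (the heuristic value for scale_pos_weight follows the XGBoost docs; variable names reuse the split from the sketches above):

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.metrics import f1_score

# Usual heuristic from the XGBoost docs: count(negative) / count(positive).
neg, pos = np.bincount(y_train)  # assumes y_train is a 0/1 array
model = XGBClassifier(
    scale_pos_weight=neg / pos,  # up-weight the rare positive class
    eval_metric="aucpr",         # a PR-based metric suits heavy imbalance
)
model.fit(X_train, y_train)
print("f1 with scale_pos_weight:", f1_score(y_test, model.predict(X_test)))
```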
We used TPOT and found XGBoost to be the best-performing model. F1/AUPRC (area under the precision-recall curve) is the better metric; remind me and I'll forward you the relevant article.
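A sketch of what a TPOT run scored on F1 could look like (the generations, population size, and other settings here are assumptions, not the settings we actually used):

```python
from tpot import TPOTClassifier

tpot = TPOTClassifier(
    generations=5,
    population_size=20,
    scoring="f1",        # or "average_precision" for an AUPRC-style objective
    random_state=0,
    verbosity=2,
)
tpot.fit(X_train, y_train)
print("held-out f1:", tpot.score(X_test, y_test))  # score() uses the chosen scorer
tpot.export("best_pipeline.py")  # emits the winning pipeline as a script
```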
The dataset is heavily imbalanced, with only a very small percentage of records corresponding to a TBI.