There is a decent article on Towards Data Science and another good one on KDNuggets dealing with imbalanced data.
The main problem with resampling is that it moves the training data away from the real-world distribution the model will actually face, so the model may not generalize well.
One option is to use a more appropriate performance measure for imbalanced data, such as F1.
I've dropped several more highly correlated variables, and with a simple tree classifier I'm now seeing an F1 score of around 0.865. I'm much happier with that than with the misleadingly high "accuracy" of >99.9%.
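Roughly what I mean, as a minimal sketch (it uses a synthetic stand-in dataset since the real features aren't shown in this thread):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score, accuracy_score

# Synthetic stand-in for the real dataset: ~1% positive class.
X, y = make_classification(n_samples=20_000, weights=[0.99], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pred = clf.predict(X_test)

# Accuracy looks near-perfect on imbalanced data; F1 tells the real story.
print("accuracy:", accuracy_score(y_test, pred))
print("f1:", f1_score(y_test, pred))
```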
I read both of the articles. I'd like to try oversampling and see whether it works well.
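Something like this sketch is what I have in mind (it assumes the imbalanced-learn package, which isn't mentioned elsewhere in this thread, and reuses the train/test split from the sketch above; the key point is to resample only the training split so the test set keeps the real class balance):

```python
from imblearn.over_sampling import RandomOverSampler  # SMOTE is a drop-in alternative
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

# Oversample only the training split; never touch the test split.
ros = RandomOverSampler(random_state=0)
X_train_res, y_train_res = ros.fit_resample(X_train, y_train)

clf = DecisionTreeClassifier(random_state=0).fit(X_train_res, y_train_res)
print("f1 after oversampling:", f1_score(y_test, clf.predict(X_test)))
```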
I am using scale_pos_weight in XGBClassifier, which compensates for heavy class imbalance during training.
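For reference, a minimal sketch of that setup (the heuristic value for scale_pos_weight follows the XGBoost docs; variable names reuse the split from the sketches above):

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.metrics import f1_score

# Usual heuristic from the XGBoost docs: count(negative) / count(positive).
neg, pos = np.bincount(y_train)  # assumes y_train is a 0/1 array
model = XGBClassifier(
    scale_pos_weight=neg / pos,  # up-weight the rare positive class
    eval_metric="aucpr",         # a PR-based metric suits heavy imbalance
)
model.fit(X_train, y_train)
print("f1 with scale_pos_weight:", f1_score(y_test, model.predict(X_test)))
```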
We used TPOT and found XGBoost to be the best-performing model. F1/AUPRC (area under the precision-recall curve) is the better metric; remind me and I'll forward you the relevant article.
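A sketch of what a TPOT run scored on F1 could look like (the generations, population size, and other settings here are assumptions, not the settings we actually used):

```python
from tpot import TPOTClassifier

tpot = TPOTClassifier(
    generations=5,
    population_size=20,
    scoring="f1",        # or "average_precision" for an AUPRC-style objective
    random_state=0,
    verbosity=2,
)
tpot.fit(X_train, y_train)
print("held-out f1:", tpot.score(X_test, y_test))  # score() uses the chosen scorer
tpot.export("best_pipeline.py")  # emits the winning pipeline as a script
```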
The dataset is heavily imbalanced, with only a very small percentage of records corresponding to a TBI.