idptools / parrot

Python package for protein sequence-based bidirectional recurrent neural network. Generalizable to a variety of protein bioinformatic applications.
MIT License

severe class imbalance #7

Closed andrefaure closed 2 years ago

andrefaure commented 2 years ago

Thanks for the easy to use software!

I am trying to use parrot for a classification task and got the following warning:

```
#############################################
WARNING: Severe class imbalance detected in dataset.

Class frequencies:
    1 :  4.3%
    0 : 95.7%

Predictions will be strongly skewed towards overrepresented classes.
Classification not recommended with current dataset.
#############################################
```

What is the best way to deal with this? If I subsample the negative sequences I will lose a ton of valuable training data...

Thanks!

degriffith commented 2 years ago

Hi Andre,

Sorry for the delayed response - turns out I didn't have notifications set up for this repo!

Class imbalance is an annoying, but often unavoidable issue in deep learning. If particularly strong imbalances are not dealt with prior to training, most deep learning classifiers will just learn to predict the majority class at the expense of the minority class (e.g. your network would likely learn to just predict every sequence as "0"). Even though this might lead to a higher accuracy, this generally is not what we want since the whole point of building a classifier for imbalanced data is to identify cases of this minority class.

The two main ways to deal with class imbalance are oversampling and undersampling; however, both come with drawbacks, and there is no one-size-fits-all solution. The right choice also depends on the size of your dataset: undersampling works better when you have more data, while oversampling can work better with smaller datasets.

Subsampling your majority class can potentially lose some valuable sequences; however, for some data it might be possible to subsample in a rational way. For example, you could try subsampling your majority class based on sequence similarity, such that the remaining sequences are as distinct as possible. It's also worth trying a few different class ratios. Although networks tend to perform best with a 1:1 class ratio, you can still potentially get accurate and unbiased networks with 40:60 or 30:70 ratios, for example.
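As a minimal sketch of the simplest version of this (plain random undersampling to a target class ratio, rather than similarity-based selection), something like the following could work. Note that `undersample_majority` is a hypothetical helper, not part of PARROT:

```python
import random

def undersample_majority(sequences, labels, majority_label=0,
                         target_ratio=0.5, seed=42):
    """Randomly drop majority-class examples so that the minority class
    makes up roughly target_ratio of the resulting dataset.

    Illustrative sketch only; a similarity-based version would replace
    random.sample with a selection that maximizes sequence diversity.
    """
    rng = random.Random(seed)
    minority = [(s, y) for s, y in zip(sequences, labels) if y != majority_label]
    majority = [(s, y) for s, y in zip(sequences, labels) if y == majority_label]

    # Number of majority examples to keep so that the minority class
    # is target_ratio of the final dataset.
    n_keep = int(len(minority) * (1 - target_ratio) / target_ratio)
    kept = rng.sample(majority, min(n_keep, len(majority)))

    combined = minority + kept
    rng.shuffle(combined)
    seqs, labs = zip(*combined)
    return list(seqs), list(labs)
```

Setting `target_ratio=0.4` or `0.3` gives the 40:60 or 30:70 ratios mentioned above.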

Oversampling is another option, but it is a bit trickier with sequence data. Oversampling means adding more of your minority class to the dataset to achieve a better balance, typically by sampling with replacement from your existing data. Its drawback is a risk of overfitting to the few datapoints in your minority class, and it can really mess with internal k-fold cross-validation if you are not careful.
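A sketch of the naive duplication approach (again, `oversample_minority` is a hypothetical helper, not a PARROT function) might look like:

```python
import random

def oversample_minority(sequences, labels, minority_label=1,
                        target_ratio=0.5, seed=42):
    """Duplicate minority-class examples (sampling with replacement) until
    they make up roughly target_ratio of the dataset.

    Illustrative sketch only. Apply this to the training split ONLY, after
    any cross-validation folds are drawn; otherwise duplicates of the same
    sequence can land in both the training and validation folds, which
    inflates validation performance.
    """
    rng = random.Random(seed)
    minority = [(s, y) for s, y in zip(sequences, labels) if y == minority_label]
    majority = [(s, y) for s, y in zip(sequences, labels) if y != minority_label]

    # Total minority examples needed to reach target_ratio.
    n_needed = int(len(majority) * target_ratio / (1 - target_ratio))
    extra = [rng.choice(minority) for _ in range(max(0, n_needed - len(minority)))]

    combined = majority + minority + extra
    rng.shuffle(combined)
    seqs, labs = zip(*combined)
    return list(seqs), list(labs)
```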

I would recommend setting aside some subset of your data as a test set and trying a few of these methods, or combinations of them. I would also highly recommend checking out these two resources for additional ideas:

- 8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset
- 10 Techniques to deal with Imbalanced Classes in Machine Learning
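One detail worth making explicit: hold out the test set *before* any over- or undersampling, and keep its original class balance so it reflects what the network will see in practice. A stratified split like the following sketch (a hypothetical helper, not part of PARROT) does that:

```python
import random

def stratified_split(sequences, labels, test_frac=0.2, seed=0):
    """Hold out test_frac of each class separately, so the test set keeps
    the original (imbalanced) class frequencies. Resample the training
    portion afterwards; leave the test portion untouched.

    Illustrative sketch only.
    """
    rng = random.Random(seed)
    by_class = {}
    for s, y in zip(sequences, labels):
        by_class.setdefault(y, []).append(s)

    train, test = [], []
    for y, seqs in by_class.items():
        seqs = seqs[:]
        rng.shuffle(seqs)
        n_test = max(1, int(len(seqs) * test_frac))  # at least one per class
        test += [(s, y) for s in seqs[:n_test]]
        train += [(s, y) for s in seqs[n_test:]]

    rng.shuffle(train)
    rng.shuffle(test)
    return train, test
```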

I hope that helps! Dan

andrefaure commented 2 years ago

Great thanks a lot @degriffith !