How to deal with the imbalanced data problem?

jiegzhan / multi-class-text-classification-cnn-rnn

Classify Kaggle San Francisco Crime Description into 39 classes. Build the model with CNN, RNN (GRU and LSTM) and Word Embeddings in TensorFlow.

https://www.kaggle.com/c/sf-crime/data

Apache License 2.0


How to deal with the imbalanced data problem? #27

Open heinze007 opened 7 years ago

heinze007 commented 7 years ago

I tried to adapt the code to my own text classification data (47 classes, 42,000 records) and found that the classifier tends to predict the larger classes, such as THEFT and ASSAULT. How do you deal with imbalanced data so that the predictions are more balanced across classes?

heinze007 commented 7 years ago

I've tried replacing the loss function, switching from cross entropy to weighted cross entropy so that the smaller classes get more weight. It works fairly well, but the accuracy only reaches around 70%...
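
For anyone reading along, here is a minimal sketch of what weighted cross entropy could look like. It is not the code from this repo or from my model; it is plain Python so the arithmetic is easy to follow. The helper names (`inverse_frequency_weights`, `weighted_cross_entropy`), the toy label counts, and the ARSON class are all made up for illustration. The weights use the common inverse-frequency heuristic `n_samples / (n_classes * class_count)`, which is an assumption; other weighting schemes are possible.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Return {class: n_samples / (n_classes * class_count)}.

    Rare classes get weights above 1, frequent classes below 1.
    This is one common heuristic, not the only option.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_entropy(batch_logits, batch_labels, class_index, weights):
    """Cross entropy where each example is scaled by its true class weight.

    The sum is normalised by the total weight in the batch, so the loss
    stays on the same scale as the unweighted version.
    """
    total_loss = 0.0
    total_weight = 0.0
    for logits, label in zip(batch_logits, batch_labels):
        probs = softmax(logits)
        w = weights[label]
        total_loss += -w * math.log(probs[class_index[label]])
        total_weight += w
    return total_loss / total_weight


# Hypothetical toy label distribution (counts are invented).
labels = ["THEFT"] * 8 + ["ASSAULT"] * 3 + ["ARSON"] * 1
class_index = {"THEFT": 0, "ASSAULT": 1, "ARSON": 2}
weights = inverse_frequency_weights(labels)

# Two batches with the same mistake (confidently predicting THEFT),
# once for a true THEFT/ARSON pair and once for a true THEFT/THEFT pair.
logits = [[3.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
loss_with_rare_miss = weighted_cross_entropy(
    logits, ["THEFT", "ARSON"], class_index, weights
)
loss_all_correct = weighted_cross_entropy(
    logits, ["THEFT", "THEFT"], class_index, weights
)
print(weights)
print(loss_with_rare_miss, loss_all_correct)
```

With these weights, misclassifying a rare-class example contributes far more to the loss than a frequent-class example, which pushes the model away from always predicting the big classes. Overall accuracy can drop when doing this, since accuracy is dominated by the frequent classes, so a per-class metric may be a fairer way to judge the result.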