meijieru / crnn.pytorch

Convolutional recurrent network in pytorch
MIT License

Imbalanced classes #67

Closed skalinin closed 6 years ago

skalinin commented 7 years ago

Hi, my dataset is very imbalanced: for example, the class ‘q’ occurs about 100 times in the dataset, while the class ‘a’ may occur more than 10 thousand times. What should I do? How can I use class weights in your code? I think it might look like this: we have rnn_logits, the output of the RNN. What if I multiply it by class weights before passing it to the CTC loss? The CTC loss would then give greater weight to rare classes, and that would affect backpropagation. Am I right? Could you please help me?

meijieru commented 7 years ago

Operating on the logits is wrong. Balancing the data per element (per character) is not easy to do; you may have to modify the source code of the CTC loss.
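One way to see the problem with scaling logits is the small sketch below. It is only an illustration, not necessarily the reasoning behind the comment above, and the logits and class weights are made-up values. Multiplying a logit by a weight changes the predicted distribution itself rather than how much the loss cares about a class, and when the rare class has a negative logit, a larger weight pushes its probability down.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one time step over three classes: 'a', 'q', blank.
logits = [2.0, -1.0, 0.5]
# Hypothetical class weights: 'q' is rare, so it gets a large weight.
class_weights = [1.0, 10.0, 1.0]

plain = softmax(logits)
scaled = softmax([w * x for w, x in zip(class_weights, logits)])

# The "weighted" version makes the rare class less likely, not more important.
print("P(q) from the original logits:", plain[1])
print("P(q) from the scaled logits:  ", scaled[1])
```

Weighting the loss, by contrast, would have to happen inside the CTC computation, where per-character contributions are summed over alignments.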

skalinin commented 7 years ago

Thanks for answering! "Deal with source code for CTC loss" sounds pretty difficult... What else can I do? I have heard about SMOTE and data augmentation, so the classes would be balanced after generating more data. Or maybe I could compose batches so that the classes within each batch are balanced?

meijieru commented 7 years ago

That may be a solution.
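The batch-balancing idea could be sketched roughly as follows. This is an assumption-laden illustration and not code from this repository: the helper sample_weights and the toy labels are hypothetical, and each label is assumed to be a non-empty string. Every sample is weighted by the inverse frequency of its rarest character, so images containing rare characters are drawn more often.

```python
import random
from collections import Counter


def sample_weights(labels):
    """Weight each sample by the inverse count of its rarest character.

    Hypothetical helper; assumes every label is a non-empty string.
    """
    char_counts = Counter(ch for text in labels for ch in text)
    return [1.0 / min(char_counts[ch] for ch in text) for text in labels]


# Toy labels, made up for illustration: 'q' appears in only one sample.
labels = ["aaa", "aab", "aba", "baa", "aqa"]
weights = sample_weights(labels)

# Draw sample indices with replacement, proportionally to the weights.
rng = random.Random(0)
drawn = Counter(rng.choices(range(len(labels)), weights=weights, k=1000))

for idx, text in enumerate(labels):
    print(text, "weight:", round(weights[idx], 3), "times drawn:", drawn[idx])
```

In PyTorch, weights like these could be passed to torch.utils.data.WeightedRandomSampler and the sampler given to the DataLoader. Because sampling is with replacement, rare samples repeat, so pairing this with augmentation is probably sensible.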

skalinin commented 6 years ago

Thanks again for your help! Also, as I understand it, you can create as much data as you need (https://github.com/Belval/TextRecognitionDataGenerator). And from my experience, balancing the data is not as important (in the case of CRNN) as how much data you have. I needed to generate a few million images before I came close to solving my problem.