seoungwugoh / STM

Video Object Segmentation using Space-Time Memory Networks

Question about class imbalance for training #9

chenz97 closed this issue 4 years ago

chenz97 commented 4 years ago

Hello, thanks for your great work and code!

When I tried to train the model myself, I found that class imbalance seems to be a problem. Background pixels far outnumber foreground pixels, which makes training difficult. Could you please tell me how you solved this? Did you use some kind of re-weighting, or something else? Thank you very much!

seoungwugoh commented 4 years ago

We used simple standard cross-entropy and found no training issues. Try filtering out very small objects in the training samples, as well as samples with no objects left after the random crop.
Or you may try class re-weighting as in the OSVOS paper.
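
For readers who want to try both suggestions, here is a minimal sketch in plain Python (nested lists stand in for tensors). It is not the authors' code. The `min_pixels` threshold and the function names are hypothetical, and the weighting follows the class-balanced cross-entropy idea used in OSVOS, where foreground terms are weighted by the background fraction and vice versa. The loss is averaged over pixels here for convenience.

```python
import math

def foreground_pixels(mask):
    """Count non-zero (foreground) pixels in a 2-D mask given as nested lists."""
    return sum(1 for row in mask for v in row if v != 0)

def keep_sample(mask, min_pixels=100):
    """Return True if the cropped mask still contains a large enough object.

    min_pixels is a hypothetical threshold; tune it for your data.
    """
    return foreground_pixels(mask) >= min_pixels

def class_balanced_bce(probs, mask, eps=1e-7):
    """Class-balanced binary cross-entropy for one mask.

    probs holds predicted foreground probabilities, same shape as mask.
    """
    total = sum(len(row) for row in mask)
    fg = foreground_pixels(mask)
    beta = (total - fg) / total  # background fraction, used as foreground weight
    loss = 0.0
    for p_row, m_row in zip(probs, mask):
        for p, m in zip(p_row, m_row):
            p = min(max(p, eps), 1 - eps)  # avoid log(0)
            if m != 0:
                loss -= beta * math.log(p)
            else:
                loss -= (1 - beta) * math.log(1 - p)
    return loss / total
```

With this weighting, a mask that is 75% background gives each foreground pixel three times the weight of a background pixel, so both classes contribute equally when the prediction is uniform.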

chenz97 commented 4 years ago

Hello @seoungwugoh, thanks for your timely reply! I will try your suggestions. Thanks a lot!

chenz97 commented 4 years ago

Hi @seoungwugoh, sorry to bother you again. I noticed that you convert the label to one-hot in dataset.py, so when you trained, did you use nn.BCELoss or nn.CrossEntropyLoss? And do you have any idea why the one you chose is preferable to the other? Thanks a lot!

seoungwugoh commented 4 years ago

@chenz97 We used nn.CrossEntropyLoss, as our network outputs a 2-channel map. It is my old habit to use nn.CrossEntropyLoss and softmax over nn.BCELoss and sigmoid. There should be no big difference, though.
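
To illustrate why the two choices behave so similarly, here is a small sketch in plain Python (function names are made up for this example). For a single pixel, cross-entropy over a 2-channel softmax equals binary cross-entropy on a sigmoid applied to the difference of the two logits, so the two losses differ mainly in how the network output is parameterized.

```python
import math

def softmax_ce_2ch(bg_logit, fg_logit, label):
    """Cross-entropy over a 2-channel softmax; label is 0 (BG) or 1 (FG)."""
    logits = (bg_logit, fg_logit)
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]

def sigmoid_bce(logit, label):
    """Binary cross-entropy on a single sigmoid logit; label is 0 or 1."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

# The two losses agree when the sigmoid logit is fg_logit - bg_logit.
bg, fg = 0.3, 1.7
for label in (0, 1):
    assert abs(softmax_ce_2ch(bg, fg, label) - sigmoid_bce(fg - bg, label)) < 1e-9
```

One practical note: in PyTorch, nn.CrossEntropyLoss takes raw logits and class-index targets, while nn.BCELoss takes probabilities (after a sigmoid) and float targets of the same shape.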

chenz97 commented 4 years ago

Hi @seoungwugoh, thanks for your reply. So even in the multi-object case, the loss is calculated separately for each object, instead of stacking the objects along the channel dimension and using a one-channel GT mask (with values up to K) to compute the loss with a single nn.CrossEntropyLoss, right? Thanks a lot!

seoungwugoh commented 4 years ago

@chenz97 For multi-object cases, the loss is computed once for all objects, after the soft aggregation operation. In soft aggregation, the per-object probability maps are combined into a single probability map of size [H x W x (O+1)], where O is the number of objects and the one additional channel is for the background (BG).

See the supplementary materials for the details: http://openaccess.thecvf.com/content_ICCV_2019/html/Oh_Video_Object_Segmentation_Using_Space-Time_Memory_Networks_ICCV_2019_paper.html
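
A rough per-pixel sketch of this idea in plain Python follows. It assumes one common formulation of soft aggregation: the background probability is the product of (1 - p) over all objects, and the O+1 values are then normalized through their odds p / (1 - p). Whether this matches the paper in every detail is an assumption, and the function names are hypothetical. The supplementary material remains the authoritative description.

```python
import math

def soft_aggregate(obj_probs, eps=1e-7):
    """Merge independent per-object probabilities at one pixel.

    obj_probs: list of O foreground probabilities, one per object.
    Returns O+1 probabilities (index 0 is background) that sum to 1.
    """
    bg = 1.0
    for p in obj_probs:
        bg *= (1.0 - p)  # pixel is background only if it belongs to no object
    probs = [bg] + list(obj_probs)
    probs = [min(max(p, eps), 1 - eps) for p in probs]  # keep odds finite
    odds = [p / (1 - p) for p in probs]
    z = sum(odds)
    return [o / z for o in odds]

def pixel_ce(agg_probs, label):
    """Cross-entropy at one pixel; label is in 0..O, with 0 meaning background."""
    return -math.log(agg_probs[label])
```

For example, `soft_aggregate([0.9, 0.2])` returns three values that sum to 1, with the largest at index 1 (the first object). A single cross-entropy over these O+1 channels then covers all objects at once, with a GT mask holding labels 0..O.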

chenz97 commented 4 years ago

Hi @seoungwugoh, thanks for your reply! I got it.