rykov8 / ssd_keras

Port of Single Shot MultiBox Detector to Keras
MIT License

hard negative mining #5

Closed: mks0601 closed this issue 7 years ago

mks0601 commented 7 years ago

Hi. In the last part of ssd_train.py, you add pos_conf_loss and neg_conf_loss. When you calculate neg_conf_loss, you just select the top_k boxes from max_conf. However, I think max_conf can include positive boxes (those matched to a gt), because you did not restrict max_conf to entries where y_pred[:,:,-8] is '0' or y_pred[:,:,4] is '1'. What do you think about it?

Also, I'm implementing SSD300 on PASCAL VOC on my own. However, when I draw the confusion matrix for each epoch, most of the samples are biased towards the negative class (the background class). Can you give me any comment?

rykov8 commented 7 years ago

@mks0601 that's a good comment, thanks! You are right, there is a mistake in the negative mining. I believe the easiest solution is to change this line

_, indices = tf.nn.top_k(max_confs, k=num_neg_batch)

to the following one:

_, indices = tf.nn.top_k(max_confs * (1 - y_true[:, :, -8]), k=num_neg_batch)

If I am right, this sets to 0 the confidences of all priors matched to a gt box, which is why only negative boxes will be selected.
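
To illustrate the effect of the mask on toy values (hypothetical numbers, not from the repo):

```python
import numpy as np

# Hypothetical confidences for 5 priors of one image; priors 1 and 3
# were matched to ground truth (the indicator from y_true[:, :, -8]).
max_confs = np.array([0.9, 0.8, 0.7, 0.6, 0.5])
matched = np.array([0.0, 1.0, 0.0, 1.0, 0.0])

masked = max_confs * (1 - matched)
print(masked)  # [0.9 0.  0.7 0.  0.5] -> top_k now picks only unmatched priors
```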

As for your second question, the simplest explanation that comes to my mind is that there is some problem with the negative mining in your implementation; probably you add too many negative priors to the loss. I have trained SSD on a different dataset and haven't noticed such a bias. I suggest asking the authors of the original paper whether they have observed this behaviour.

mks0601 commented 7 years ago

Thanks for answering. I have a question about the loss function, because some of the notation in the paper is ambiguous.

It seems you calculate the confidence loss for all prior boxes and then select the positive and negative entries from the loss tensor.

When you calculate the predicted probabilities, do you use all classes (including background) in the softmax (exp(c) / sum_exp(c)), or do you exclude background?

Do you use all classes (positive classes (dog, cat, ...) plus the background class) when you calculate the confidence loss of every prior box? In other words, do you set (batchSz, boxNum, classNum) as the predictions and (batchSz, boxNum, 1) as the gt to calculate the confidence loss? (classNum includes the background class.)

If that is correct: if I already have the positive and negative boxes (no need to select them from all boxes as you do), is calculating the confidence loss only for the boxes I already have the same as your implementation?

rykov8 commented 7 years ago

I tried to follow the original Caffe implementation of the multibox_loss_layer. If I got everything right, they include background in the softmax activation and therefore in the confidence loss. So, yes, I set (batchSz, boxNum, classNum + 4 + 8) as the predictions, where classNum includes the background class (e.g. for PASCAL it is 21: 20 classes + background), 4 is the predicted "modification" of the prior, and 8 numbers hold the prior box coordinates.

For the ground truth I assume one-hot encoding that also includes the background class, so it is also (batchSz, boxNum, classNum + 4 + 8). Actually, y_true[:, :, -8] (which denotes whether the prior box was matched to some gt box) is redundant given the background class in the ground truth, but Keras expects the prediction and gt tensors to have the same shape, so I decided to use the unnecessary parts of the gt tensor for some convenient things (we don't need to provide the prior boxes' coordinates in the gt).

Unfortunately, I can't imagine how you can have the positive and negative boxes fixed. For every image the set of priors considered positive is different, and, moreover, the priors considered negative are also selected dynamically, based on the number of positives and the net's predictions. Last but not least, the L1-smooth loss in the original implementation is added only for positives. So, if you explain how you have selected your positives and negatives, I can try to answer.
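
If it helps, here is a minimal sketch of the gt tensor layout described above, assuming the slicing used in this repo's loss (the first 4 values are the encoded offsets, then the classNum one-hot confidences with background at index 0, then 8 extra slots, of which slot -8 is the match indicator); the concrete numbers are toy values:

```python
import numpy as np

batch_sz, box_num, class_num = 2, 10, 21  # toy sizes; 21 = 20 PASCAL classes + background

# Assumed layout along the last axis:
#   [:4]    -> 4 encoded box offsets (loc targets)
#   [4:-8]  -> class_num one-hot confidences, background included
#   [-8]    -> 1 if the prior was matched to a gt box, else 0
#   [-7:]   -> unused in y_true (kept only so y_true matches y_pred's shape)
y_true = np.zeros((batch_sz, box_num, 4 + class_num + 8), dtype=np.float32)

# Unmatched priors are labelled background (assumed to be class index 0):
y_true[:, :, 4] = 1.0

# Example: mark prior 7 of image 0 as matched to class 5 (all values hypothetical):
y_true[0, 7, 4] = 0.0        # no longer background
y_true[0, 7, 4 + 5] = 1.0    # one-hot class label
y_true[0, 7, :4] = 0.1       # dummy encoded offsets
y_true[0, 7, -8] = 1.0       # match indicator
```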

mks0601 commented 7 years ago

Thanks for answering. So you classify the given boxes into 21 classes on PASCAL VOC at both training and test time. However, at training time, if I treat the background as just another category, the number of positive boxes for, say, 'aeroplane' would be much smaller than the number of background boxes. I think this makes the classification problem very hard, and I am surprised it can be learned at all.

Anyway, I do not actually have a fixed set of pos and neg boxes. I mean that in torch7, after I select the pos and neg sets for each image in the mini-batch, I can extract the CNN output tensors corresponding to the pos and neg boxes. After extracting them, I can compute the cross-entropy loss and backward gradients for the selected tensors only.

So in my case, y_pred is (batchSz, boxNum, classNum), where boxNum is the sum of the number of matched pos boxes and mined neg boxes, and classNum includes the background class.
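
For comparison, a rough TensorFlow sketch of that gather-then-cross-entropy idea (the function and argument names are mine, not from either codebase):

```python
import tensorflow as tf

def selected_conf_loss(y_pred_conf, selected_idx, selected_labels):
    """Cross-entropy computed only over a pre-selected set of boxes.

    y_pred_conf:     (batch_sz, box_num, class_num) softmax outputs
    selected_idx:    (num_sel, 2) pairs of (image_idx, box_idx) for the
                     matched pos boxes plus the mined neg boxes
    selected_labels: (num_sel,) integer class labels, background included
    """
    picked = tf.gather_nd(y_pred_conf, selected_idx)   # (num_sel, class_num)
    picked = tf.clip_by_value(picked, 1e-15, 1.0)      # numerical safety
    one_hot = tf.one_hot(selected_labels, tf.shape(picked)[-1])
    return -tf.reduce_mean(tf.reduce_sum(one_hot * tf.math.log(picked), axis=-1))
```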

rykov8 commented 7 years ago

You are right that during training we have a significant imbalance between the positive and negative training examples. That is why the authors introduce hard negative mining, and I reproduce their idea of penalizing at most 3 times more negative boxes than positives, selected based on the current predictions. In other words, we penalize those background priors for which the net predicts some category class (excluding background) with the highest confidence, i.e. the priors with the highest classification error: the net "sees" something where it should see background.
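
A simplified numpy sketch of that selection rule for a single image (the 3:1 cap mirrors the description above; everything else, including the function name, is illustrative):

```python
import numpy as np

def mine_hard_negatives(conf, matched, neg_pos_ratio=3):
    """Return indices of the hardest negative priors for one image.

    conf:    (box_num,) highest non-background class confidence per prior
    matched: (box_num,) 1 if the prior is matched to a gt box, else 0
    """
    num_neg = neg_pos_ratio * int(matched.sum())
    # Zero out the positives so the sort only sees background priors,
    # then keep the negatives the net is most confidently wrong about.
    neg_conf = conf * (1 - matched)
    return np.argsort(-neg_conf)[:num_neg]

# Toy usage: 1 positive -> at most 3 negatives are penalized.
conf = np.array([0.9, 0.2, 0.8, 0.1, 0.7, 0.3])
matched = np.array([1, 0, 0, 0, 0, 0])
print(mine_hard_negatives(conf, matched))  # [2 4 5]
```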

I cannot say that I am familiar with torch, but if by selecting the pos and neg sets for each image in the mini-batch you mean the procedure I described above, then I believe your implementation is the same as mine. One can imagine other strategies for selecting the pos and neg sets, but the main idea is to penalize not all background priors, only some part of them.

timothyman commented 7 years ago

@rykov8 Thanks for putting your implementation of SSD on github. It helped me to get started with my own project.

I am running into the same phenomenon as @mks0601 with my own dataset (not PASCAL VOC), i.e. classification is biased towards the background class. I am new to this, but from what I understand, hard negative mining tries to fight false positives, whereas in our case we have more false negatives, correct? If so, is the strength parameter for hard negative mining set too high? Is there an easy way to change it?

Thank you for your reply.