qfgaohao / pytorch-ssd

MobileNetV1, MobileNetV2, VGG based SSD/SSD-lite implementation in Pytorch 1.0 / Pytorch 0.4. Out-of-box support for retraining on Open Images dataset. ONNX and Caffe2 support. Experiment Ideas like CoordConv.
https://medium.com/@smallfishbigsea/understand-ssd-and-implement-your-own-caa3232cd6ad
MIT License

Wrong confidence Softmax outputs from models trained with custom data and custom implementation #119

Open codingzencc opened 4 years ago

codingzencc commented 4 years ago

Thanks for this awesome implementation!

I used your pipeline to train on a custom dataset with 2 classes (one for the background and one for the object, since the dataset contains only a single object type). I first modified your VOC dataloader to load my custom dataset. I trained for up to 30 epochs, got an mAP of about 85%, and the results looked good for a baseline. The softmax outputs are sensible: the background class gets high probabilities and the object class gets low probabilities for most boxes. E.g., while debugging I ran F.softmax(confidences[3], dim=2) and the output was:

tensor([[[0.9768, 0.0232], [0.9870, 0.0130], [0.9799, 0.0201], [0.9844, 0.0156], [0.9764, 0.0236], [0.9813, 0.0187], [0.9867, 0.0133], [0.9884, 0.0116], [0.9872, 0.0128], [0.9907, 0.0093], [0.9801, 0.0199], [0.9821, 0.0179], [0.9889, 0.0111], [0.9872, 0.0128], [0.9912, 0.0088], [0.9825, 0.0175], [0.9857, 0.0143], [0.9703, 0.0297], [0.9604, 0.0396], [0.9536, 0.0464], [0.9770, 0.0230], [0.9749, 0.0251], [0.9900, 0.0100], [0.9703, 0.0297], [0.9703, 0.0297], [0.9674, 0.0326], [0.9892, 0.0108], [0.9872, 0.0128], [0.9875, 0.0125], [0.9742, 0.0258], [0.9825, 0.0175], [0.9739, 0.0261], [0.9940, 0.0060], [0.9780, 0.0220], [0.9919, 0.0081], [0.9521, 0.0479], [0.9629, 0.0371], [0.9741, 0.0259], [0.9720, 0.0280], [0.9738, 0.0262], [0.9778, 0.0222], [0.9709, 0.0291], [0.9773, 0.0227], [0.9692, 0.0308], [0.9769, 0.0231], [0.9823, 0.0177], [0.9821, 0.0179], [0.9762, 0.0238], [0.9783, 0.0217], [0.9781, 0.0219], [0.9793, 0.0207], [0.9798, 0.0202], [0.9805, 0.0195], [0.9638, 0.0362]]])

The object-class scores of the first head, sorted with torch.sort(F.softmax(confidences[0], dim=2)[0][:, 1]), are:

torch.return_types.sort( values=tensor([6.7742e-04, 7.2989e-04, 7.3594e-04, ..., 8.8332e-01, 9.0915e-01, 9.1045e-01]), indices=tensor([ 121, 103, 367, ..., 1083, 1080, 969]))

In the output of the first head it detects the object class in most test images, and the margin between the scores is large as well. Here it scored .91 and .90 where the object is, while the scores for the remaining bounding boxes are very low (in the range of e-03 or e-04) where there is no object. This output is reasonable, since the model is able to detect background at most anchor locations. The other confidence heads behaved similarly: a few anchor boxes got high scores for the object class and the output made sense.
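For reference, the per-head inspection above boils down to something like this (a minimal sketch; `confidences` is assumed to be the list of raw class logits, one tensor per SSD confidence head, each of shape [batch, num_anchors, num_classes]):

```python
import torch
import torch.nn.functional as F

def inspect_confidence_heads(confidences):
    # confidences: list of raw logits, one per SSD head,
    # each of shape [batch, num_anchors_for_head, num_classes] (num_classes == 2 here).
    for i, logits in enumerate(confidences):
        probs = F.softmax(logits, dim=2)      # per-anchor class probabilities
        obj_scores = probs[0][:, 1]           # object-class column, first image in the batch
        top_vals, top_idx = torch.sort(obj_scores, descending=True)
        print(f"head {i}: top object scores {top_vals[:5].tolist()} "
              f"at anchors {top_idx[:5].tolist()}")
```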

To train and experiment more, I've integrated your repository into my custom training pipeline by modularizing the code so that a class exposes functions to build_net, train and eval the model. I have not made any changes to the files in the vision folder that build the model, except for a few functional changes. I'm able to properly load models trained in your implementation into my pipeline without problems, and I can load the model and initialize it with pretrained weights from mb2-ssd-lite-mp-0_686.pth. But now that I have trained mobilenet2-ssd-lite in my integrated code for 30 epochs, the softmax outputs from my integrated pipeline are high for all the anchor boxes in the object class. I again ran F.softmax(confidences[3], dim=2); this is the output:

tensor([[[0.6760, 0.3240], [0.6753, 0.3247], [0.6753, 0.3247], [0.6748, 0.3252], [0.6757, 0.3243], [0.6758, 0.3242], [0.6747, 0.3253], [0.6755, 0.3245], [0.6742, 0.3258], [0.6738, 0.3262], [0.6728, 0.3272], [0.6743, 0.3257], [0.6751, 0.3249], [0.6749, 0.3251], [0.6743, 0.3257], [0.6743, 0.3257], [0.6742, 0.3258], [0.6739, 0.3261], [0.6752, 0.3248], [0.6737, 0.3263], [0.6746, 0.3254], [0.6749, 0.3251], [0.6749, 0.3251], [0.6750, 0.3250], [0.6765, 0.3235], [0.6748, 0.3252], [0.6748, 0.3252], [0.6724, 0.3276], [0.6756, 0.3244], [0.6727, 0.3273], [0.6747, 0.3253], [0.6749, 0.3251], [0.6747, 0.3253], [0.6739, 0.3261], [0.6741, 0.3259], [0.6739, 0.3261], [0.6767, 0.3233], [0.6770, 0.3230], [0.6768, 0.3232], [0.6769, 0.3231], [0.6775, 0.3225], [0.6766, 0.3234], [0.6770, 0.3230], [0.6755, 0.3245], [0.6751, 0.3249], [0.6752, 0.3248], [0.6752, 0.3248], [0.6753, 0.3247], [0.6756, 0.3244], [0.6764, 0.3236], [0.6754, 0.3246], [0.6755, 0.3245], [0.6751, 0.3249], [0.6762, 0.3238]]])

The output of torch.sort(F.softmax(confidences[0], dim=2)[0][:, 1]) is:

torch.return_types.sort( values=tensor([0.1053, 0.1532, 0.1655, ..., 0.6778, 0.7770, 0.8186]), indices=tensor([1084, 970, 968, ..., 969, 1080, 1083]))

So the model is able to learn where the object is with high confidence, but the later heads are not able to properly tell where it is not. In confidence[0], compared to the model trained in your original implementation above, the scores of the other anchors in my integrated run are not nearly as far apart. In confidence[1] through confidence[5], the object-class confidences are very close to each other within each head, but sit in a much higher range. Ideally the anchor confidences should be in the 0.0xxx range, as they are in your implementation. E.g., in confidence[1] all the class scores might be 0.29xx; in confidence[3], as shown above, they are all in the 0.32xx range, and this pattern repeats. Sometimes they land in the 0.4xxx range for all the anchor boxes. They only differ in the 3rd and 4th decimal place.

I'm getting an mAP of 71% on the integrated version as well. The consequence is that when I run inference on the test images, NMS ends up outputting a lot of bounding boxes all over the image. If I increase the probability threshold to e.g. 0.5, the clustered bounding boxes below 0.5 go away and the object class is still detected, but the model misses cases it could detect with the weights trained in the original implementation.
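For context, my inference path is roughly the following sketch (assuming the repo's usual entry points create_mobilenetv2_ssd_lite and create_mobilenetv2_ssd_lite_predictor; the checkpoint path and threshold are illustrative):

```python
import cv2
from vision.ssd.mobilenet_v2_ssd_lite import (
    create_mobilenetv2_ssd_lite,
    create_mobilenetv2_ssd_lite_predictor,
)

# 2 classes: background + the single object class.
net = create_mobilenetv2_ssd_lite(num_classes=2, is_test=True)
net.load("models/my-mb2-ssd-lite.pth")  # illustrative checkpoint path
predictor = create_mobilenetv2_ssd_lite_predictor(net, candidate_size=200)

image = cv2.cvtColor(cv2.imread("test.jpg"), cv2.COLOR_BGR2RGB)
# Raising prob_threshold suppresses the flood of near-identical low-margin boxes,
# at the cost of missing detections the original-pipeline model still finds.
boxes, labels, probs = predictor.predict(image, top_k=10, prob_threshold=0.5)
```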

The losses in your original code go lower. When I train in my integration the losses go down as well, but not as much. I used a cosine LR scheduler in both runs (the setup is roughly the sketch after the loss plots below). I trained mobilenet-1-ssd from my implementation as well, but the results are the same.

Original implementation losses: [loss curve image: orig losses]

Losses from my pipeline: [loss curve image: inter loss]
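The schedule both runs used is roughly this (a sketch; the learning rate, T_max, and the train_one_epoch helper are illustrative, not the exact values or code):

```python
import torch

# Sketch of the optimizer + cosine schedule used in both runs (values illustrative).
optimizer = torch.optim.SGD(net.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(30):
    train_one_epoch(net, train_loader, optimizer)  # hypothetical training step
    scheduler.step()
```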

To me it looks like the confidence heads confidence[1] through confidence[5] are not learning, and even confidence[0] is not learning that well. I'm not able to tell clearly where I'm going wrong; any suggestions for debugging such a problem are welcome. A PyTorch forum thread is the closest thing I've found to the current problem.
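One check I can run (a generic PyTorch sketch, not something from the original implementation) is whether the classification-head weights actually move over a few optimizer steps:

```python
import torch

# Snapshot the classification-head weights, run a few training steps, then compare.
# `classification_headers` is the attribute name I believe the repo's SSD class uses;
# if it differs in your copy, substitute the right module list. The training helper
# below is hypothetical.
before = {name: p.detach().clone()
          for name, p in net.classification_headers.named_parameters()}

run_a_few_training_steps(net)  # hypothetical: a handful of forward/backward/step calls

for name, p in net.classification_headers.named_parameters():
    delta = (p.detach() - before[name]).abs().max().item()
    print(f"{name}: max weight change {delta:.2e}")
```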

codingzencc commented 4 years ago

Weight decay was 0.1. It should have been 0.0005.
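In other words, the fix was in the optimizer construction (a sketch; the other hyperparameters are illustrative):

```python
import torch

# Buggy run: weight_decay=0.1 is 200x stronger than intended.
# optimizer = torch.optim.SGD(net.parameters(), lr=0.01, momentum=0.9, weight_decay=0.1)

# Corrected run, matching the original training setup's 0.0005:
optimizer = torch.optim.SGD(net.parameters(), lr=0.01, momentum=0.9, weight_decay=0.0005)
```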

Nannigalaxy commented 3 years ago

Can you please explain how you got the mAP results?