qfgaohao / pytorch-ssd

MobileNetV1, MobileNetV2, VGG based SSD/SSD-lite implementation in Pytorch 1.0 / Pytorch 0.4. Out-of-box support for retraining on Open Images dataset. ONNX and Caffe2 support. Experiment Ideas like CoordConv.
https://medium.com/@smallfishbigsea/understand-ssd-and-implement-your-own-caa3232cd6ad
MIT License
1.4k stars 533 forks

Average regression loss and classification loss nan, average regression loss inf when training vgg16 model #91

Open owenvt1 opened 4 years ago

owenvt1 commented 4 years ago

This happens when running the examples for training on VOC, as well as Open Images. My goal is to train on Open Images.

owenvt1 commented 4 years ago

Solved by reducing the learning rate to 0.00001
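Why a smaller learning rate stops the nan/inf: with a too-large step size, gradient descent overshoots and the iterate grows until it overflows to inf, after which inf - inf yields nan. A self-contained toy illustration (plain gradient descent on f(w) = w², not code from this repo):

```python
import math

def sgd_on_quadratic(lr, steps=1200, w=2.0):
    """Minimise f(w) = w^2 with plain gradient descent; return the final loss."""
    for _ in range(steps):
        grad = 2.0 * w      # df/dw
        w = w - lr * grad   # gradient-descent update
    return w * w

# Too-large step: the iterate oscillates and doubles each step, overflows to
# inf, and then inf - inf produces nan -- the same inf/nan loss pattern
# described above.
print(sgd_on_quadratic(lr=1.5))      # nan
# The learning rate reported to fix the issue stays finite.
print(sgd_on_quadratic(lr=0.00001))  # small finite loss
```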

TheRevanchist commented 4 years ago

Late to the party, but this is a problem with the new torchvision 0.5. It is hard to predict (and sometimes surfaces in places you wouldn't look), but it produces NaNs during training (or even inference). In this case it gives nan/inf (as you described) even when just training on VOC.

The solution is to downgrade torchvision (I am now using 0.2 without problems).
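One way to pin the downgrade (the comment only says "0.2"; 0.2.2 is one release on that line, so adjust the exact version to your setup):

```shell
pip install "torchvision==0.2.2"
```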

ijalalfrz commented 4 years ago

@TheRevanchist I got the same issue. I have reduced the learning rate and downgraded torchvision; which torch version do you use?

Abdob commented 3 years ago

I found this was an issue with one of the data samples' annotations:

<bndbox>
<xmin>0</xmin>
<ymin>370</ymin>
<xmax>0</xmax>
<ymax>407</ymax>
</bndbox>

xmin and xmax are the same, which doesn't make sense; this was a fault of the data augmentation tool. I found the issue by first giving the training 1 sample, then 10 samples, then 100, and so on, until I saw the failure at 10,000 and backtracked to the exact offending sample.
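Rather than bisecting the dataset, such samples can be caught up front by scanning every annotation for degenerate boxes before training. A minimal stdlib-only sketch, assuming VOC-style XML files as in the snippet above (the `Annotations/*.xml` layout is a hypothetical example, adjust to your dataset):

```python
import glob
import xml.etree.ElementTree as ET

def find_degenerate_boxes(xml_text):
    """Return (xmin, ymin, xmax, ymax) tuples with zero or negative extent."""
    root = ET.fromstring(xml_text)
    bad = []
    for box in root.iter("bndbox"):
        xmin = float(box.findtext("xmin"))
        ymin = float(box.findtext("ymin"))
        xmax = float(box.findtext("xmax"))
        ymax = float(box.findtext("ymax"))
        # Zero-width or zero-height boxes like the sample above make the
        # box regression targets degenerate and can blow up the loss.
        if xmax <= xmin or ymax <= ymin:
            bad.append((xmin, ymin, xmax, ymax))
    return bad

if __name__ == "__main__":
    # Hypothetical layout: one VOC-style XML file per image under Annotations/.
    for path in glob.glob("Annotations/*.xml"):
        with open(path) as f:
            bad = find_degenerate_boxes(f.read())
        if bad:
            print(path, bad)
```

Running this on the offending annotation reports the (0, 370, 0, 407) box immediately.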