owenvt1 opened 4 years ago
Solved by reducing the learning rate to 0.00001
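For reference, the fix above amounts to passing a smaller `lr` when constructing the optimizer. A minimal sketch, assuming a PyTorch setup (the toy model and the SGD optimizer here are assumptions; only the 0.00001 value comes from this thread):

```python
import torch

# Toy model standing in for the detector (hypothetical; only the
# 1e-5 learning rate comes from this thread).
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-5, momentum=0.9)

# Sanity check: the learning rate actually configured.
print(optimizer.param_groups[0]["lr"])  # 1e-05
```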
Late to the party, but this is a problem with the new torchvision 0.5. It is hard to predict (and it sometimes surfaces in places you won't think to look), but it produces NaNs during training (or even inference). In this case it gives nan/inf (as you described) even when just training on VOC.
The solution is to downgrade torchvision (I am now using 0.2 without problems).
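If you go this route, the downgrade is a one-liner (the 0.2 version comes from the comment above; the exact pip pin syntax is an assumption about your environment):

```shell
# Downgrade torchvision to the 0.2 series (version from this thread).
pip install "torchvision==0.2.*"
```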
@TheRevanchist I got the same issue; I have already reduced the learning rate and downgraded torchvision. Which version of torch are you using?
I found this was an issue with one of the data samples' annotations:
<bndbox>
<xmin>0</xmin>
<ymin>370</ymin>
<xmax>0</xmax>
<ymax>407</ymax>
</bndbox>
xmin and xmax have the same value, which doesn't make sense; this was a fault of the data augmentation tool. I found the issue by first feeding the training 1 sample, then 10 samples, then 100, and so on, until I saw the failure at 10,000 and backtracked to find the exact offending sample.
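A quicker way to find such samples than bisecting the dataset is to scan every annotation for degenerate boxes up front. A minimal sketch using only the standard library (the flat directory of `*.xml` files and the VOC-style `bndbox` tag names are assumptions about your dataset layout); the last lines also illustrate why a zero-width box is dangerous: log-based box-regression encodings take the log of the box width, which blows up at width 0:

```python
import glob
import math
import os
import xml.etree.ElementTree as ET

def find_degenerate_boxes(annotation_dir):
    """Yield (path, xmin, ymin, xmax, ymax) for boxes with no area."""
    for path in sorted(glob.glob(os.path.join(annotation_dir, "*.xml"))):
        root = ET.parse(path).getroot()
        for box in root.iter("bndbox"):
            xmin = float(box.findtext("xmin"))
            ymin = float(box.findtext("ymin"))
            xmax = float(box.findtext("xmax"))
            ymax = float(box.findtext("ymax"))
            if xmax <= xmin or ymax <= ymin:
                yield path, xmin, ymin, xmax, ymax

# Why such a box poisons training: encoding a zero-width box with a
# log-based regression target yields -inf, which then propagates as
# nan/inf through the loss.
width = 0.0  # xmax - xmin for the sample quoted above
try:
    target = math.log(width)   # e.g. an SSD-style width target, log(w)
except ValueError:
    target = float("-inf")     # math.log(0) raises; numpy.log(0) returns -inf
print(target)  # -inf
```

Running this over the annotation directory before training flags every offending file in one pass instead of requiring a 1/10/100/... search.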
This happens when running the examples for training on VOC, as well as Open Images. My goal is to train on Open Images.