dan326326 opened this issue 2 years ago
Hi, I can't diagnose the error from this one sentence. You should debug carefully: trace your input image through the network and find where the NaN first appears.
Thank you for your reply. Is there a required uniform size for the images?
It is recommended to use images larger than 512x512, but smaller ones are fine as long as they make up only a small portion of the dataset. The images don't need a fixed size because they are automatically cropped to 512x512.
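The automatic crop can be pictured as choosing a random 512x512 window inside each image. The sketch below is only an illustration, not the project's code: it computes the crop box for a given height and width, and the helper name `random_crop_box` is hypothetical. Because the thread doesn't say how images smaller than 512x512 are handled, the sketch raises an error for them instead of guessing at padding or resizing.

```python
import random

CROP = 512  # crop size mentioned in the thread


def random_crop_box(height, width, size=CROP, rng=random):
    """Return (top, left, bottom, right) of a random size x size window.

    Hypothetical helper for illustration only. Images smaller than the
    crop are rejected, since the original handling is not described.
    """
    if height < size or width < size:
        raise ValueError(f"image {height}x{width} is smaller than {size}x{size}")
    # randint is inclusive on both ends, so the window always fits.
    top = rng.randint(0, height - size)
    left = rng.randint(0, width - size)
    return top, left, top + size, left + size
```

For example, `random_crop_box(720, 1280)` returns a box that lies entirely inside a 720x1280 image; passing `rng=random.Random(seed)` makes the crop reproducible.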
One possible reason is that some of your images are not actually JPEG files despite having a .jpg extension, or that some images are simply corrupted. Modify your code to print the filename when an error occurs, and remove that image from the training set.
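A quick way to check the first possibility is to look at the file header: JPEG files begin with the bytes `FF D8 FF`, so a PNG (or an empty file) renamed to `.jpg` fails that test. This stdlib-only sketch lists such files under a dataset folder; the function names are hypothetical. It only inspects headers, so fully decoding each image with an image library would still be needed to catch truncated or otherwise corrupted files.

```python
from pathlib import Path

JPEG_MAGIC = b"\xff\xd8\xff"  # JPEG/JFIF/Exif files start with these bytes


def looks_like_jpeg(header):
    """True if `header` (the first bytes of a file) has the JPEG signature."""
    return header.startswith(JPEG_MAGIC)


def find_fake_jpegs(root):
    """Return sorted paths under `root` with a .jpg/.jpeg extension whose
    header is not JPEG (renamed PNGs, empty files, other formats)."""
    suspicious = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in {".jpg", ".jpeg"}:
            with path.open("rb") as f:
                header = f.read(len(JPEG_MAGIC))
            if not looks_like_jpeg(header):
                suspicious.append(path)
    return suspicious
```

Calling `find_fake_jpegs("path/to/train_images")` and printing the result gives a list of candidates to inspect or remove before training.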
Thank you very much. I located where the NaN is generated: the data becomes NaN after passing through the first convolution layer, conv1. What's going on here?
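The search described above, which runs the data stage by stage and stops at the first output containing NaN or Inf, can be sketched in plain Python. It also checks the input before the first stage, because a non-finite input would make conv1 look guilty. The stages here are stand-in functions, not the project's layers; in PyTorch the same check is usually attached to real modules with `nn.Module.register_forward_hook`, and `torch.autograd.set_detect_anomaly(True)` can help with the backward pass.

```python
import math


def all_finite(values):
    """True if every number in `values` is neither NaN nor +/-Inf."""
    return all(math.isfinite(v) for v in values)


def first_nonfinite_stage(stages, x):
    """Run `x` through `stages` (a list of (name, fn) pairs) in order.

    Returns "input" if `x` already contains NaN/Inf, the name of the
    first stage whose output does, or None if everything stays finite.
    """
    if not all_finite(x):
        return "input"
    for name, fn in stages:
        x = fn(x)
        if not all_finite(x):
            return name
    return None
```

If the input is finite but conv1's output is not, the next things to inspect are conv1's weights and bias, for example after the first few optimizer steps.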
That’s weird…
Could you post the image that is causing the errors?
Hello, thanks for your patient reply. This is the output of the run:
Epoch: [0/200] Iter:[0/1176], Time: 1100.00, lr: 0.005000, Loss: 0.691021
Epoch: [0/200] Iter:[10/1176], Time: 101.08, lr: 0.005000, Loss: nan
NaN or Inf found in input tensor.
Epoch: [0/200] Iter:[20/1176], Time: 53.51, lr: 0.005000, Loss: nan
NaN or Inf found in input tensor.
...
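In this log the loss is still finite at iteration 0 and already NaN by iteration 10. Following the earlier advice to print the filename when an error occurs, one option is to check the loss on every iteration and report the files in the first batch that produces a non-finite value. The helper below is a hypothetical sketch; it assumes the data loader has been modified to return each batch's file paths, and that the loss is passed as a plain float (e.g. `loss.item()` in PyTorch).

```python
import math


def nonfinite_loss_report(loss_value, epoch, iteration, filenames):
    """Return a message describing the batch if `loss_value` is NaN/Inf,
    or None if the loss is finite. `filenames` lists the batch's files
    (assumes the loader was changed to return them)."""
    if math.isfinite(loss_value):
        return None
    return (
        f"non-finite loss {loss_value} at epoch {epoch}, iter {iteration}; "
        f"batch files: {', '.join(filenames)}"
    )
```

Inside the training loop, something like `report = nonfinite_loss_report(loss.item(), epoch, i, batch_files)` followed by printing the report and stopping when it is not None pins down the first failing batch instead of logging NaN for the rest of the epoch.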
What I found is that the output of the convolution is huge:
tensor([[[[-7.9044e+31,  1.3967e+32, -1.7841e+32,  ...,  1.3038e+32, -3.7658e+32,  3.4734e+32],
          [-1.0282e+32, -8.2208e+31, -2.6121e+30,  ...,  1.0440e+32,  8.8422e+31,  7.1321e+30],
          [ 3.3423e+31,  1.6554e+32,  4.9708e+31,  ..., -2.4600e+32, -5.2235e+32,  2.3235e+32],
          ...,
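Values around 1e32 are still finite in float32, whose largest finite value is about 3.4e38, but squaring or multiplying them overflows to Inf, and once an Inf exists, operations such as Inf - Inf produce NaN. This plausibly explains how activations this large turn into NaN a few operations later, assuming the model runs in float32. The stdlib sketch below emulates float32 rounding with `struct` and uses a deliberately naive E[x²] - E[x]² variance as an example computation; it is not claimed to be how any particular layer computes its statistics.

```python
import math
import struct

FLT32_MAX = 3.4028234663852886e38  # largest finite float32


def to_float32(x):
    """Round a Python float to float32, turning overflow into +/-inf
    the way float32 arithmetic would."""
    try:
        return struct.unpack("<f", struct.pack("<f", x))[0]
    except OverflowError:  # struct refuses to pack out-of-range floats
        return math.copysign(math.inf, x)


def naive_float32_variance(values):
    """Population variance as E[x^2] - E[x]^2, rounding to float32 at
    every step (illustrative only)."""
    n = len(values)
    mean = 0.0
    mean_sq = 0.0
    for v in values:
        v = to_float32(v)
        mean = to_float32(mean + v / n)
        mean_sq = to_float32(mean_sq + to_float32(v * v) / n)
    return to_float32(mean_sq - to_float32(mean * mean))
```

With the first three activations from the printout, `naive_float32_variance([-7.9044e31, 1.3967e32, -1.7841e32])` gives NaN (both terms overflow to Inf), while ordinary inputs such as `[1.0, 2.0, 3.0]` give about 0.667. Since such large conv1 outputs are abnormal, the root cause is more likely huge inputs or weights than the arithmetic itself.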
Please upload that image file. I'll test it on my computer.
The same thing happens to me with the CASIA2 dataset. Any solution?
Thank you again for your answer. Now there is a new problem: the loss is NaN from the beginning of training. Do you know how to solve it?
Epoch: [1/200] Iter:[280/934], Time: 7.62, lr: 0.0049, Loss: nan NaN or Inf found in input tensor.
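When the loss becomes NaN early in training, two common mitigations are lowering the learning rate (0.005 in these logs) and clipping gradients by their global norm. Neither is confirmed as the fix for this project. The stdlib sketch below shows what global-norm clipping does with plain lists; in PyTorch the corresponding call is `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)`, placed between `loss.backward()` and `optimizer.step()`.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Scale a list of gradient vectors so their combined L2 norm is at
    most `max_norm`. Returns (clipped_grads, norm_before_clipping).

    If the norm is already NaN/Inf, clipping cannot repair it, so the
    gradients are returned unchanged; the caller should skip that step.
    """
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if not math.isfinite(total_norm) or total_norm <= max_norm:
        return [list(vec) for vec in grads], total_norm
    scale = max_norm / total_norm
    return [[g * scale for g in vec] for vec in grads], total_norm
```

Scaling all gradients by one common factor keeps the update direction and only limits its size, which is why global-norm clipping is usually preferred over clipping each value separately.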