
mjkwon2021 / CAT-Net

Official code for CAT-Net: Compression Artifact Tracing Network. Image manipulation detection and localization.


NaN #10

Open dan326326 opened 2 years ago

dan326326 commented 2 years ago

Thank you again for your answer. Now there is a new problem: the loss is NaN at the beginning of training. Do you know how to solve this problem?

Epoch: [1/200] Iter:[280/934], Time: 7.62, lr: 0.0049, Loss: nan
NaN or Inf found in input tensor.

CauchyComplete commented 2 years ago

Hi, I cannot infer the cause from this one line. You should debug carefully: trace your input images through the pipeline and find what produces the NaN.

dan326326 commented 2 years ago

Thank you for your reply. Is there a uniform size requirement for the images?

CauchyComplete commented 2 years ago

It is recommended to use images larger than 512x512, but smaller ones are okay as long as they do not make up a large portion of the dataset. You don’t need fixed-size images because they will be cropped automatically to 512x512.

CauchyComplete commented 2 years ago

One possible reason is that some of your images are not actually JPEG files even though they have a .jpg extension. Others might simply be corrupted. Modify your code to print the filename when an error occurs, and remove that image from the training set.
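As a first pass before touching the training code, the dataset folder can be scanned for files whose bytes do not look like JPEG. The following is a minimal sketch using only the Python standard library. The helper names `check_jpeg` and `find_suspect_files` are hypothetical and not part of CAT-Net. It only checks the start-of-image and end-of-image markers, so it is a heuristic: a file reported as "no_eoi" may still decode (some writers append trailing bytes), and a file that passes can still be damaged in the middle.

```python
import os

JPEG_SOI = b"\xff\xd8\xff"  # start-of-image marker plus the 0xFF of the next marker
JPEG_EOI = b"\xff\xd9"      # end-of-image marker


def check_jpeg(path):
    """Return "ok", "not_jpeg", or "no_eoi" based on the first and last bytes."""
    with open(path, "rb") as f:
        head = f.read(3)
        f.seek(0, os.SEEK_END)
        size = f.tell()
        tail = b""
        if size >= 2:
            f.seek(-2, os.SEEK_END)
            tail = f.read(2)
    if head != JPEG_SOI:
        return "not_jpeg"  # e.g. a PNG or BMP saved with a .jpg extension
    if tail != JPEG_EOI:
        return "no_eoi"    # possibly truncated; worth inspecting by hand
    return "ok"


def find_suspect_files(root, exts=(".jpg", ".jpeg")):
    """Walk `root` and return (path, status) for every file that is not "ok"."""
    suspects = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.lower().endswith(exts):
                continue
            path = os.path.join(dirpath, name)
            try:
                status = check_jpeg(path)
            except OSError:
                status = "unreadable"
            if status != "ok":
                suspects.append((path, status))
    return suspects
```

Calling `find_suspect_files` on the training image directory and printing each returned pair gives a list of candidates to remove or re-encode before training again.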

dan326326 commented 2 years ago

Thank you very much indeed. I located where the NaN is generated: the data becomes NaN after passing through the first convolution layer, conv1. What's going on here?

CauchyComplete commented 2 years ago

That’s weird…

CauchyComplete commented 2 years ago

You may post the image that is causing errors.

dan326326 commented 2 years ago

Hello, thank you for your patient reply. This is the running result:
Epoch: [0/200] Iter:[0/1176], Time: 1100.00, lr: 0.005000, Loss: 0.691021
Epoch: [0/200] Iter:[10/1176], Time: 101.08, lr: 0.005000, Loss: nan
NaN or Inf found in input tensor.
Epoch: [0/200] Iter:[20/1176], Time: 53.51, lr: 0.005000, Loss: nan
NaN or Inf found in input tensor.
...
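To narrow down which batches to inspect, the log can be scanned for the first line whose loss is no longer finite. This is a minimal sketch that assumes the log format shown above. `first_bad_iteration` is a hypothetical helper, not part of CAT-Net.

```python
import math
import re

# Matches lines such as:
# "Epoch: [0/200] Iter:[10/1176], Time: 101.08, lr: 0.005000, Loss: nan"
LOG_LINE = re.compile(r"Epoch: \[(\d+)/\d+\] Iter:\[(\d+)/\d+\].*?Loss: (\S+)")


def first_bad_iteration(lines):
    """Return (epoch, iteration) of the first non-finite loss, or None."""
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue  # skip warnings and other non-matching lines
        try:
            loss = float(match.group(3))
        except ValueError:
            continue  # loss field was not a number
        if not math.isfinite(loss):
            return int(match.group(1)), int(match.group(2))
    return None
```

Because the loss is only printed every few iterations, the offending batch lies somewhere between the last finite line and the first non-finite one. In the log above that means between iterations 0 and 10 of epoch 0.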

dan326326 commented 2 years ago

I found that the output of the convolution is huge:

tensor([[[[-7.9044e+31,  1.3967e+32, -1.7841e+32, ...,  1.3038e+32, -3.7658e+32,  3.4734e+32],
          [-1.0282e+32, -8.2208e+31, -2.6121e+30, ...,  1.0440e+32,  8.8422e+31,  7.1321e+30],
          [ 3.3423e+31,  1.6554e+32,  4.9708e+31, ..., -2.4600e+32, -5.2235e+32,  2.3235e+32],
          ...,

CauchyComplete commented 2 years ago

Please upload that image file. I'll test it on my computer.

FathUMinUllah3797 commented 11 months ago

[Screenshot 2023-08-16 182710]

I have the same problem with the CASIA2 dataset. Any solution?