G_GAN,G_L1,D_real,D_fake are all 'NaN' after a while

junyanz / pytorch-CycleGAN-and-pix2pix

Image-to-Image Translation in PyTorch

G_GAN,G_L1,D_real,D_fake are all 'NaN' after a while #1422

Closed Narua2010 closed 2 years ago

Narua2010 commented 2 years ago

When I train with pix2pix, the losses always become "NaN" after some epochs. I have tested this with my own datasets as well as with the facades dataset, and I always get the same error.

After searching for it, I found that I should adjust the learning rate. After that, training ran longer, but after a while the same error occurred.

I really don't know what causes the G_GAN, G_L1, D_real and D_fake losses to become NaN during training.
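To pin down the iteration at which the losses first go bad, a small guard can be run on the loss values at every step. This is a minimal sketch, not part of the repository: `find_nonfinite` is a hypothetical helper, and it assumes the losses are available as a name-to-float mapping (for example, what the training script prints for G_GAN, G_L1, D_real and D_fake).

```python
import math


def find_nonfinite(losses):
    """Return the names of all losses that are NaN or +/-inf.

    `losses` is assumed to be a mapping from loss name to a Python float.
    """
    return [name for name, value in losses.items() if not math.isfinite(value)]


# Example with made-up values: G_L1 has become NaN, D_fake has overflowed.
example = {"G_GAN": 0.71, "G_L1": float("nan"), "D_real": 0.52, "D_fake": float("inf")}
bad = find_nonfinite(example)
if bad:
    print("non-finite losses:", ", ".join(bad))
```

Stopping training as soon as this list is non-empty makes it easier to inspect the batch and the weights that produced the first NaN, rather than noticing it epochs later.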

LuuuXG commented 2 years ago

Same problem here. I adjusted the batch size and learning rate, but the NaN issue still occurred.

Narua2010 commented 2 years ago

I debugged a bit. When I train on the CPU, it runs through without problems; the error appears only on the GPU. On the GPU, however, the results are unusable from the very beginning, which makes me think that either CUDA is causing the problem or I'm overlooking some setting that could fix it.

I would be very happy if someone could help me with the settings.
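One way to test the CPU-versus-GPU observation in isolation is to run the same layer with the same weights and input on both devices and compare the outputs. The sketch below is an assumption-laden illustration: `max_cpu_gpu_difference` is a hypothetical helper, the single `Conv2d` layer is a stand-in and not the pix2pix generator, and the function returns `None` when PyTorch or a CUDA device is not available.

```python
import copy


def max_cpu_gpu_difference(seed=0):
    """Largest absolute CPU/GPU output difference for one conv layer.

    Returns None if PyTorch is not installed or no CUDA device is usable.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None

    torch.manual_seed(seed)
    # Stand-in layer (hypothetical choice), not the actual pix2pix network.
    layer = torch.nn.Conv2d(3, 8, kernel_size=4, stride=2, padding=1)
    x = torch.randn(1, 3, 64, 64)

    with torch.no_grad():
        out_cpu = layer(x)
        # Copy the layer so the CPU and GPU versions share identical weights.
        out_gpu = copy.deepcopy(layer).cuda()(x.cuda()).cpu()

    return (out_cpu - out_gpu).abs().max().item()


diff = max_cpu_gpu_difference()
print("no CUDA device to compare" if diff is None else f"max |cpu - gpu| = {diff}")
```

A small nonzero difference is normal, since the two devices use different floating-point kernels. A NaN or a very large value would suggest that the installed PyTorch/CUDA build is not computing correctly on this GPU.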

Narua2010 commented 2 years ago

I have updated all Conda dependencies, and now training runs through successfully on the GPU. I also had to adjust a few places in the code to match changes in the newer versions.

LuuuXG commented 2 years ago

Thanks a lot! My GPU is an NVIDIA GeForce RTX 3050 Ti Laptop GPU. I changed CUDA from version 11 to 10.2 and downloaded the corresponding 'torch' package, and now the code runs through without the NaN problem!

wangyeheng123 commented 2 years ago

Hi, I have the same problem, can I fix it without lowering the CUDA version?

junyanz commented 2 years ago

This is an interesting issue. So it only happened for CUDA 11? @SsnL @taesungp any thoughts?

ssnl commented 2 years ago

IIRC, there were some issues with certain PyTorch versions and CUDA 11.0. Glad to know it works for 10.2. Another thing to try is to use the latest PyTorch with a supported CUDA version.
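Since the fixes reported in this thread all come down to the PyTorch/CUDA combination, it helps to include the exact versions when reporting the problem. A minimal sketch for collecting them follows; `describe_environment` is a hypothetical helper name, and it returns `{"torch": None}` if PyTorch is not installed.

```python
def describe_environment():
    """Collect the PyTorch version, the CUDA version it was built with, and GPU details."""
    try:
        import torch
    except ImportError:
        return {"torch": None}

    info = {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["device"] = torch.cuda.get_device_name(0)
        # Compute capability of the GPU, e.g. (major, minor).
        info["capability"] = torch.cuda.get_device_capability(0)
        # GPU architectures this PyTorch build was compiled for.
        info["arch_list"] = torch.cuda.get_arch_list()
    return info


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

If the GPU's compute capability does not correspond to any entry in the architecture list, the installed build may not properly support that GPU, which is one thing worth ruling out before tuning the learning rate or batch size.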