Narua2010 closed this issue 2 years ago
I have the same question. I adjusted the batch size and the learning rate, but the NaN issue still occurred.
I debugged a bit. Training on the CPU runs through without problems; the error only appears on the GPU. There, the results are unusable from the very beginning, which makes me think that either CUDA is causing the problem or I am overlooking some setting that could fix it.
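One way to narrow this down (a hedged sketch, not something from this thread) is to run the same forward pass on CPU and on GPU and compare the results; `model` and `sample` here are placeholders for your own network and input batch:

```python
import torch

def compare_devices(model, sample, atol=1e-4):
    """Run one forward pass on CPU and GPU; return None if no GPU,
    False if the GPU output already contains NaNs, else whether the
    two outputs agree within `atol`."""
    model = model.eval()
    with torch.no_grad():
        cpu_out = model.to("cpu")(sample.to("cpu"))
        if not torch.cuda.is_available():
            return None  # no GPU to compare against
        gpu_out = model.to("cuda")(sample.to("cuda")).to("cpu")
    if torch.isnan(gpu_out).any():
        return False  # the GPU path already produces NaNs
    return bool(torch.allclose(cpu_out, gpu_out, atol=atol))
```

If this returns `False` on a freshly initialized model, the problem is in the CUDA path rather than in the training hyperparameters.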
I would be very happy if someone could help me with the settings.
I have updated all Conda dependencies, and training now runs through successfully on the GPU. In addition, I had to adjust a few places in the code, since those APIs were overhauled in the newer versions.
Thanks a lot! My GPU is an NVIDIA GeForce RTX 3050 Ti Laptop GPU. I changed CUDA from version 11 to 10.2 and downloaded the corresponding `torch` package, and the code now runs through without the NaN problem!
Hi, I have the same problem, can I fix it without lowering the CUDA version?
This is an interesting issue. So it only happened for CUDA 11? @SsnL @taesungp any thoughts?
IIRC, there were some issues with certain PyTorch versions and CUDA 11.0. Glad to know it works with 10.2. Another thing to try is the latest PyTorch with a supported CUDA version.
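To check which combination is actually in use, a small diagnostic like the following (a sketch that tolerates torch being absent) prints the PyTorch version, the CUDA version the wheel was built against, and the visible device, which makes a version mismatch easy to spot:

```python
def cuda_report():
    """Return a dict describing the installed torch/CUDA combination."""
    try:
        import torch
    except ImportError:
        return {"torch": None}  # torch not installed at all
    info = {
        "torch": torch.__version__,
        "built_for_cuda": torch.version.cuda,  # CUDA the wheel was compiled against
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["device"] = torch.cuda.get_device_name(0)
    return info

print(cuda_report())
```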
When I train with pix2pix, I always get NaN losses after some epochs. I have tested it with my own datasets as well as with the facades dataset, but I always get the same error.
After searching, I found that I should adjust the learning rate. After that, training ran longer, but after a while the same error occurred.
I really don't know what causes the NaNs in the G_GAN, G_L1, D_real and D_fake losses during training.
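One way to catch this early instead of training on silently is a NaN guard on the logged losses. A minimal sketch, using the loss names from the pix2pix log (in a real loop each value would come from `loss.item()` on the corresponding tensor):

```python
import math

def check_losses(losses, epoch):
    """Raise as soon as any logged loss value turns NaN."""
    for name, value in losses.items():
        if math.isnan(value):
            raise FloatingPointError(f"{name} became NaN at epoch {epoch}")

# Healthy step: passes silently.
check_losses({"G_GAN": 0.93, "G_L1": 12.4, "D_real": 0.61, "D_fake": 0.58}, epoch=3)
```

Stopping at the first NaN also tells you which of the four losses diverged first, which helps separate a generator problem from a discriminator problem.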