Hello, I'm having some problems. RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED - Githubissues

tohinz / ConSinGAN

PyTorch implementation of "Improved Techniques for Training Single-Image GANs" (WACV-21)

MIT License

427 stars 70 forks

Hello, I'm having some problems. RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED #17

Open zhangkuncsdn opened 3 years ago

zhangkuncsdn commented 3 years ago

Training model (TrainedModels/pantheon/2021_02_22_15_28_21_generation_train_depth_3_lr_scale_0.1_act_lrelu_0.05)
Training model with the following parameters:
number of stages: 6
number of concurrently trained stages: 3
learning rate scaling: 0.1
non-linearity: lrelu
Training on image pyramid: [torch.Size([1, 3, 26, 42]), torch.Size([1, 3, 31, 51]), torch.Size([1, 3, 40, 66]), torch.Size([1, 3, 57, 94]), torch.Size([1, 3, 106, 175]), torch.Size([1, 3, 152, 250])]

stage [0/5]:: 0%| | 0/1000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "main_train.py", line 118, in <module>
    train(opt)
  File "G:\ConSinGAN\ConSinGAN\training_generation.py", line 48, in train
    fixed_noise, noise_amp, generator, d_curr = train_single_scale(d_curr, generator, reals, fixed_noise, noise_amp, opt, scale_num, writer)
  File "G:\ConSinGAN\ConSinGAN\training_generation.py", line 156, in train_single_scale
    gradient_penalty = functions.calc_gradient_penalty(netD, real, fake, opt.lambda_grad, opt.device)
  File "G:\ConSinGAN\ConSinGAN\functions.py", line 122, in calc_gradient_penalty
    create_graph=True, retain_graph=True, only_inputs=True)[0]
  File "D:\Anaconda3\envs\ConSinGAN\lib\site-packages\torch\autograd\__init__.py", line 149, in grad
    inputs, allow_unused)
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

tohinz commented 3 years ago

Hi, that looks more like a problem with your PyTorch installation. Are you sure you have the correct CUDA and cuDNN versions installed for your graphics card and PyTorch version?
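One way to answer that question is to print the versions PyTorch itself reports. The following is a minimal sketch, not part of ConSinGAN: the helper name collect_env_info is made up for illustration, and it only relies on standard PyTorch attributes (torch.__version__, torch.version.cuda, torch.backends.cudnn.version(), torch.cuda.is_available()).

```python
def collect_env_info():
    """Return a dict describing the installed PyTorch/CUDA/cuDNN combination.

    Hypothetical helper for illustration; it is not part of ConSinGAN.
    """
    info = {"torch": None, "cuda_build": None, "cudnn": None,
            "cuda_available": False, "gpu": None}
    try:
        import torch
    except (ImportError, OSError):
        # PyTorch is missing or its native libraries failed to load.
        return info
    info["torch"] = torch.__version__
    info["cuda_build"] = torch.version.cuda         # None for CPU-only builds
    info["cudnn"] = torch.backends.cudnn.version()  # None if cuDNN is unavailable
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        info["gpu"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in collect_env_info().items():
        print(f"{key}: {value}")
```

If cuda_build does not match a CUDA version supported by the installed GPU driver, or cuda_available is False on a machine with a GPU, reinstalling PyTorch with a matching CUDA toolkit is a reasonable first step.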

zhangkuncsdn commented 3 years ago

Hi, does the current ConSinGAN environment support PyTorch 1.7?

tohinz commented 3 years ago

I haven't tested it with PyTorch 1.7, but in general it should work (or at least I would expect a different error message from the one above). The error is thrown in torch.autograd.grad(), which is why I believe the problem is your environment and not the code itself. I would suggest running the code on CPU (use the flag --not_cuda) to see whether it works there or gives a more informative error message. I haven't tested it on CPU myself, so you might have to add .to(torch.device('cpu')) in some places if PyTorch raises errors about a GPU/CPU mismatch.
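To isolate the failing call outside the training script, a small smoke test along these lines may help. It is only a sketch under stated assumptions: the function name gradient_penalty_smoke_test is hypothetical, the single Conv2d layer is a stand-in for the real discriminator, and the penalty follows the usual WGAN-GP form rather than being copied from functions.py.

```python
try:
    import torch
    import torch.nn as nn
except ImportError:
    torch = None  # lets the function report "not installed" instead of crashing


def gradient_penalty_smoke_test(device_name="cpu"):
    """Run a tiny conv net through torch.autograd.grad on the given device.

    Returns the penalty value, or None if PyTorch is not installed.
    """
    if torch is None:
        return None
    device = torch.device(device_name)
    # Stand-in for the discriminator (assumption: not the ConSinGAN network).
    net = nn.Conv2d(3, 1, kernel_size=3, padding=1).to(device)
    # Same shape as the smallest image in the pyramid from the log above.
    x = torch.randn(1, 3, 26, 42, device=device, requires_grad=True)
    out = net(x)
    grads = torch.autograd.grad(outputs=out, inputs=x,
                                grad_outputs=torch.ones_like(out),
                                create_graph=True, retain_graph=True)[0]
    penalty = ((grads.norm(2, dim=1) - 1) ** 2).mean()
    penalty.backward()  # exercises the double-backward path as well
    return penalty.item()


if __name__ == "__main__":
    print("cpu:", gradient_penalty_smoke_test("cpu"))
```

If this passes with "cpu" but raises the same cuDNN error with "cuda", the installation rather than the ConSinGAN code is the likely culprit.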

zhangkuncsdn commented 3 years ago

Thank you very much. With the --not_cuda flag it runs. I have another question: if I want to train on a single-channel grayscale image, how should I modify the network?

tohinz commented 3 years ago

Just set --nc_im 1 and represent your image as shape (H x W x 1), i.e. 1 channel instead of 3 for RGB

zhangkuncsdn commented 3 years ago

I've set --nc_im 1, but I'm running into the following problem.

Training model (TrainedModels/07/2021_02_24_22_00_10_generation_train_depth_3_lr_scale_0.1_act_lrelu_0.05)
Training model with the following parameters:
number of stages: 6
number of concurrently trained stages: 3
learning rate scaling: 0.1
non-linearity: lrelu
Traceback (most recent call last):
  File "main_train.py", line 118, in <module>
    train(opt)
  File "G:\ConSinGAN\ConSinGAN\training_generation.py", line 23, in train
    real = functions.adjust_scales2image(real, opt)
  File "G:\ConSinGAN\ConSinGAN\functions.py", line 185, in adjust_scales2image
    real = imresize(real, opt.scale1, opt)
  File "G:\ConSinGAN\ConSinGAN\imresize.py", line 52, in imresize
    im = np2torch(im,opt)
  File "G:\ConSinGAN\ConSinGAN\imresize.py", line 26, in np2torch
    x = color.rgb2gray(x)
  File "D:\Anaconda3\envs\ConSinGAN\lib\site-packages\skimage\color\colorconv.py", line 799, in rgb2gray
    rgb = _prepare_colorarray(rgb[..., :3])
  File "D:\Anaconda3\envs\ConSinGAN\lib\site-packages\skimage\color\colorconv.py", line 152, in _prepare_colorarray
    raise ValueError(msg)
ValueError: the input array must be have a shape == (.., ..,[ ..,] 3)), got (164, 250, 1)

tohinz commented 3 years ago

You will have to change the code slightly to adapt to this. Another easy workaround is to convert your grayscale image to a "color image" with 3 channels, e.g. with OpenCV: cv2.cvtColor(gray_img, cv2.COLOR_GRAY2RGB)
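If OpenCV is not installed, the same workaround can be sketched with NumPy alone. This is an illustrative alternative, not code from the repository: the helper name gray_to_three_channels is hypothetical, and it simply copies the single channel three times so that the existing RGB code path sees an (H, W, 3) array.

```python
import numpy as np


def gray_to_three_channels(img):
    """Return an (H, W, 3) array from an (H, W) or (H, W, 1) grayscale array."""
    img = np.asarray(img)
    if img.ndim == 3 and img.shape[-1] == 1:
        img = img[..., 0]  # drop the singleton channel axis
    if img.ndim != 2:
        raise ValueError(f"expected a grayscale image, got shape {img.shape}")
    # Copy the single channel into R, G and B.
    return np.repeat(img[..., np.newaxis], 3, axis=-1)


gray = np.zeros((164, 250, 1), dtype=np.uint8)  # same shape as in the error above
rgb = gray_to_three_channels(gray)
print(rgb.shape)  # (164, 250, 3)
```

The converted image would then be saved and trained on with the default 3-channel settings, leaving --nc_im unchanged.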

zhangkuncsdn commented 3 years ago

I ran into some problems when I changed the code. Can you give me some advice?

tohinz commented 3 years ago

What are the problems?

FluppyBird commented 1 year ago

I had the same problem 3 days ago and, unexpectedly, got it running after installing with conda install pytorch==1.4.0 torchvision==0.5.0 cudatoolkit=10.1 -c pytorch. This is the same version of torch that SinGAN uses, so maybe you can try it. :)

LiJuanapple commented 2 months ago

I have run into the same problem. How did you resolve it in the end? Thank you very much for sharing and for your guidance.