sefibk / KernelGAN

UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize #61

Open RaySunWHUT opened 3 years ago

RaySunWHUT commented 3 years ago

Hi author, I ran into this error even though I followed the enviroment.yml, and it is driving me crazy! I have no idea why it happens. The problem appears when I run ZSSR to get the restored image.

tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node layer_1}} = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](gradients/layer_1_grad/Conv2DBackpropFilter-0-TransposeNHWCToNCHW-LayoutOptimizer, filter_0/read)]]
[[{{node add/_17}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_341_add", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

sefibk commented 3 years ago

Did you try following the error message? Was cuDNN initialized properly?
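
Following the error message means reading the log lines printed before the traceback. As a rough sketch of that step, the helper below scans captured log text for two cuDNN-related messages that TensorFlow 1.x is known to print. The function name `find_cudnn_warnings` and the sample log lines are illustrative assumptions, not output from this thread.

```python
import re

# Message printed when the cuDNN found at runtime differs from the one
# TensorFlow was compiled against.
MISMATCH = re.compile(
    r"Loaded runtime CuDNN library: (?P<runtime>\d+(?:\.\d+)*) "
    r"but source was compiled with: (?P<compiled>\d+(?:\.\d+)*)"
)

# Message printed when a cuDNN handle cannot be created
# (capitalisation varies between TensorFlow releases).
HANDLE = re.compile(
    r"could not create cudnn handle: (?P<status>CUDNN_STATUS_\w+)",
    re.IGNORECASE,
)


def find_cudnn_warnings(log_text):
    """Return a list of tuples describing cuDNN warnings found in log_text."""
    findings = []
    for line in log_text.splitlines():
        match = MISMATCH.search(line)
        if match:
            findings.append(
                ("version_mismatch", match.group("runtime"), match.group("compiled"))
            )
            continue
        match = HANDLE.search(line)
        if match:
            findings.append(("handle_failure", match.group("status")))
    return findings


if __name__ == "__main__":
    # Hypothetical log excerpt, for illustration only.
    sample = (
        "E cuda_dnn.cc] Loaded runtime CuDNN library: 7.1.4 but source was "
        "compiled with: 7.2.1.  CuDNN library major and minor version needs to match\n"
        "E cuda_dnn.cc] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR\n"
    )
    for finding in find_cudnn_warnings(sample):
        print(finding)
```

A version mismatch points to the installed cuDNN, while a handle failure with no mismatch is often reported when GPU memory is already exhausted or held by another process.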

RaySunWHUT commented 3 years ago

> Did you try following the error message? Was cuDNN initialized properly?

Hi Mr. sefibk,

Thanks for your reply (you are one of the fastest-replying coders I've ever met).

I think I should clarify the details of my problem:

I set up the environment from the "enviroment.yml" file you supplied, and the code estimates the kernel correctly.

But if I run the code with "--SR" in the terminal, for example "python ./train.py -i="./test_images" --SR", this error occurs.

However, I found some answers on the Internet saying that the error has nothing to do with cuDNN initialization and is instead caused by changes in the TensorFlow API. Since the error occurs in the ZSSR code, I agree with them.

Their solution is: image

I tried this, but it did not work.

Since I'm not familiar with TensorFlow, I may not have modified the code in the right place.

I think everyone who runs the code with the "enviroment.yml" file you supplied will hit this problem. Because I'm not familiar with the TensorFlow framework, I would be very grateful if you could solve this issue.

Thank you again.
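
The screenshot of the suggested fix is not preserved, so the following is not necessarily the change the poster tried. One workaround often suggested for this error on TensorFlow 1.x is to enable GPU memory growth, and the placement question above matters for it: the option only takes effect if the config is passed to the session that actually runs the graph. A minimal sketch, assuming TensorFlow 1.x is installed (the function name `make_growth_session` is hypothetical):

```python
def make_growth_session():
    """Create a TF 1.x session that allocates GPU memory on demand.

    Assumes TensorFlow 1.x; in TensorFlow 2.x the same classes live
    under tf.compat.v1.
    """
    import tensorflow as tf  # imported lazily so the module loads without TF

    config = tf.ConfigProto()
    # Grow GPU memory as needed instead of reserving nearly all of it up front.
    config.gpu_options.allow_growth = True
    # The config must be handed to the session that runs the model;
    # setting it on a config object that is never used has no effect.
    return tf.Session(config=config)
```

If the code being patched already builds its own session, the option has to be set on the config passed to that session rather than on a second, unused one.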

sefibk commented 3 years ago

Hi, thanks for the kind words. Just to clarify:

  1. This is not an issue with this repo but a pure cuDNN problem. Please see this issue
  2. ZSSR is not my code. It is a different work with a separate repository. I chose it to do the SR once my work estimates the kernel. In fact, my repo is in PyTorch and doesn't use TensorFlow at all.

In addition, and sorry for not taking responsibility, the YAML file lists only the necessary pip packages and does not verify that the rest of the environment is set up correctly.

sauravsolanki commented 3 years ago

It is a version conflict, and this solved my problem here. Try to match the versions.
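
Matching versions here means that the installed CUDA and cuDNN must agree with what the TensorFlow binary was built against. The sketch below checks a version triple against a small table. The table is an assumed excerpt of TensorFlow's published tested build configurations for Linux GPU builds and should be verified against the official install guide; `check_versions` and the other names are hypothetical.

```python
# Assumed excerpt of TensorFlow's tested GPU build configurations (Linux).
# Verify against the official install guide before relying on it.
TESTED_GPU_BUILDS = {
    "1.12": {"cuda": "9", "cudnn": "7"},
    "1.13": {"cuda": "10.0", "cudnn": "7.4"},
    "1.14": {"cuda": "10.0", "cudnn": "7.4"},
    "1.15": {"cuda": "10.0", "cudnn": "7.4"},
}


def _cuda_ok(installed, required):
    """CUDA must match the required version component by component."""
    want = required.split(".")
    return installed.split(".")[: len(want)] == want


def _cudnn_ok(installed, required):
    """cuDNN needs the same major version and an equal or higher minor one."""
    have = [int(part) for part in installed.split(".")] + [0]
    want = [int(part) for part in required.split(".")] + [0]
    return have[0] == want[0] and have[1] >= want[1]


def check_versions(tf_version, cuda_version, cudnn_version):
    """Return a list of human-readable problems; an empty list means a match."""
    key = ".".join(tf_version.split(".")[:2])
    expected = TESTED_GPU_BUILDS.get(key)
    if expected is None:
        return ["no table entry for TensorFlow %s" % key]
    problems = []
    if not _cuda_ok(cuda_version, expected["cuda"]):
        problems.append(
            "CUDA %s installed, but TensorFlow %s expects %s"
            % (cuda_version, key, expected["cuda"])
        )
    if not _cudnn_ok(cudnn_version, expected["cudnn"]):
        problems.append(
            "cuDNN %s installed, but TensorFlow %s expects %s or a newer minor version"
            % (cudnn_version, key, expected["cudnn"])
        )
    return problems


if __name__ == "__main__":
    # Hypothetical installation: TF 1.13.1 with CUDA 10.0 and an older cuDNN.
    for problem in check_versions("1.13.1", "10.0", "7.0.5"):
        print(problem)
```

The installed versions themselves still have to be read from the system, for example from the conda package list or the CUDA toolkit, which this sketch does not attempt.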

RaySunWHUT commented 3 years ago

I tried matching the environment versions before, but it did not seem to work. I now use another repository, https://github.com/RomanovIgnat/KernelGAN, which implements ZSSR in PyTorch.

The environment .yml is as follows:

environment.zip