error using vl_nnconv, cuDNN error, bug with Turing GPU

duancaohui commented 5 years ago

I recently got a new computer with a Turing GPU (RTX 2080 Ti), so I set up CUDA, cuDNN, and MatConvNet on it. I followed the installation guide at http://www.vlfeat.org/matconvnet/install/, and everything seemed to go well:

System: Windows 10
CUDA: 10
cuDNN: cudnn-10.0-windows10-x64-v7.4.2.24
MATLAB: 2018a
MatConvNet: matconvnet-1.0-beta25

However, when I train my model using trainFn, an error occurs that appears to come from cuDNN:

vl_nnconv
vl::impl::dispatch_cudnn<C, CU>::operator(): ConvolutionForwardCudnn<dataType>::operator(): cuDNN error [cudnn:"\\matconvnet-1.0-beta25\\matlab\\src\\bits\\nnconv_cudnn.cu":209
(CUDNN_STATUS_EXECUTION_FAILED)]

This is because MATLAB does not natively support Turing, so there may be issues for now. These answers suggest ways to resolve it:
[1] https://ww2.mathworks.cn/matlabcentral/answers/439616-does-matlab-2018b-support-nvidia-geforce-2080-ti-rtx-for-creating-training-implementing-deep-learnin
[2] https://ww2.mathworks.cn/matlabcentral/answers/432027-matlab-cuda-10

This is a known bug with Turing GPUs and MatConvNet that can be worked around by running a simple function and ignoring the error:

try
    nnet.internal.cnngpu.reluForward(1);
catch ME
end

However, this workaround only resolves the error in my test code, not in my training. Even after I added this simple function to my trainFn, the error still occurred!

whisperrrr commented 5 years ago

Hey, the same error occurs when I use vl_nnconv with the GPU. But the URL you posted to solve this error isn't available right now.

duancaohui commented 5 years ago

The URL is available; you can copy it and open it in your browser:

Free-Cloud commented 5 years ago

My GPU is an RTX 2070, and I fixed this error by using CUDA 9.0 and updating it to Patch 4.

MumuChenGunGun commented 5 years ago

Has anyone fixed this error?

whisperrrr commented 5 years ago

The URL is available; you can copy it and open it in your browser:

Thanks. I replaced CUDA 10.1 with CUDA 9.2, and it works well.
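For anyone trying the same toolkit swap: after installing a different CUDA version, MatConvNet's MEX files presumably need to be rebuilt against it. A minimal sketch of that rebuild follows. The two paths are assumptions for illustration only (a default-style Windows install location and a hypothetical cuDNN folder), not the setup used in this thread; the option names are those of vl_compilenn.

```matlab
% Hypothetical install locations; replace them with the paths on your machine.
cudaRoot  = 'C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.2';
cudnnRoot = 'C:\local\cudnn-9.2';  % hypothetical folder holding the matching cuDNN build

if exist('vl_compilenn', 'file') == 2
    % Rebuild the MatConvNet MEX files against the selected CUDA and cuDNN.
    vl_compilenn('enableGpu', true, ...
                 'cudaRoot', cudaRoot, ...
                 'cudaMethod', 'nvcc', ...
                 'enableCudnn', true, ...
                 'cudnnRoot', cudnnRoot);
else
    disp('vl_compilenn not found; add the matconvnet matlab folder to the path first.');
end
```

The cuDNN build has to match the CUDA version being compiled against, so both roots should be changed together.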

yuanlong-o commented 5 years ago

Hi, I have an RTX 2080 with CUDA 9.2, but I still get the vl_nnconv error. Could you please share your driver information?

whisperrrr commented 5 years ago

CPU: Intel(R) Core(TM) i9-7920 CPU @ 2.90 GHz
GPU: GeForce RTX 2080 Ti
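To make setups easier to compare, the GPU and CUDA versions that MATLAB itself sees could be printed with something like the sketch below. It assumes Parallel Computing Toolbox is installed; note that DriverVersion here is the CUDA driver version reported by gpuDevice, not the NVIDIA display driver number.

```matlab
% Print the GPU and CUDA versions as reported by MATLAB.
rel = version('-release');
fprintf('MATLAB release: %s\n', rel);
try
    g = gpuDevice;  % currently selected GPU (requires Parallel Computing Toolbox)
    fprintf('GPU: %s (compute capability %s)\n', g.Name, g.ComputeCapability);
    fprintf('CUDA driver version: %g\n', g.DriverVersion);
    fprintf('CUDA toolkit version used by MATLAB: %g\n', g.ToolkitVersion);
catch ME
    % No supported GPU, or the toolbox is missing.
    fprintf('Could not query the GPU: %s\n', ME.message);
end
```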

At 2019-10-27 07:21:23, "yuanlong-o" notifications@github.com wrote:

Hi, I have an RTX 2080 with CUDA 9.2, but I still get the vl_nnconv error. Could you please share your driver information?


duancaohui commented 5 years ago

Hi, I have fixed this error with CUDA 10.0 and an RTX 2080 Ti. Please add the following code before your training or testing:

try
    test1 = vl_nnconv(gpuArray(zeros(6,6,1,2)), gpuArray(ones(3,3,1,2)), gpuArray(ones(2,1)), 'CuDNN');
catch ME
end

try
    test1 = vl_nnbnorm(gpuArray(zeros(6,6,2,2)), gpuArray(ones(2,1)), gpuArray(ones(2,1)));
catch ME
end

try
    test1 = vl_nnpool(gpuArray(zeros(6,6,1,2)), 2, 'pad', 0, 'stride', 2, 'method', 'max');
catch ME
end

try
    test1 = vl_nnconvt(gpuArray(zeros(6,6,16,2)), gpuArray(ones(3,3,16,16)), gpuArray(ones(16,1)), 'crop', [0,1,0,1], 'upsample', 2, 'numGroups', 1, 'CuDNN');
catch ME
end

This code simply runs each function once and ignores any errors; after that, everything works!
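The same warm-up idea can be written as a loop that reports which calls raised an error instead of silently discarding them. This is only a sketch built from the calls in the comment above; the names warmups and failed are made up for illustration, and it has not been checked on other GPU and CUDA combinations.

```matlab
% Warm-up calls taken from the comment above. Each is wrapped in an anonymous
% function so that nothing touches the GPU until the loop below runs.
warmups = cell(1, 4);
warmups{1} = @() vl_nnconv(gpuArray(zeros(6,6,1,2)), gpuArray(ones(3,3,1,2)), ...
                           gpuArray(ones(2,1)), 'CuDNN');
warmups{2} = @() vl_nnbnorm(gpuArray(zeros(6,6,2,2)), gpuArray(ones(2,1)), ...
                            gpuArray(ones(2,1)));
warmups{3} = @() vl_nnpool(gpuArray(zeros(6,6,1,2)), 2, 'pad', 0, ...
                           'stride', 2, 'method', 'max');
warmups{4} = @() vl_nnconvt(gpuArray(zeros(6,6,16,2)), gpuArray(ones(3,3,16,16)), ...
                            gpuArray(ones(16,1)), 'crop', [0,1,0,1], ...
                            'upsample', 2, 'numGroups', 1, 'CuDNN');

failed = false(1, numel(warmups));  % records which warm-up calls raised an error
for k = 1:numel(warmups)
    try
        warmups{k}();
    catch ME
        failed(k) = true;
        fprintf('Warm-up call %d raised: %s\n', k, ME.message);
    end
end
```

Running this once per MATLAB session, before the first real call into MatConvNet, mirrors where the original snippet is meant to go.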

AileenSengupta commented 2 years ago

My GPU is an RTX 2070, and I fixed this error by using CUDA 9.0 and updating it to Patch 4.

Could you please help me with the MATLAB code needed to get rid of this error? I am using CUDA 10.

AileenSengupta commented 2 years ago

Hi, I am still struggling with this error in MATLAB:

Error using DAGNetwork/classify (line 193)
Failed to initialize the cuDNN handle. Return code was CUDNN_STATUS_NOT_INITIALIZED.

I am using a GeForce GTX 1080 Ti and CUDA 10.0, but after I tried the approach of ignoring the exceptions, I still get the same error. Any help is appreciated.

vlfeat / matconvnet

error using vl_nnconv, cuDNN error, bug with Turing GPU #1206