Error: RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

souro commented 3 years ago

I am running the following command:

python inference.py --config yelp_config.json --checkpoint working_dir/model.40.ckpt

Getting the below error:

opout expects num_layers greater than 1, but got dropout=0.2 and num_layers=1
  "num_layers={}".format(dropout, num_layers))
2021-05-15 13:29:46,985 - INFO - MODEL HAS 9181445 params
Load from working_dir/model.40.ckpt sucessful!
Traceback (most recent call last):
  File "inference.py", line 103, in <module>
    model = model.cuda()
  File "/lnet/spec/work/people/mukherjee/research/venvs/env_del_ret_gen/lib/python3.6/site-packages/torch/nn/modules/module.py", line 265, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/lnet/spec/work/people/mukherjee/research/venvs/env_del_ret_gen/lib/python3.6/site-packages/torch/nn/modules/module.py", line 193, in _apply
    module._apply(fn)
  File "/lnet/spec/work/people/mukherjee/research/venvs/env_del_ret_gen/lib/python3.6/site-packages/torch/nn/modules/module.py", line 193, in _apply
    module._apply(fn)
  File "/lnet/spec/work/people/mukherjee/research/venvs/env_del_ret_gen/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 127, in _apply
    self.flatten_parameters()
  File "/lnet/spec/work/people/mukherjee/research/venvs/env_del_ret_gen/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 123, in flatten_parameters
    self.batch_first, bool(self.bidirectional))
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

I have no idea why this error occurs, because my other Python project runs on the GPU without any problems. Please let me know if you can figure out something from this. Thank you.

rpryzant commented 3 years ago

Hmm yeah it seems like this is a GPU error. Can you give me the output of nvidia-smi? What versions of cuda & pytorch are you using?

souro commented 3 years ago

CUDA version details: Cuda compilation tools, release 10.1, V10.1.105
pytorch version details: 1.1.0

Note: I installed only the packages from the requirements.txt you provided.

rpryzant commented 3 years ago

Hmm I wasn't able to reproduce this error. What is your GPU?

Can you give me the output of these commands?

nvidia-smi
python -c 'import torch; print(torch.cuda.is_available()); print(torch.__version__)'

I'd also try upgrading your pytorch beyond what's in the requirements.txt.
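For anyone hitting the same error, the two commands above can be extended into a slightly fuller report. This is only a sketch, not something from the repo: the helper name `collect_gpu_env` is made up for illustration, and it uses standard PyTorch attributes (`torch.version.cuda`, `torch.backends.cudnn.version()`, `torch.cuda.get_device_name`, `torch.cuda.get_device_capability`). It returns early if PyTorch is not installed.

```python
def collect_gpu_env():
    """Return a dict describing the PyTorch / CUDA setup visible to Python.

    Hypothetical helper for illustration; not part of delete_retrieve_generate.
    """
    info = {"torch": None}
    try:
        import torch
    except ImportError:
        return info  # PyTorch is not installed in this environment

    info["torch"] = torch.__version__
    # CUDA version the installed PyTorch binary was built against
    # (None for CPU-only builds). This can differ from what nvcc or
    # nvidia-smi report for the system.
    info["built_with_cuda"] = torch.version.cuda
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        info["cudnn"] = torch.backends.cudnn.version()
        info["gpu"] = torch.cuda.get_device_name(0)
        info["compute_capability"] = torch.cuda.get_device_capability(0)
    return info


if __name__ == "__main__":
    for key, value in collect_gpu_env().items():
        print("{}: {}".format(key, value))
```

Pasting this output together with nvidia-smi makes it easier to see whether the PyTorch build and the GPU are a plausible match.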

wasedaward commented 3 years ago

I have the same problem. The output of the commands

nvidia-smi
python -c 'import torch; print(torch.cuda.is_available()); print(torch.__version__)'

is:

Sat Jun 5 20:14:59 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.102.04   Driver Version: 450.102.04   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:01:00.0 Off |                  N/A |
| 24%   34C    P8    17W / 250W |     22MiB / 11018MiB |      0%      Default |
|                               |                      |                  N/A |
+-----------------------------------------------------------------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1183      G   /usr/lib/xorg/Xorg                  9MiB |
|    0   N/A  N/A      1694      G   /usr/bin/gnome-shell                8MiB |
+-----------------------------------------------------------------------------+

True
1.1.0

wasedaward commented 3 years ago

Maybe this is because the installed pytorch version is 1.1.0, which is compatible with cudatoolkit=9.0/10.0, while the CUDA version on my machine is 10.2?
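One rough way to check a hypothesis like this is to compare the toolkit version that `nvcc --version` prints with the CUDA version the PyTorch binary was built against (`torch.version.cuda`). The sketch below uses only string handling, so it runs without a GPU. Both helper names are hypothetical, and "same major version" is just an assumed heuristic, not an official compatibility rule. Prebuilt PyTorch packages ship their own CUDA runtime, so the more important question is usually whether that bundled CUDA version supports the GPU: Turing cards such as the RTX 2080 shown above need CUDA 10.0 or newer.

```python
import re


def parse_nvcc_release(nvcc_output):
    """Extract the 'major.minor' toolkit version from `nvcc --version` text."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return "{}.{}".format(match.group(1), match.group(2))


def same_cuda_major(version_a, version_b):
    """True when both 'major.minor' strings share the same major version.

    Assumed heuristic for illustration only.
    """
    if not version_a or not version_b:
        return False
    return version_a.split(".")[0] == version_b.split(".")[0]


# The nvcc line quoted earlier in this thread
nvcc_text = "Cuda compilation tools, release 10.1, V10.1.105"
toolkit = parse_nvcc_release(nvcc_text)
print(toolkit)                           # 10.1
print(same_cuda_major(toolkit, "10.0"))  # True
print(same_cuda_major(toolkit, "9.0"))   # False
```

In practice the second argument would be `torch.version.cuda` from the environment being debugged.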

wasedaward commented 3 years ago

Hello, I think I may have solved this problem. First, I installed the packages from requirements.txt and ran into this error. Next, I uninstalled the pip packages and reinstalled pytorch through conda:

pip uninstall torch torchvision
conda install pytorch==1.1.0 torchvision==0.3.0 cudatoolkit=10.0 -c pytorch

Finally, I ran python inference.py --config yelp_config.json successfully.
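A quick way to confirm that a reinstall like this took effect, without loading the full model, is to repeat the step that failed in the traceback (moving an LSTM to the GPU) on a tiny module. This is a minimal sketch, assuming only that the error comes from the `.cuda()` call on an RNN as the traceback suggests. The function name and tensor sizes are arbitrary choices for illustration.

```python
def lstm_cuda_smoke_test():
    """Move a tiny LSTM to the GPU and run one forward pass.

    Hypothetical helper; returns a short status string instead of raising
    when PyTorch or CUDA is missing.
    """
    try:
        import torch
        import torch.nn as nn
    except ImportError:
        return "torch not installed"
    if not torch.cuda.is_available():
        return "CUDA not available"

    # Single-layer LSTM, matching num_layers=1 from the log above
    lstm = nn.LSTM(input_size=8, hidden_size=16, num_layers=1, batch_first=True)
    lstm = lstm.cuda()  # the call that raised the cuDNN error in the traceback
    x = torch.randn(2, 5, 8, device="cuda")  # (batch, seq_len, features)
    output, _ = lstm(x)
    torch.cuda.synchronize()  # surface any asynchronous CUDA error here
    return "ok, output shape {}".format(tuple(output.shape))


if __name__ == "__main__":
    print(lstm_cuda_smoke_test())
```

If this still raises CUDNN_STATUS_EXECUTION_FAILED, the problem is most likely in the PyTorch/CUDA installation rather than in inference.py.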

rpryzant commented 3 years ago

Excellent!! I will update the FAQ to reflect your fix.

rpryzant / delete_retrieve_generate

Error: RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED #29