peteanderson80 / Matterport3DSimulator

AI Research Platform for Reinforcement Learning from Real Panoramic Images.

RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED #71

Open ZuoJiaxing opened 4 years ago

ZuoJiaxing commented 4 years ago

I followed the instructions you posted to install the Docker image on Ubuntu 18.04 with an RTX 2020Ti. However, when I run "python3 tasks/R2R/train.py", I get the following error. I have tried many Docker images that meet your requirements (CUDA 9.2 + cuDNN 7 + PyTorch 1.1), but I cannot get past this error. Please help me with this! Thanks!

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/rnn.py:54: UserWarning: dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1, but got dropout=0.5 and num_layers=1
  "num_layers={}".format(dropout, num_layers))
Traceback (most recent call last):
  File "tasks/R2R/train.py", line 163, in <module>
    train_val()
  File "tasks/R2R/train.py", line 156, in train_val
    dropout_ratio, bidirectional=bidirectional).cuda()
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 265, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 193, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/rnn.py", line 127, in _apply
    self.flatten_parameters()
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/rnn.py", line 123, in flatten_parameters
    self.batch_first, bool(self.bidirectional))
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

ManaalAhi commented 3 years ago

Hello,

There is a conflict between your cuDNN and PyTorch versions. To fix this, try changing the PyTorch version. You can find one that is compatible with your CUDA and cuDNN versions at the following link: https://pytorch.org/get-started/previous-versions/
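Before swapping PyTorch versions, it can help to see which versions are actually in play inside the container. The snippet below is a minimal sketch, not part of this repository: `report_versions` is a hypothetical helper name, and it only reads attributes that PyTorch exposes (`torch.__version__`, `torch.version.cuda`, `torch.backends.cudnn.version()`). It returns a placeholder if PyTorch is not importable.

```python
def report_versions():
    """Collect the PyTorch, CUDA, and cuDNN versions visible to Python.

    Hypothetical helper for illustration; returns {"torch": None} when
    PyTorch cannot be imported in the current environment.
    """
    try:
        import torch
    except ImportError:
        return {"torch": None}

    info = {
        "torch": torch.__version__,
        # CUDA version PyTorch was built against (None for CPU-only builds).
        "cuda_built_with": torch.version.cuda,
        # cuDNN version PyTorch loaded (None if cuDNN is unavailable).
        "cudnn": torch.backends.cudnn.version(),
        "gpu_available": torch.cuda.is_available(),
    }
    if info["gpu_available"]:
        info["gpu_name"] = torch.cuda.get_device_name(0)
        # (major, minor) compute capability of the first GPU.
        info["compute_capability"] = torch.cuda.get_device_capability(0)
    return info


if __name__ == "__main__":
    for key, value in report_versions().items():
        print(f"{key}: {value}")
```

Run it with the same interpreter that runs tasks/R2R/train.py, then compare the reported CUDA and cuDNN versions against the combinations listed on the PyTorch page linked above.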

staale92 commented 1 year ago

Hi, note that the error might also be caused by your GPU being incompatible with CUDA 9.2. I had the same error, and in my case (RTX A6000 GPU) replacing the following lines in the Dockerfile:

FROM nvidia/cudagl:9.2-devel-ubuntu18.04

# Install cudnn
ENV CUDNN_VERSION 7.6.4.38
LABEL com.nvidia.cudnn.version="${CUDNN_VERSION}"

RUN apt-get update && apt-get install -y --no-install-recommends \
    libcudnn7=$CUDNN_VERSION-1+cuda9.2 \
    libcudnn7-dev=$CUDNN_VERSION-1+cuda9.2 \
    && \
    apt-mark hold libcudnn7 && \
    rm -rf /var/lib/apt/lists/*

with:

FROM nvidia/cudagl:11.1-devel-ubuntu18.04

# Install cudnn
ENV CUDNN_VERSION 7.6.4.38
LABEL com.nvidia.cudnn.version="${CUDNN_VERSION}"

RUN rm /etc/apt/sources.list.d/cuda.list
RUN rm /etc/apt/sources.list.d/nvidia-ml.list

did the trick. Your GPU seems to require CUDA > 10.0 (you can check the CUDA compatibility here).
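To illustrate the compatibility point, here is a rough sketch of the check in Python. The table of minimum CUDA toolkit versions per compute capability is my own assumption based on NVIDIA's release history and covers only a few architectures, so verify it against NVIDIA's documentation before relying on it. `MIN_CUDA_FOR_CAPABILITY` and `cuda_supports` are hypothetical names, not part of any library.

```python
# Assumed minimum CUDA toolkit version that first supported each
# compute capability (partial table for illustration; please verify).
MIN_CUDA_FOR_CAPABILITY = {
    (7, 0): (9, 0),   # Volta
    (7, 5): (10, 0),  # Turing
    (8, 0): (11, 0),  # Ampere (A100)
    (8, 6): (11, 1),  # Ampere (e.g. RTX A6000)
}


def cuda_supports(capability, cuda_version):
    """Return True if cuda_version is new enough for the given capability.

    Both arguments are (major, minor) tuples, e.g. (8, 6) and (11, 1).
    Raises KeyError for capabilities missing from the table above.
    """
    minimum = MIN_CUDA_FOR_CAPABILITY[capability]
    # Tuples compare element by element, so (9, 2) < (11, 1).
    return cuda_version >= minimum


if __name__ == "__main__":
    # RTX A6000 (compute capability 8.6) with the original CUDA 9.2 image
    print(cuda_supports((8, 6), (9, 2)))
    # The same GPU with the CUDA 11.1 base image
    print(cuda_supports((8, 6), (11, 1)))
```

Under these assumptions, the CUDA 9.2 base image is too old for an RTX A6000, while the 11.1 image is sufficient, which matches what I saw.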