jsk-ros-pkg / jsk_recognition

JSK perception ROS packages
https://github.com/jsk-ros-pkg/jsk_recognition

jsk_perception's train_ssd.py raises CUDNN_STATUS_EXECUTION_FAILED error #2506

Open mqcmd196 opened 4 years ago

mqcmd196 commented 4 years ago

When I run train_ssd.py, it raises CUDNN_STATUS_EXECUTION_FAILED. I have confirmed that my CuPy and CUDA versions are correct. If you know of a solution, please give me some advice. I also reported this problem at https://github.com/cupy/cupy/issues/3358 .

rosrun jsk_perception train_ssd.py --train-dataset-dir ./train/dataset_voc/ --val-dataset-dir ./test/dataset_voc/
/usr/lib/python2.7/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
chainer_mask_rcnn cannot be imported.
Exception in main training loop: CUDNN_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/chainer/training/trainer.py", line 316, in run
    update()
  File "/usr/local/lib/python2.7/dist-packages/chainer/training/updaters/standard_updater.py", line 175, in update
    self.update_core()
  File "/usr/local/lib/python2.7/dist-packages/chainer/training/updaters/standard_updater.py", line 187, in update_core
    optimizer.update(loss_func, *in_arrays)
  File "/usr/local/lib/python2.7/dist-packages/chainer/optimizer.py", line 864, in update
    loss = lossfun(*args, **kwds)
  File "/usr/local/lib/python2.7/dist-packages/chainer/link.py", line 294, in __call__
    out = forward(*args, **kwargs)
  File "/home/yoshiki/research_ws/src/jsk_recognition/jsk_perception/scripts/train_ssd.py", line 69, in forward
    mb_locs, mb_confs = self.model(imgs)
  File "/usr/local/lib/python2.7/dist-packages/chainer/link.py", line 294, in __call__
    out = forward(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/chainercv/links/model/ssd/ssd.py", line 130, in forward
    return self.multibox(self.extractor(x))
  File "/usr/local/lib/python2.7/dist-packages/chainer/link.py", line 294, in __call__
    out = forward(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/chainercv/links/model/ssd/ssd_vgg16.py", line 203, in forward
    ys = super(VGG16Extractor512, self).forward(x)
  File "/usr/local/lib/python2.7/dist-packages/chainercv/links/model/ssd/ssd_vgg16.py", line 70, in forward
    h = F.relu(self.conv1_1(x))
  File "/usr/local/lib/python2.7/dist-packages/chainer/link.py", line 294, in __call__
    out = forward(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/chainer/links/connection/convolution_2d.py", line 184, in forward
    groups=self.groups)
  File "/usr/local/lib/python2.7/dist-packages/chainer/functions/connection/convolution_2d.py", line 589, in convolution_2d
    y, = fnode.apply(args)
  File "/usr/local/lib/python2.7/dist-packages/chainer/function_node.py", line 321, in apply
    outputs = self.forward(in_data)
  File "/usr/local/lib/python2.7/dist-packages/chainer/function_node.py", line 512, in forward
    return self.forward_gpu(inputs)
  File "/usr/local/lib/python2.7/dist-packages/chainer/functions/connection/convolution_2d.py", line 189, in forward_gpu
    return self._forward_cudnn(x, W, b, y)
  File "/usr/local/lib/python2.7/dist-packages/chainer/functions/connection/convolution_2d.py", line 250, in _forward_cudnn
    auto_tune=auto_tune, tensor_core=tensor_core)
  File "cupy/cudnn.pyx", line 1575, in cupy.cudnn.convolution_forward
  File "cupy/cuda/cudnn.pyx", line 1211, in cupy.cuda.cudnn.convolutionForward
  File "cupy/cuda/cudnn.pyx", line 715, in cupy.cuda.cudnn.check_status
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
  File "/home/yoshiki/research_ws/src/jsk_recognition/jsk_perception/scripts/train_ssd.py", line 252, in <module>
    main()
  File "/home/yoshiki/research_ws/src/jsk_recognition/jsk_perception/scripts/train_ssd.py", line 248, in main
    trainer.run()
  File "/usr/local/lib/python2.7/dist-packages/chainer/training/trainer.py", line 349, in run
    six.reraise(*exc_info)
  File "/usr/local/lib/python2.7/dist-packages/chainer/training/trainer.py", line 316, in run
    update()
  File "/usr/local/lib/python2.7/dist-packages/chainer/training/updaters/standard_updater.py", line 175, in update
    self.update_core()
  File "/usr/local/lib/python2.7/dist-packages/chainer/training/updaters/standard_updater.py", line 187, in update_core
    optimizer.update(loss_func, *in_arrays)
  File "/usr/local/lib/python2.7/dist-packages/chainer/optimizer.py", line 864, in update
    loss = lossfun(*args, **kwds)
  File "/usr/local/lib/python2.7/dist-packages/chainer/link.py", line 294, in __call__
    out = forward(*args, **kwargs)
  File "/home/yoshiki/research_ws/src/jsk_recognition/jsk_perception/scripts/train_ssd.py", line 69, in forward
    mb_locs, mb_confs = self.model(imgs)
  File "/usr/local/lib/python2.7/dist-packages/chainer/link.py", line 294, in __call__
    out = forward(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/chainercv/links/model/ssd/ssd.py", line 130, in forward
    return self.multibox(self.extractor(x))
  File "/usr/local/lib/python2.7/dist-packages/chainer/link.py", line 294, in __call__
    out = forward(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/chainercv/links/model/ssd/ssd_vgg16.py", line 203, in forward
    ys = super(VGG16Extractor512, self).forward(x)
  File "/usr/local/lib/python2.7/dist-packages/chainercv/links/model/ssd/ssd_vgg16.py", line 70, in forward
    h = F.relu(self.conv1_1(x))
  File "/usr/local/lib/python2.7/dist-packages/chainer/link.py", line 294, in __call__
    out = forward(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/chainer/links/connection/convolution_2d.py", line 184, in forward
    groups=self.groups)
  File "/usr/local/lib/python2.7/dist-packages/chainer/functions/connection/convolution_2d.py", line 589, in convolution_2d
    y, = fnode.apply(args)
  File "/usr/local/lib/python2.7/dist-packages/chainer/function_node.py", line 321, in apply
    outputs = self.forward(in_data)
  File "/usr/local/lib/python2.7/dist-packages/chainer/function_node.py", line 512, in forward
    return self.forward_gpu(inputs)
  File "/usr/local/lib/python2.7/dist-packages/chainer/functions/connection/convolution_2d.py", line 189, in forward_gpu
    return self._forward_cudnn(x, W, b, y)
  File "/usr/local/lib/python2.7/dist-packages/chainer/functions/connection/convolution_2d.py", line 250, in _forward_cudnn
    auto_tune=auto_tune, tensor_core=tensor_core)
  File "cupy/cudnn.pyx", line 1575, in cupy.cudnn.convolution_forward
  File "cupy/cuda/cudnn.pyx", line 1211, in cupy.cuda.cudnn.convolutionForward
  File "cupy/cuda/cudnn.pyx", line 715, in cupy.cuda.cudnn.check_status
cupy.cuda.cudnn.CuDNNError: CUDNN_STATUS_EXECUTION_FAILED

I also exported a cuDNN debug log: cudnn_dubug.log

This is my environment. I installed cupy-cuda91 with pip2. python -c 'import cupy; cupy.show_config()' shows

CuPy Version          : 6.7.0
CUDA Root             : /usr
CUDA Build Version    : 9010
CUDA Driver Version   : 10020
CUDA Runtime Version  : 9010
cuDNN Build Version   : 7102
cuDNN Version         : 7102
NCCL Build Version    : 2115
NCCL Runtime Version  : (unknown)
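
For readers unfamiliar with the integer codes in the output above, here is a minimal sketch of how they can be decoded. It assumes the usual CUDA encoding (major * 1000 + minor * 10) and the cuDNN 7.x encoding (major * 1000 + minor * 100 + patch). The helper names `decode_cuda_version` and `decode_cudnn_version` are hypothetical and are not part of CuPy.

```python
from __future__ import print_function


def decode_cuda_version(code):
    """Decode a CUDA version integer, assumed to be major * 1000 + minor * 10."""
    major, rest = divmod(code, 1000)
    return major, rest // 10


def decode_cudnn_version(code):
    """Decode a cuDNN 7.x version integer, assumed to be
    major * 1000 + minor * 100 + patch."""
    major, rest = divmod(code, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


# Values copied from the show_config() output above.
runtime = decode_cuda_version(9010)
driver = decode_cuda_version(10020)
cudnn = decode_cudnn_version(7102)

print("CUDA runtime: %d.%d" % runtime)
print("Driver supports CUDA up to: %d.%d" % driver)
print("cuDNN: %d.%d.%d" % cudnn)
```

Read this way, the configuration is a CUDA 9.1 runtime with cuDNN 7.1.2, running on a driver that supports CUDA versions up to 10.2.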

nvcc -V shows

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Nov__3_21:07:56_CDT_2017
Cuda compilation tools, release 9.1, V9.1.85

OS: Ubuntu 18.04.4 64-bit, GPU: NVIDIA RTX 2080 Ti
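
One thing that may be worth checking with this combination is whether the CUDA toolkit is new enough for the GPU architecture. This has not been confirmed as the cause in this thread. The sketch below assumes that the RTX 2080 Ti has compute capability 7.5 and that the minimum toolkit releases in the table match NVIDIA's release notes. `MIN_CUDA_FOR_CC` and `toolkit_supports` are hypothetical names used only for illustration.

```python
# Minimum CUDA toolkit release assumed to support each compute capability
# (Volta 7.0, Turing 7.5, Ampere 8.0). Illustrative, not exhaustive.
MIN_CUDA_FOR_CC = {
    (7, 0): (9, 0),
    (7, 5): (10, 0),
    (8, 0): (11, 0),
}


def toolkit_supports(compute_capability, toolkit_version):
    """Return True if the toolkit is at least the minimum release
    listed for the given compute capability."""
    return toolkit_version >= MIN_CUDA_FOR_CC[compute_capability]


# Assumption: RTX 2080 Ti is compute capability 7.5; the toolkit here is 9.1.
supported = toolkit_supports((7, 5), (9, 1))
```

Under these assumptions `supported` is False, which would point toward trying a CUDA 10.x toolkit with the matching CuPy wheel.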

Thanks.

zlg9folira commented 3 years ago

Does jsk_perception require NVIDIA/CUDA? I am trying to install it on arm64 Buster (no CUDA support). I get the same error; however, I cannot install CuPy on this system.

knorth55 commented 3 years ago

Does jsk_perception require NVIDIA/CUDA? I am trying to install it on arm64 Buster (no CUDA support). I get the same error; however, I cannot install CuPy on this system.

No, CUDA is not needed for normal usage. However, if you want to try the deep learning nodes (SSD, Faster R-CNN, Mask R-CNN, etc.), you need a good GPU and CUDA.
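
To check whether a machine can run the GPU-backed nodes at all, something like the following sketch may help. It assumes CuPy is the GPU backend, as in the report above. `gpu_available` is a hypothetical helper, not part of jsk_perception.

```python
def gpu_available():
    """Return True if CuPy imports and reports at least one CUDA device.

    Any failure (CuPy not installed, no driver, no device) is treated
    as "no usable GPU".
    """
    try:
        import cupy
        return cupy.cuda.runtime.getDeviceCount() > 0
    except Exception:
        return False


# Chainer convention: a negative device id means "run on the CPU".
device_id = 0 if gpu_available() else -1
```

On a system without CUDA, such as the arm64 board mentioned above, this should fall back to `device_id = -1`.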