open-mmlab / mmaction2

OpenMMLab's Next Generation Video Understanding Toolbox and Benchmark
https://mmaction2.readthedocs.io
Apache License 2.0

resume training cuDNN error #860

Closed. richardkxu closed this issue 3 years ago.

richardkxu commented 3 years ago

Describe the bug

Hi, I have encountered the following error when resuming training by setting resume_from = 'work_dirs/ircsn_ig65m_pretrained_r152_16x1x1_58e_ucf101_rgb/epoch_20.pth'. What might be the cause of this error? Thank you!

Traceback (most recent call last):
  File "/home/richardkxu/Documents/mmaction2/tools/train.py", line 199, in <module>
    main()
  File "/home/richardkxu/Documents/mmaction2/tools/train.py", line 195, in main
    meta=meta)
  File "/home/richardkxu/Documents/mmaction2/mmaction/apis/train.py", line 163, in train_model
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs, **runner_kwargs)
  File "/home/richardkxu/anaconda3/envs/mmactionv2-art/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 125, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/richardkxu/anaconda3/envs/mmactionv2-art/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 51, in train
    self.call_hook('after_train_iter')
  File "/home/richardkxu/anaconda3/envs/mmactionv2-art/lib/python3.7/site-packages/mmcv/runner/base_runner.py", line 307, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/richardkxu/anaconda3/envs/mmactionv2-art/lib/python3.7/site-packages/mmcv/runner/hooks/optimizer.py", line 27, in after_train_iter
    runner.outputs['loss'].backward()
  File "/home/richardkxu/anaconda3/envs/mmactionv2-art/lib/python3.7/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/richardkxu/anaconda3/envs/mmactionv2-art/lib/python3.7/site-packages/torch/autograd/__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
Killing subprocess 18312
Killing subprocess 18313
Traceback (most recent call last):
  File "/home/richardkxu/anaconda3/envs/mmactionv2-art/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/richardkxu/anaconda3/envs/mmactionv2-art/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/richardkxu/anaconda3/envs/mmactionv2-art/lib/python3.7/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/home/richardkxu/anaconda3/envs/mmactionv2-art/lib/python3.7/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/home/richardkxu/anaconda3/envs/mmactionv2-art/lib/python3.7/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/richardkxu/anaconda3/envs/mmactionv2-art/bin/python', '-u', '/home/richardkxu/Documents/mmaction2/tools/train.py', '--local_rank=1', 'configs/recognition/csn/ircsn_ig65m_pretrained_r152_16x1x1_58e_ucf101_rgb.py', '--launcher', 'pytorch', '--validate', '--test-last', '--test-best', '--seed', '0', '--deterministic']' returned non-zero exit status 1.
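For reference, the resume setup described above boils down to a couple of config fields. The following is a minimal sketch assuming the standard mmaction2 0.x config keys (resume_from restores the weights, the optimizer state, and the epoch counter, while load_from would restore weights only):

# minimal sketch of the relevant config fields, not the full config
resume_from = 'work_dirs/ircsn_ig65m_pretrained_r152_16x1x1_58e_ucf101_rgb/epoch_20.pth'
load_from = None   # would load weights only and start from epoch 0
total_epochs = 58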
congee524 commented 3 years ago

Could you provide your versions of PyTorch, TorchVision, TorchAudio (if installed), CudaToolkit, CUDA, and the NVIDIA driver?
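One quick way to collect most of these is a short check run inside the training environment (a sketch; the driver version can be read from nvidia-smi):

import torch
import torchvision

print('PyTorch:', torch.__version__)
print('TorchVision:', torchvision.__version__)
print('CUDA (torch build):', torch.version.cuda)
print('cuDNN:', torch.backends.cudnn.version())
print('GPU available:', torch.cuda.is_available())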

richardkxu commented 3 years ago

Yes,

PyTorch: 1.8.1 (py3.7_cuda10.2_cudnn7.6.5_0)
TorchVision: 0.9.1
TorchAudio: 0.8.1
CudaToolkit: 10.2.89
CUDA: 10.2
NVIDIA driver: 460.27.04

congee524 commented 3 years ago

https://discuss.pytorch.org/t/cuda-error-cublas-status-internal-error-when-calling-cublascreate-handle/114341/10

Does this help you? I don't know how to solve this problem...
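CUDNN_STATUS_NOT_INITIALIZED is usually an environment-level failure (often exhausted GPU memory or a mismatched PyTorch/CUDA/cuDNN install). A standalone check along these lines (a sketch, assuming a single visible GPU) can confirm whether cuDNN initializes at all outside of mmaction2:

import torch
import torch.nn as nn

# tiny conv + backward; exercises the same cuDNN backward path that fails above
x = torch.randn(2, 3, 32, 32, device='cuda', requires_grad=True)
conv = nn.Conv2d(3, 8, kernel_size=3, padding=1).cuda()
conv(x).sum().backward()
print('cuDNN ok, version:', torch.backends.cudnn.version())

If this also fails, the problem is in the environment rather than in the resume logic.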

innerlee commented 3 years ago

It is likely an environment or PyTorch issue.