open-mmlab / mmaction2

OpenMMLab's Next Generation Video Understanding Toolbox and Benchmark
https://mmaction2.readthedocs.io
Apache License 2.0

CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm` #2793

Open learn-learn202 opened 7 months ago

learn-learn202 commented 7 months ago

Branch

main branch (1.x version, such as v1.0.0, or dev-1.x branch)

Prerequisite

Environment

CUDA_HOME: /usr/local/cuda-11.1
NVCC: Cuda compilation tools, release 11.1, V11.1.105
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.8.1+cu111
PyTorch compiling details: PyTorch built with:

TorchVision: 0.9.1+cu111
OpenCV: 4.9.0
MMEngine: 0.10.3
MMAction2: 1.2.0+4d6c934
MMCV: 2.1.0

Describe the bug

Traceback (most recent call last):
  File "tools/train.py", line 145, in <module>
    main()
  File "tools/train.py", line 141, in main
    runner.train()
  File "/home/miniconda3/envs/openmmlab/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1777, in train
    model = self.train_loop.run()  # type: ignore
  File "/home/miniconda3/envs/openmmlab/lib/python3.8/site-packages/mmengine/runner/loops.py", line 96, in run
    self.run_epoch()
  File "/home/miniconda3/envs/openmmlab/lib/python3.8/site-packages/mmengine/runner/loops.py", line 112, in run_epoch
    self.run_iter(idx, data_batch)
  File "/home/miniconda3/envs/openmmlab/lib/python3.8/site-packages/mmengine/runner/loops.py", line 128, in run_iter
    outputs = self.runner.model.train_step(
  File "/home/miniconda3/envs/openmmlab/lib/python3.8/site-packages/mmengine/model/base_model/base_model.py", line 116, in train_step
    optim_wrapper.update_params(parsed_losses)
  File "/home/miniconda3/envs/openmmlab/lib/python3.8/site-packages/mmengine/optim/optimizer/optimizer_wrapper.py", line 196, in update_params
    self.backward(loss)
  File "/home/miniconda3/envs/openmmlab/lib/python3.8/site-packages/mmengine/optim/optimizer/optimizer_wrapper.py", line 220, in backward
    loss.backward(**kwargs)
  File "/home/miniconda3/envs/openmmlab/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/miniconda3/envs/openmmlab/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)

I encountered this issue when training with all three methods (MViT, TimeSformer, UniFormerV2).

Reproduces the problem - code sample

No response

Reproduces the problem - command or script

python tools/train.py configs/recognition/mvit/mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb.py --seed=0 --deterministic

Reproduces the problem - error message

No response

Additional information

No response

Ash-one commented 4 months ago

same issue