Traceback (most recent call last):
File "tools/train.py", line 145, in
main()
File "tools/train.py", line 141, in main
runner.train()
File "/home/miniconda3/envs/openmmlab/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1777, in train
model = self.train_loop.run() # type: ignore
File "/home/miniconda3/envs/openmmlab/lib/python3.8/site-packages/mmengine/runner/loops.py", line 96, in run
self.run_epoch()
File "/home/miniconda3/envs/openmmlab/lib/python3.8/site-packages/mmengine/runner/loops.py", line 112, in run_epoch
self.run_iter(idx, data_batch)
File "/home/miniconda3/envs/openmmlab/lib/python3.8/site-packages/mmengine/runner/loops.py", line 128, in run_iter
outputs = self.runner.model.train_step(
File "/home/miniconda3/envs/openmmlab/lib/python3.8/site-packages/mmengine/model/base_model/base_model.py", line 116, in train_step
optim_wrapper.update_params(parsed_losses)
File "/home/miniconda3/envs/openmmlab/lib/python3.8/site-packages/mmengine/optim/optimizer/optimizer_wrapper.py", line 196, in update_params
self.backward(loss)
File "/home/miniconda3/envs/openmmlab/lib/python3.8/site-packages/mmengine/optim/optimizer/optimizer_wrapper.py", line 220, in backward
loss.backward(**kwargs)
File "/home/miniconda3/envs/openmmlab/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/miniconda3/envs/openmmlab/lib/python3.8/site-packages/torch/autograd/init.py", line 145, in backward
Variable._execution_engine.run_backward(
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)
I encountered this issue when running all three methods(mvit,timesformer,uniformerv2)
Branch
main branch (1.x version, such as
v1.0.0
, ordev-1.x
branch)Prerequisite
Environment
CUDA_HOME: /usr/local/cuda-11.1 NVCC: Cuda compilation tools, release 11.1, V11.1.105 GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0 PyTorch: 1.8.1+cu111 PyTorch compiling details: PyTorch built with:
TorchVision: 0.9.1+cu111 OpenCV: 4.9.0 MMEngine: 0.10.3 MMAction2: 1.2.0+4d6c934 MMCV: 2.1.0
Describe the bug
Traceback (most recent call last): File "tools/train.py", line 145, in
main()
File "tools/train.py", line 141, in main
runner.train()
File "/home/miniconda3/envs/openmmlab/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1777, in train
model = self.train_loop.run() # type: ignore
File "/home/miniconda3/envs/openmmlab/lib/python3.8/site-packages/mmengine/runner/loops.py", line 96, in run
self.run_epoch()
File "/home/miniconda3/envs/openmmlab/lib/python3.8/site-packages/mmengine/runner/loops.py", line 112, in run_epoch
self.run_iter(idx, data_batch)
File "/home/miniconda3/envs/openmmlab/lib/python3.8/site-packages/mmengine/runner/loops.py", line 128, in run_iter
outputs = self.runner.model.train_step(
File "/home/miniconda3/envs/openmmlab/lib/python3.8/site-packages/mmengine/model/base_model/base_model.py", line 116, in train_step
optim_wrapper.update_params(parsed_losses)
File "/home/miniconda3/envs/openmmlab/lib/python3.8/site-packages/mmengine/optim/optimizer/optimizer_wrapper.py", line 196, in update_params
self.backward(loss)
File "/home/miniconda3/envs/openmmlab/lib/python3.8/site-packages/mmengine/optim/optimizer/optimizer_wrapper.py", line 220, in backward
loss.backward(**kwargs)
File "/home/miniconda3/envs/openmmlab/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/miniconda3/envs/openmmlab/lib/python3.8/site-packages/torch/autograd/init.py", line 145, in backward
Variable._execution_engine.run_backward(
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling
cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)
I encountered this issue when running all three methods(mvit,timesformer,uniformerv2)
Reproduces the problem - code sample
No response
Reproduces the problem - command or script
python tools/train.py configs/recognition/mvit/mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb.py --seed=0 --deterministic
Reproduces the problem - error message
No response
Additional information
No response