hailanyi / VirConv

Virtual Sparse Convolution for Multimodal 3D Object Detection
https://arxiv.org/abs/2303.02314
Apache License 2.0

[Bug] RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)` when training the VirConv-L model #72

Closed. Flying-Angels closed this issue 1 month ago.

Flying-Angels commented 1 month ago

During epoch 0/50, at training iteration 3/3712, I got the following error:

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

How can I fix this? The full traceback is below:

Traceback (most recent call last):
  File "/home/zhw/project/virconv/tools/train.py", line 201, in <module>
    main()
  File "/home/zhw/project/virconv/tools/train.py", line 152, in main
    train_model(
  File "/home/zhw/project/virconv/tools/train_utils/train_utils.py", line 95, in train_model
    accumulated_iter = train_one_epoch(
  File "/home/zhw/project/virconv/tools/train_utils/train_utils.py", line 44, in train_one_epoch
    loss, tb_dict, disp_dict = model_func(model, batch)
  File "/home/zhw/project/virconv/pcdet/models/__init__.py", line 32, in model_func
    ret_dict, tb_dict, disp_dict = model(batch_dict)
  File "/home/zhw/conda/anaconda3/envs/virconv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/zhw/project/virconv/pcdet/models/detectors/voxel_rcnn.py", line 10, in forward
    batch_dict = cur_module(batch_dict)
  File "/home/zhw/conda/anaconda3/envs/virconv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/zhw/project/virconv/pcdet/models/backbones_3d/spconv_backbone.py", line 658, in forward
    newx_conv1 = self.vir_conv1(newinput_sp_tensor, batch_size, calib, 1, self.x_trans_train, trans_param)
  File "/home/zhw/conda/anaconda3/envs/virconv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/zhw/project/virconv/pcdet/models/backbones_3d/spconv_backbone.py", line 216, in forward
    uv_coords, depth = index2uv(d3_feat2.indices, batch_size, calib, stride, x_trans_train, trans_param)
  File "/home/zhw/project/virconv/pcdet/models/backbones_3d/spconv_backbone.py", line 72, in index2uv
    pts_rect = calib[b_i].lidar_to_rect_cuda(cur_pts[:, 0:3])
  File "/home/zhw/project/virconv/pcdet/utils/calibration_kitti.py", line 129, in lidar_to_rect_cuda
    pts_rect = torch.matmul(pts_lidar_hom[:], torch.matmul(V2C, R0))
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
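To check whether the failure comes from this cuBLAS call itself rather than from VirConv's code, a standalone script can run the same two float32 matmuls outside the training loop and compare each GPU against a CPU reference. This is a minimal sketch under assumptions: `V2C` is taken to be a 4x3 (transposed velodyne-to-camera) matrix and `R0` a 3x3 (transposed rectification) matrix so that their product is 4x3; the actual shapes in `calibration_kitti.py` may differ, and the random calibration values are placeholders, not KITTI data. The helper `check_device` and the point count are hypothetical. Because CUDA errors are reported asynchronously, running the script (or training) with `CUDA_LAUNCH_BLOCKING=1` makes the line shown in the traceback more trustworthy.

```python
import torch


def lidar_to_rect(pts_lidar: torch.Tensor, V2C: torch.Tensor, R0: torch.Tensor) -> torch.Tensor:
    """Map (N, 3) LiDAR points to the rectified camera frame.

    Assumption (not taken from the VirConv source): V2C is 4x3 and R0 is 3x3,
    so V2C @ R0 is 4x3 and the result is (N, 3).
    """
    # Append a homogeneous coordinate: (N, 3) -> (N, 4)
    ones = torch.ones((pts_lidar.shape[0], 1), dtype=pts_lidar.dtype, device=pts_lidar.device)
    pts_lidar_hom = torch.cat([pts_lidar, ones], dim=1)
    # Same call pattern as in the traceback; on CUDA these matmuls go through cuBLAS
    return torch.matmul(pts_lidar_hom, torch.matmul(V2C, R0))


def check_device(device: str, n_points: int = 20000, seed: int = 0) -> bool:
    """Hypothetical helper: run the transform on `device` and compare with a CPU reference."""
    gen = torch.Generator().manual_seed(seed)
    pts = torch.randn(n_points, 3, generator=gen)
    V2C = torch.randn(4, 3, generator=gen)  # placeholder calibration, not real KITTI values
    R0 = torch.randn(3, 3, generator=gen)
    ref = lidar_to_rect(pts, V2C, R0)
    if device.startswith("cuda"):
        # Disable TF32 so that Ampere-class GPUs are compared at full float32 precision
        torch.backends.cuda.matmul.allow_tf32 = False
    out = lidar_to_rect(pts.to(device), V2C.to(device), R0.to(device))
    if device.startswith("cuda"):
        torch.cuda.synchronize()  # surface asynchronous CUDA errors at this point
    return torch.allclose(out.cpu(), ref, atol=1e-4)


if __name__ == "__main__":
    print("torch", torch.__version__, "CUDA", torch.version.cuda)
    print("cpu ok:", check_device("cpu"))
    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            name = torch.cuda.get_device_name(i)
            print(f"cuda:{i} ({name}) ok:", check_device(f"cuda:{i}"))
```

If the CPU check passes but a `cuda:i` check raises the same `CUBLAS_STATUS_EXECUTION_FAILED`, the problem most likely lies in the PyTorch/CUDA build or driver rather than in VirConv's code, which would be consistent with the reporter's later finding that upgrading PyTorch made the error go away.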

Environment info:
{
  "CUDA": {
    "GPU": ["NVIDIA GeForce GTX 1080 Ti", "NVIDIA GeForce GTX 1080 Ti"],
    "available": true,
    "version": "11.1"
  },
  "Packages": {
    "PyTorch_debug": false,
    "PyTorch_version": "1.8.1+cu111",
    "TTS": "0.6.1",
    "numpy": "1.19.5"
  },
  "System": {
    "OS": "Linux",
    "architecture": ["64bit", "ELF"],
    "processor": "x86_64",
    "python": "3.9.19",
    "version": "#213-Ubuntu SMP Fri Aug 2 19:14:16 UTC 2024"
  }
}

Flying-Angels commented 1 month ago

After upgrading PyTorch to version 1.10.1, this error no longer occurs.
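For anyone hitting the same error, a quick way to see whether an environment is still on an older build is to compare `torch.__version__` against 1.10.1. The threshold here is only the version the reporter found to work, not an officially documented minimum, and the helpers `parse_version` and `at_least` are hypothetical. This sketch uses only the standard library and drops local build tags such as `+cu111`.

```python
import re


def parse_version(version: str) -> tuple:
    """Turn e.g. '1.8.1+cu111' into (1, 8, 1), ignoring the local '+...' build tag."""
    public = version.split("+", 1)[0]
    parts = []
    for piece in public.split(".")[:3]:
        # Keep the leading digits only, so '0a0' (pre-release) becomes 0
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def at_least(version: str, minimum: str = "1.10.1") -> bool:
    """True if `version` is at or above `minimum` (major, minor, patch only)."""
    return parse_version(version) >= parse_version(minimum)
```

In a training environment this would be called as `at_least(torch.__version__)`; the environment reported above (`1.8.1+cu111`) returns `False`. Pre-release ordering is deliberately ignored, so if `packaging` is installed, `packaging.version.Version` is the more complete choice.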