pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

CUBLAS_STATUS_NOT_SUPPORTED when calling cublasDgemv #94294

Open Flamefire opened 1 year ago

Flamefire commented 1 year ago

🐛 Describe the bug

When running the distributed/test_data_parallel or test_nn tests, they fail with

ERROR: test_data_parallel (__main__.TestDataParallel)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/dev/shm/s3248973-EasyBuild/PyTorch/1.12.1/foss-2021b-CUDA-11.4.1/pytorch-v1.12.1/test/distributed/test_data_parallel.py", line 353, in test_data_parallel
    out = dp.data_parallel(l, i, dev_id)
  File "/tmp/easybuild-tmp/eb-HhIeo8/tmpUxRyEj/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 231, in data_parallel
    outputs = parallel_apply(replicas, inputs, module_kwargs, used_device_ids)
  File "/tmp/easybuild-tmp/eb-HhIeo8/tmpUxRyEj/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/tmp/easybuild-tmp/eb-HhIeo8/tmpUxRyEj/lib/python3.9/site-packages/torch/_utils.py", line 461, in reraise
    raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/tmp/easybuild-tmp/eb-HhIeo8/tmpUxRyEj/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/tmp/easybuild-tmp/eb-HhIeo8/tmpUxRyEj/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/tmp/easybuild-tmp/eb-HhIeo8/tmpUxRyEj/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasLtMatmulAlgoGetHeuristic( ltHandle, computeDesc.descriptor(), Adesc.descriptor(), Bdesc.descriptor(), Cdesc.descriptor(), Cdesc.descriptor(), preference.descriptor(), 1, &heuristicResult, &returnedResult)`

and

ERROR: test_spectral_norm (__main__.TestNN)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/dev/shm/s3248973-EasyBuild/PyTorch/1.12.1/foss-2021b-CUDA-11.4.1/pytorch-v1.12.1/test/test_nn.py", line 4593, in test_spectral_norm
    gradcheck(fn, (input.clone().requires_grad_(),), check_batched_grad=False)
  File "/tmp/easybuild-tmp/eb-HhIeo8/tmpUxRyEj/lib/python3.9/site-packages/torch/testing/_internal/common_utils.py", line 3019, in gradcheck
    return torch.autograd.gradcheck(fn, inputs, **kwargs)
  File "/tmp/easybuild-tmp/eb-HhIeo8/tmpUxRyEj/lib/python3.9/site-packages/torch/autograd/gradcheck.py", line 1414, in gradcheck
    return _gradcheck_helper(**args)
  File "/tmp/easybuild-tmp/eb-HhIeo8/tmpUxRyEj/lib/python3.9/site-packages/torch/autograd/gradcheck.py", line 1423, in _gradcheck_helper
    func_out = func(*tupled_inputs)
  File "/dev/shm/s3248973-EasyBuild/PyTorch/1.12.1/foss-2021b-CUDA-11.4.1/pytorch-v1.12.1/test/test_nn.py", line 4590, in fn
    out1 = wrapped_m(input)
  File "/tmp/easybuild-tmp/eb-HhIeo8/tmpUxRyEj/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/tmp/easybuild-tmp/eb-HhIeo8/tmpUxRyEj/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/tmp/easybuild-tmp/eb-HhIeo8/tmpUxRyEj/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/tmp/easybuild-tmp/eb-HhIeo8/tmpUxRyEj/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/tmp/easybuild-tmp/eb-HhIeo8/tmpUxRyEj/lib/python3.9/site-packages/torch/_utils.py", line 461, in reraise
    raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/tmp/easybuild-tmp/eb-HhIeo8/tmpUxRyEj/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/tmp/easybuild-tmp/eb-HhIeo8/tmpUxRyEj/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1137, in _call_impl
    result = hook(self, input)
  File "/tmp/easybuild-tmp/eb-HhIeo8/tmpUxRyEj/lib/python3.9/site-packages/torch/nn/utils/spectral_norm.py", line 105, in __call__
    setattr(module, self.name, self.compute_weight(module, do_power_iteration=module.training))
  File "/tmp/easybuild-tmp/eb-HhIeo8/tmpUxRyEj/lib/python3.9/site-packages/torch/nn/utils/spectral_norm.py", line 84, in compute_weight
    v = normalize(torch.mv(weight_mat.t(), u), dim=0, eps=self.eps, out=v)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasDgemv(handle, op, m, n, &alpha, a, lda, x, incx, &beta, y, incy)`

I found that test_nn uses torch.nn.DataParallel when multiple GPUs are present, so I assume it is the same issue. That error isn't listed in the NVIDIA docs for cublasDgemv, so I don't know why it fails.
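For reference, the failing call pattern boils down to roughly the following (a sketch, assuming at least two visible GPUs; the shapes, dtype, and device ids are illustrative, not copied from the test):

# Minimal sketch of the failing pattern; shapes/dtype/device ids are assumptions.
import torch
import torch.nn as nn
from torch.nn.parallel import data_parallel

if torch.cuda.device_count() >= 2:
    l = nn.Linear(10, 5).double().cuda()                        # float64 exercises the D-precision cuBLAS paths
    i = torch.randn(20, 10, dtype=torch.float64, device="cuda")
    out = data_parallel(l, i, device_ids=[0, 1])                # replicas run F.linear on each GPU
    print(out.shape)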

Versions

cc @ngimel @csarofeen @ptrblck @xwang233

Flamefire commented 1 year ago

With some experimentation I found that swapping CUDA 11.4 to 11.5.0 resolves the failure. Is this expected?

malfet commented 1 year ago

Can you repro it with PyTorch-1.13.x? Also, mind running python -mtorch.utils.collect_env? (I think GPU types you are testing on are crucial to the problem, as this test was run extensively in CI, but only on sm_52 GPU)

ptrblck commented 1 year ago

With some experimentation I found that swapping CUDA 11.4 to 11.5.0 resolves the failure. Is this expected?

We haven't seen this issue in the past on devices we are testing, but I also don't know which GPUs you are using. As @malfet mentioned: could you use the latest stable release with CUDA 11.7 or the nightly with 11.8 and check if you are still seeing the error?

Flamefire commented 1 year ago

Can you repro it with PyTorch-1.13.x? Also, mind running python -mtorch.utils.collect_env? (I think GPU types you are testing on are crucial to the problem, as this test was run extensively in CI, but only on sm_52 GPU)

Not yet. This is an HPC environment where we need to build PyTorch from source; I only just got 1.12 running, and 1.13 will be some more work. I can try tomorrow to repro this with the PyPI 1.12 version and then test the PyPI 1.13 one.

The GPUs are A100s. I found a place where TF32 might be enabled on A100s before calling cublasDgemv, but that path is not used (I also completely removed that part, so that surely isn't the issue).
It's also strange that the failure doesn't seem to be related to the non-tensor inputs (i.e. sizes, alpha, beta, strides, etc.): I logged those, and some calls with the same sizes/strides work while others don't.

I can imagine that in the DataParallel case some tensors reside on the "wrong" GPU, which is why it fails. But I have no idea how to get the tensors' GPU at the point of the call.
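One way to see which device the parameters and inputs are on right before the failing call would be a forward pre-hook (a sketch, not something taken from the test suite):

import torch
import torch.nn as nn

def log_devices(module, inputs):
    # Print the device of every parameter and tensor input just before forward runs.
    param_devs = {name: str(p.device) for name, p in module.named_parameters(recurse=False)}
    input_devs = [str(t.device) for t in inputs if isinstance(t, torch.Tensor)]
    print(f"{module.__class__.__name__}: params on {param_devs}, inputs on {input_devs}")

l = nn.Linear(10, 5).cuda()
l.register_forward_pre_hook(log_devices)
l(torch.randn(4, 10, device="cuda"))  # a device mismatch here would support the "wrong GPU" theory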

could you use the latest stable release with CUDA 11.7 or the nightly with 11.8 and check if you are still seeing the error?

As mentioned, I tried with CUDA 11.5 and it worked. Another toolchain (GCC 11.3, Python 3.10) using CUDA 11.7 also worked. So it might be related to CUDA 11.4. I suspect PyTorch uses some CUDA 11.5+ specifics which are not supported in 11.4.

Flamefire commented 1 year ago

I can try tomorrow to repro this with the PyPI 1.12 version and then test the PyPI 1.13 one.

That failed. 1.12 isn't built with sm_80 (required for Ampere GPUs), and 1.13 requires CUDA >= 11.5 due to a dependency on libcublasLt.so.11 (it also bundles CUDA). So I cannot test PyTorch 1.13 with CUDA 11.4 (at least currently).
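A quick way to check which architectures and CUDA version an installed wheel was built for (a sketch; the comments show what to look for, not measured output):

import torch

print(torch.__version__)           # installed PyTorch version
print(torch.version.cuda)          # CUDA version the wheel was built against
print(torch.cuda.get_arch_list())  # 'sm_80' must appear here for A100 (Ampere) support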

ngimel commented 1 year ago

If you can move to CUDA 11.7-11.8 instead of 11.4, please do so (you don't need to update your driver for that). PyTorch is not using any 11.5+ specifics, but the underlying cuBLAS functions might have bugfixes in 11.5+.

ZhiyuanChen commented 1 year ago

I tried on CUDA 11.7; the issue still persists.

Collecting environment information...
PyTorch version: 1.13.1
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.27

Python version: 3.8.16 (default, Jan 17 2023, 23:13:24)  [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-3.10.0-1127.el7.x86_64-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 11.7.99
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB
GPU 4: NVIDIA A100-SXM4-80GB
GPU 5: NVIDIA A100-SXM4-80GB
GPU 6: NVIDIA A100-SXM4-80GB
GPU 7: NVIDIA A100-SXM4-80GB

Nvidia driver version: 515.43.04
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.1
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.24.2
[pip3] pytorch-lightning==1.9.1
[pip3] torch==1.13.1
[pip3] torch-cluster==1.6.0+pt113cu117
[pip3] torch-geometric==2.2.0
[pip3] torch-scatter==2.1.0+pt113cu117
[pip3] torch-sparse==0.6.16+pt113cu117
[pip3] torch-spline-conv==1.2.1+pt113cu117
[pip3] torchmetrics==0.11.1
[pip3] torchvision==0.14.1
[conda] blas                      1.0                         mkl    defaults
[conda] mkl                       2022.1.0           hc2b9512_224    defaults
[conda] numpy                     1.24.2                   pypi_0    pypi
[conda] pytorch                   1.13.1          py3.8_cuda11.7_cudnn8.5.0_0    pytorch
[conda] pytorch-cuda              11.7                 h67b0de4_1    pytorch
[conda] pytorch-lightning         1.9.1                    pypi_0    pypi
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] torch-cluster             1.6.0+pt113cu117          pypi_0    pypi
[conda] torch-geometric           2.2.0                    pypi_0    pypi
[conda] torch-scatter             2.1.0+pt113cu117          pypi_0    pypi
[conda] torch-sparse              0.6.16+pt113cu117          pypi_0    pypi
[conda] torch-spline-conv         1.2.1+pt113cu117          pypi_0    pypi
[conda] torchmetrics              0.11.1                   pypi_0    pypi
[conda] torchvision               0.14.1                   pypi_0    pypi
xw285cornell commented 1 year ago

It looks like CUDA 12 will work for me (it fails with 11.4). @ptrblck, do you know of any fixes between 11.7 and 12 that could fix this problem? cc @mikekgfb

ptrblck commented 1 year ago

do you know of any fixes between 11.7 and 12 that could fix this problem?

@xw285cornell @mikekgfb No, since none of the tests mentioned in this issue fails for 2.0.1+cu117 on a multi-A100 system and I'm unable to reproduce it. Note that I haven't checked the 11.8 or 12.1 binaries, as no failures were reported there at all.

shyringo commented 3 months ago

CUDA 12.1, multi-A100, facing the same bug.

shyringo commented 3 months ago

CUDA 12.1, multi-A100, facing the same bug.

Update: it turns out this was my own mistake. I used 3 nodes with 6 A100 GPUs each, but set CUDA_VISIBLE_DEVICES to 0,1,2,3,4,5,6,7. When I set CUDA_VISIBLE_DEVICES to 0,1,2,3,4,5, the bug did not appear.
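A quick per-node sanity check along these lines (a sketch):

import os
import torch

visible = os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>")
print("CUDA_VISIBLE_DEVICES =", visible)
print("torch.cuda.device_count() =", torch.cuda.device_count())
# In the comment above, a CUDA_VISIBLE_DEVICES list that didn't match the real
# per-node GPU count surfaced as the cuBLAS error; matching the real count avoided it.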

Yiqiu-Zhang commented 2 months ago

I just set bias=False in a linear layer and the problem was solved.
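In code, that workaround looks roughly like this (a sketch; the layer sizes are placeholders, and whether dropping the bias changes the cuBLAS path taken is an assumption):

import torch.nn as nn

layer = nn.Linear(in_features=128, out_features=64, bias=False)  # sizes are placeholders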