Flamefire opened 1 year ago
With some experimentation I found that swapping CUDA 11.4 to 11.5.0 resolves the failure. Is this expected?
Can you repro it with PyTorch-1.13.x? Also, mind running `python -m torch.utils.collect_env`? (I think the GPU types you are testing on are crucial to the problem, as this test was run extensively in CI, but only on an sm_52 GPU.)
> With some experimentation I found that swapping CUDA 11.4 to 11.5.0 resolves the failure. Is this expected?
We haven't seen this issue in the past on devices we are testing, but I also don't know which GPUs you are using. As @malfet mentioned: could you use the latest stable release with CUDA 11.7 or the nightly with 11.8 and check if you are still seeing the error?
> Can you repro it with PyTorch-1.13.x? Also, mind running `python -m torch.utils.collect_env`? (I think the GPU types you are testing on are crucial to the problem, as this test was run extensively in CI, but only on an sm_52 GPU.)
Not yet; this is in an HPC environment where we need to build PyTorch from source, and I have only just got 1.12 running, so 1.13 will be some more work. I can try tomorrow whether I can repro this with the PyPI 1.12 version and then test the PyPI 1.13.
The GPUs are A100s. I found a place where it might enable TF32 on A100s before calling cublasDgemv, but that code path is not used (I also removed that part completely, so that surely isn't the issue).
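(For reference, a minimal sketch of how one could rule TF32 in or out here; these are the standard PyTorch TF32 switches, and the shapes are made up:)

```python
import torch

# TF32 only affects float32 matmuls/convolutions on Ampere, so it should not
# matter for the float64 path behind cublasDgemv, but disabling it makes that
# explicit (both flags are available in PyTorch 1.12):
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False

A = torch.randn(64, 64, dtype=torch.float64, device="cuda")
x = torch.randn(64, dtype=torch.float64, device="cuda")
y = torch.mv(A, x)  # float64 matrix-vector product, handled by cuBLAS gemv
```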
It's also strange that the failure doesn't seem to be related to the non-tensor inputs (i.e. sizes, alpha, beta, strides, etc.): I logged those, and some calls with the same sizes/strides work while others don't.
I can imagine that in the DataParallel case some tensors reside on the "wrong" GPU, which is why it fails. But I have no idea how to get the tensor's GPU at the point of the call.
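(Not from the thread, but one way to see which GPU the tensors are on at that point is a forward pre-hook on the replicated modules; DataParallel shares registered hooks with its replicas, so the hook should fire once per GPU. The module and names below are only illustrative:)

```python
import torch
import torch.nn as nn

def log_placement(module, inputs):
    # For each replica, DataParallel should have moved the parameters and the
    # corresponding input chunk to the same GPU; print both to verify.
    param_devs = {name: str(p.device) for name, p in module.named_parameters()}
    input_devs = [str(t.device) for t in inputs if isinstance(t, torch.Tensor)]
    print(f"{module.__class__.__name__}: params={param_devs} inputs={input_devs}")

model = nn.DataParallel(nn.Linear(32, 1).double().cuda())
for m in model.modules():
    if isinstance(m, nn.Linear):
        m.register_forward_pre_hook(log_placement)

model(torch.randn(16, 32, dtype=torch.float64, device="cuda:0"))
```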
> could you use the latest stable release with CUDA 11.7 or the nightly with 11.8 and check if you are still seeing the error?
As mentioned, I tried with CUDA 11.5 and it worked. Another toolchain (GCC 11.3, Python 3.10) using CUDA 11.7 also worked. So it might be related to CUDA 11.4; I suspect PyTorch uses some CUDA 11.5+ specifics which are not supported in 11.4.
> I can try tomorrow whether I can repro this with the PyPI 1.12 version and then test the PyPI 1.13.
That failed: 1.12 isn't built with sm_80 (required for Ampere GPUs), and 1.13 requires CUDA >= 11.5 due to a dependency on libcublasLt.so.11, and it also bundles CUDA. So I cannot test PyTorch 1.13 with CUDA 11.4 (currently, at least).
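(For anyone checking their own build, PyTorch exposes the compiled architectures and toolkit version directly:)

```python
import torch

print(torch.version.cuda)                   # CUDA toolkit the build was compiled with
print(torch.cuda.get_arch_list())           # compiled SM architectures, e.g. [..., 'sm_80']
print(torch.cuda.get_device_capability(0))  # (8, 0) on an A100
```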
If you can, move to CUDA 11.7/11.8 instead of 11.4 (you don't need to update your driver for that). PyTorch is not using any 11.5+ specifics, but the underlying cuBLAS functions might have bug fixes in 11.5+.
I tried on CUDA 11.7 and the issue still persists:
Collecting environment information...
PyTorch version: 1.13.1
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.27
Python version: 3.8.16 (default, Jan 17 2023, 23:13:24) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-3.10.0-1127.el7.x86_64-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 11.7.99
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB
GPU 4: NVIDIA A100-SXM4-80GB
GPU 5: NVIDIA A100-SXM4-80GB
GPU 6: NVIDIA A100-SXM4-80GB
GPU 7: NVIDIA A100-SXM4-80GB
Nvidia driver version: 515.43.04
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.1
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.24.2
[pip3] pytorch-lightning==1.9.1
[pip3] torch==1.13.1
[pip3] torch-cluster==1.6.0+pt113cu117
[pip3] torch-geometric==2.2.0
[pip3] torch-scatter==2.1.0+pt113cu117
[pip3] torch-sparse==0.6.16+pt113cu117
[pip3] torch-spline-conv==1.2.1+pt113cu117
[pip3] torchmetrics==0.11.1
[pip3] torchvision==0.14.1
[conda] blas 1.0 mkl defaults
[conda] mkl 2022.1.0 hc2b9512_224 defaults
[conda] numpy 1.24.2 pypi_0 pypi
[conda] pytorch 1.13.1 py3.8_cuda11.7_cudnn8.5.0_0 pytorch
[conda] pytorch-cuda 11.7 h67b0de4_1 pytorch
[conda] pytorch-lightning 1.9.1 pypi_0 pypi
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torch-cluster 1.6.0+pt113cu117 pypi_0 pypi
[conda] torch-geometric 2.2.0 pypi_0 pypi
[conda] torch-scatter 2.1.0+pt113cu117 pypi_0 pypi
[conda] torch-sparse 0.6.16+pt113cu117 pypi_0 pypi
[conda] torch-spline-conv 1.2.1+pt113cu117 pypi_0 pypi
[conda] torchmetrics 0.11.1 pypi_0 pypi
[conda] torchvision 0.14.1 pypi_0 pypi
It looks like CUDA 12 will work for me (it fails with 11.4). @ptrblck do you know of any fixes between 11.7 and 12 that could fix this problem? cc @mikekgfb
> do you know of any fixes between 11.7 and 12 that could fix this problem?
@xw285cornell @mikekgfb No, since none of the tests mentioned in this issue fails for 2.0.1+cu117 on a multi-A100 system, and I'm unable to reproduce it. Note that I haven't checked the 11.8 or 12.1 binaries, as no failures were reported there at all.
CUDA 12.1, multi-A100, facing the same bug.
> CUDA 12.1, multi-A100, facing the same bug.
Update: it turns out it was my own mistake. I used 3 nodes with 6 A100 GPUs each, but set CUDA_VISIBLE_DEVICES to 0,1,2,3,4,5,6,7. When I set CUDA_VISIBLE_DEVICES to 0,1,2,3,4,5, the bug did not appear.
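(A quick sanity check for this kind of setup; the values are specific to the 3-node, 6-GPU-per-node case described above:)

```python
import os
import torch

# Must be set before the first CUDA call in the process; on a 6-GPU node,
# indices 6 and 7 do not correspond to real devices.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,1,2,3,4,5")

# Verify that the per-process view matches the hardware on this node.
print(torch.cuda.device_count())  # expected: 6
```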
I just set bias=False in the Linear layer and the problem was solved.
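(For anyone else hitting this, the workaround above looks roughly like the following; whether the bias addition is really what routes the call into the failing cublasDgemv path is only a guess:)

```python
import torch.nn as nn

# Workaround: drop the additive bias so the layer is a plain matmul.
layer = nn.Linear(in_features=128, out_features=1, bias=False)
```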
🐛 Describe the bug

When running the distributed/test_data_parallel or test_nn test, it fails with an error, and I found that test_nn uses torch.nn.DataParallel when multiple GPUs are present, so I assume it is the same issue. That error isn't listed in the NVIDIA docs for cublasDgemv, so I don't know why it fails.

Versions
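(A rough sketch of the kind of call the failing tests exercise, under the assumption that it is the float64 DataParallel path that ends up in cublasDgemv; this is not the actual test code:)

```python
import torch
import torch.nn as nn

# Requires >= 2 GPUs; DataParallel splits the batch across them.
model = nn.DataParallel(nn.Linear(32, 1).double().cuda())
x = torch.randn(16, 32, dtype=torch.float64, device="cuda:0")
loss = model(x).sum()
loss.backward()
```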
cc @ngimel @csarofeen @ptrblck @xwang233