Open rosario-purple opened 5 months ago
CUDA 12.2 works for me with PyTorch 2.2 and the same Python 3.10.13.
This works on Ubuntu 22.04, installed via docker pull ubuntu:22.04
torch install:
pip install torch
Collecting torch
Downloading torch-2.2.0-cp310-cp310-manylinux1_x86_64.whl.metadata (25 kB)
Collecting filelock (from torch)
Downloading filelock-3.13.1-py3-none-any.whl.metadata (2.8 kB)
Collecting typing-extensions>=4.8.0 (from torch)
Downloading typing_extensions-4.9.0-py3-none-any.whl.metadata (3.0 kB)
Collecting sympy (from torch)
Downloading sympy-1.12-py3-none-any.whl (5.7 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.7/5.7 MB 88.1 MB/s eta 0:00:00
Collecting networkx (from torch)
Downloading networkx-3.2.1-py3-none-any.whl.metadata (5.2 kB)
Collecting jinja2 (from torch)
Downloading Jinja2-3.1.3-py3-none-any.whl.metadata (3.3 kB)
Collecting fsspec (from torch)
Downloading fsspec-2024.2.0-py3-none-any.whl.metadata (6.8 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch)
Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.7/23.7 MB 82.9 MB/s eta 0:00:00
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch)
Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 823.6/823.6 kB 60.2 MB/s eta 0:00:00
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch)
Downloading nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.1/14.1 MB 123.7 MB/s eta 0:00:00
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch)
Downloading nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch)
Downloading nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl (410.6 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 410.6/410.6 MB 10.1 MB/s eta 0:00:00
Collecting nvidia-cufft-cu12==11.0.2.54 (from torch)
Downloading nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64.whl (121.6 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 121.6/121.6 MB 31.5 MB/s eta 0:00:00
Collecting nvidia-curand-cu12==10.3.2.106 (from torch)
Downloading nvidia_curand_cu12-10.3.2.106-py3-none-manylinux1_x86_64.whl (56.5 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.5/56.5 MB 57.2 MB/s eta 0:00:00
Collecting nvidia-cusolver-cu12==11.4.5.107 (from torch)
Downloading nvidia_cusolver_cu12-11.4.5.107-py3-none-manylinux1_x86_64.whl (124.2 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 124.2/124.2 MB 30.5 MB/s eta 0:00:00
Collecting nvidia-cusparse-cu12==12.1.0.106 (from torch)
Downloading nvidia_cusparse_cu12-12.1.0.106-py3-none-manylinux1_x86_64.whl (196.0 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 196.0/196.0 MB 20.2 MB/s eta 0:00:00
Collecting nvidia-nccl-cu12==2.19.3 (from torch)
Downloading nvidia_nccl_cu12-2.19.3-py3-none-manylinux1_x86_64.whl.metadata (1.8 kB)
Collecting nvidia-nvtx-cu12==12.1.105 (from torch)
Downloading nvidia_nvtx_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (99 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 99.1/99.1 kB 8.5 MB/s eta 0:00:00
Collecting triton==2.2.0 (from torch)
Downloading triton-2.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.4 kB)
Collecting nvidia-nvjitlink-cu12 (from nvidia-cusolver-cu12==11.4.5.107->torch)
Downloading nvidia_nvjitlink_cu12-12.3.101-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting MarkupSafe>=2.0 (from jinja2->torch)
Downloading MarkupSafe-2.1.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.0 kB)
Collecting mpmath>=0.19 (from sympy->torch)
Downloading mpmath-1.3.0-py3-none-any.whl (536 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 kB 47.2 MB/s eta 0:00:00
Downloading torch-2.2.0-cp310-cp310-manylinux1_x86_64.whl (755.5 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 755.5/755.5 MB 3.9 MB/s eta 0:00:00
Downloading nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl (731.7 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 731.7/731.7 MB 4.1 MB/s eta 0:00:00
Downloading nvidia_nccl_cu12-2.19.3-py3-none-manylinux1_x86_64.whl (166.0 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 166.0/166.0 MB 25.0 MB/s eta 0:00:00
Downloading triton-2.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (167.9 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 167.9/167.9 MB 23.8 MB/s eta 0:00:00
Downloading typing_extensions-4.9.0-py3-none-any.whl (32 kB)
Downloading filelock-3.13.1-py3-none-any.whl (11 kB)
Downloading fsspec-2024.2.0-py3-none-any.whl (170 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 170.9/170.9 kB 16.7 MB/s eta 0:00:00
Downloading Jinja2-3.1.3-py3-none-any.whl (133 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 133.2/133.2 kB 13.9 MB/s eta 0:00:00
Downloading networkx-3.2.1-py3-none-any.whl (1.6 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 97.7 MB/s eta 0:00:00
Downloading MarkupSafe-2.1.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25 kB)
Downloading nvidia_nvjitlink_cu12-12.3.101-py3-none-manylinux1_x86_64.whl (20.5 MB)
Python 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'2.2.0+cu121'
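The pip log above shows torch 2.2.0 pulling in nvidia-nccl-cu12==2.19.3. A minimal sketch for checking whether the NCCL wheel in an environment still matches what torch asks for, using only the standard library; the helper names `pinned_version` and `check_nccl_pin` are made up for this example:

```python
import re
from importlib import metadata


def pinned_version(requirements, package):
    """Return the version pinned with '==' for `package`, or None if it is not pinned."""
    pattern = re.compile(r"^\s*" + re.escape(package) + r"\s*==\s*([^\s;]+)")
    for req in requirements or []:
        match = pattern.match(req)
        if match:
            return match.group(1)
    return None


def check_nccl_pin(dist="torch", package="nvidia-nccl-cu12"):
    """Compare the NCCL wheel version torch declares with the one installed."""
    try:
        wanted = pinned_version(metadata.requires(dist), package)
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None  # torch or the NCCL wheel is not installed in this environment
    return wanted, installed, wanted == installed


if __name__ == "__main__":
    print(check_nccl_pin())
```

This only inspects pip metadata. It says nothing about which libnccl.so.2 the dynamic loader actually picks up at import time.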
ldd libtorch_cuda.so
linux-vdso.so.1 (0x00007ffc415fa000)
libc10_cuda.so => /root/miniconda3/lib/python3.10/site-packages/torch/lib/./libc10_cuda.so (0x00007f6d4a04e000)
libcudart.so.12 => /root/miniconda3/lib/python3.10/site-packages/torch/lib/./../../nvidia/cuda_runtime/lib/libcudart.so.12 (0x00007f6d49c00000)
libcusparse.so.12 => /root/miniconda3/lib/python3.10/site-packages/torch/lib/./../../nvidia/cusparse/lib/libcusparse.so.12 (0x00007f6d39c00000)
libcufft.so.11 => /root/miniconda3/lib/python3.10/site-packages/torch/lib/./../../nvidia/cufft/lib/libcufft.so.11 (0x00007f6d2e000000)
libcusparseLt-f8b4a9fb.so.0 => /root/miniconda3/lib/python3.10/site-packages/torch/lib/./libcusparseLt-f8b4a9fb.so.0 (0x00007f6d2bc00000)
libnvToolsExt.so.1 => /root/miniconda3/lib/python3.10/site-packages/torch/lib/./../../nvidia/nvtx/lib/libnvToolsExt.so.1 (0x00007f6d2b800000)
libcurand.so.10 => /root/miniconda3/lib/python3.10/site-packages/torch/lib/./../../nvidia/curand/lib/libcurand.so.10 (0x00007f6d25200000)
libcublas.so.12 => /root/miniconda3/lib/python3.10/site-packages/torch/lib/./../../nvidia/cublas/lib/libcublas.so.12 (0x00007f6d1e800000)
libcublasLt.so.12 => /root/miniconda3/lib/python3.10/site-packages/torch/lib/./../../nvidia/cublas/lib/libcublasLt.so.12 (0x00007f6cfc800000)
libcudnn.so.8 => /root/miniconda3/lib/python3.10/site-packages/torch/lib/./../../nvidia/cudnn/lib/libcudnn.so.8 (0x00007f6cfc400000)
libnccl.so.2 => /root/miniconda3/lib/python3.10/site-packages/torch/lib/./../../nvidia/nccl/lib/libnccl.so.2 (0x00007f6cefa00000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f6d4a043000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f6d4a03c000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f6d4a037000)
libc10.so => /root/miniconda3/lib/python3.10/site-packages/torch/lib/./libc10.so (0x00007f6d49f39000)
libtorch_cpu.so => /root/miniconda3/lib/python3.10/site-packages/torch/lib/./libtorch_cpu.so (0x00007f6cd85d1000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f6d49b19000)
libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f6cd83a7000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f6d49f19000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f6cd817f000)
/lib64/ld-linux-x86-64.so.2 (0x00007f6d7c6f6000)
libnvJitLink.so.12 => /root/miniconda3/lib/python3.10/site-packages/torch/lib/./../../nvidia/cusparse/lib/../../nvjitlink/lib/libnvJitLink.so.12 (0x00007f6cd4c00000)
libgomp-a34b3233.so.1 => /root/miniconda3/lib/python3.10/site-packages/torch/lib/./libgomp-a34b3233.so.1 (0x00007f6cd4800000)
libcupti.so.12 => /root/miniconda3/lib/python3.10/site-packages/torch/lib/./../../nvidia/cuda_cupti/lib/libcupti.so.12 (0x00007f6cd3e00000)
libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007f6d49f10000)
@rosario-purple could you please run `ldd /scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so` on the machine where you are seeing this issue?
I guess there isn't much one can do here other than mark torch as incompatible with msccl-executor-nccl. One can override it with LD_LIBRARY_PATH, though.
Another solution would be to update the NCCL binaries bundled inside msccl-executor-nccl with the ones shipped with PyTorch (not sure it will work, but perhaps worth trying, as NCCL should be forward compatible).
cc @ptrblck
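Since LD_LIBRARY_PATH comes up as a possible override, here is a small diagnostic sketch that lists NCCL libraries reachable through LD_PRELOAD or LD_LIBRARY_PATH. It does not reproduce the dynamic loader's full search order (RPATH/RUNPATH and the ldconfig cache are ignored), and the function name `nccl_overrides` is an assumption of this example:

```python
import os
from pathlib import Path


def nccl_overrides(env=None):
    """List libnccl files that LD_PRELOAD or LD_LIBRARY_PATH could bring into a process."""
    env = os.environ if env is None else env
    found = []
    # LD_PRELOAD entries may be separated by spaces or colons.
    for entry in env.get("LD_PRELOAD", "").replace(":", " ").split():
        if "nccl" in os.path.basename(entry):
            found.append(("LD_PRELOAD", entry))
    # LD_LIBRARY_PATH is a colon-separated list of directories.
    for directory in env.get("LD_LIBRARY_PATH", "").split(":"):
        if not directory or not os.path.isdir(directory):
            continue
        for lib in sorted(Path(directory).glob("libnccl.so*")):
            found.append(("LD_LIBRARY_PATH", str(lib)))
    return found


if __name__ == "__main__":
    for source, path in nccl_overrides():
        print(f"{source}: {path}")
```

An empty result suggests neither variable points at another NCCL build, but it does not rule out a library that another Python package loads before torch.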
My best guess is that this is because I have MS-AMP installed (https://github.com/Azure/MS-AMP) which is pinned to an older version of NCCL (https://github.com/Azure/msccl-executor-nccl version 2.17.1), while PyTorch 2.2 depends on a newer version (NCCL 2.19.3).
That is exactly why it fails with an undefined symbol: ncclCommRegister was introduced in NCCL v2.19 and has been used in PyTorch since November (https://github.com/pytorch/pytorch/commit/ab1f6d58bc57faa89b74b98a27fc38e90abf8520).
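To check a specific NCCL binary for this symbol without importing torch, something like the following sketch may help. It loads the library with ctypes, calls `ncclGetVersion`, and tests whether `ncclCommRegister` is exported. The version decoding follows my understanding of the `NCCL_VERSION` macro in nccl.h; the helper names and any path passed in are assumptions of this example:

```python
import ctypes


def decode_nccl_version(code):
    """Turn the integer reported by ncclGetVersion into (major, minor, patch)."""
    if code < 10000:  # releases up to 2.8 encode major*1000 + minor*100 + patch
        return code // 1000, (code % 1000) // 100, code % 100
    return code // 10000, (code % 10000) // 100, code % 100


def inspect_nccl(path):
    """Load a libnccl shared object; return its version and whether ncclCommRegister is exported."""
    lib = ctypes.CDLL(path)
    version = ctypes.c_int()
    status = lib.ncclGetVersion(ctypes.byref(version))
    if status != 0:  # 0 is ncclSuccess
        raise RuntimeError(f"ncclGetVersion failed with status {status}")
    # Attribute lookup on a CDLL fails when the symbol is missing, so hasattr works here.
    return decode_nccl_version(version.value), hasattr(lib, "ncclCommRegister")
```

Calling `inspect_nccl` on the `libnccl.so.2` that a given package ships would then show whether that build predates 2.19 and lacks the symbol.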
Yes, this should be communicated to as many users as possible. From what I can see, it breaks every torch import when the NCCL being loaded is older than 2.19, and those versions are still commonly used.
Also, NCCL recently had a bug affecting parallel training with torch, and one fix was to upgrade NCCL, so users may have upgraded NCCL but still be linking against the wrong library.
Please write a guide for users on resolving NCCL-related issues, thank you!
@atalman Sure, here's the output:
(brr) alyssavance@7e72bd4e-01:/scratch/brr$ ldd /scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so
linux-vdso.so.1 (0x00007ffc5f7c1000)
libc10_cuda.so => /scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/lib/libc10_cuda.so (0x00001541cbd1a000)
libcudart.so.12 => /scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/lib/../../nvidia/cuda_runtime/lib/libcudart.so.12 (0x00001541cba00000)
libcusparse.so.12 => /scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12 (0x00001541bba00000)
libcurand.so.10 => /scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/lib/../../nvidia/curand/lib/libcurand.so.10 (0x00001541b5400000)
libcufft.so.11 => /scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/lib/../../nvidia/cufft/lib/libcufft.so.11 (0x00001541a9800000)
libnvToolsExt.so.1 => /scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/lib/../../nvidia/nvtx/lib/libnvToolsExt.so.1 (0x00001541a9400000)
libcudnn.so.8 => /scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn.so.8 (0x00001541a9000000)
libnccl.so.2 => /scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/lib/../../nvidia/nccl/lib/libnccl.so.2 (0x0000154198e00000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00001541cbd07000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00001541cbd02000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00001541cbcfd000)
libc10.so => /scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/lib/libc10.so (0x00001541cb922000)
libtorch_cpu.so => /scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so (0x0000154181ee8000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00001541bb919000)
libcublas.so.12 => /scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/lib/../../nvidia/cublas/lib/libcublas.so.12 (0x000015417b600000)
libcublasLt.so.12 => /scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/lib/../../nvidia/cublas/lib/libcublasLt.so.12 (0x0000154159600000)
libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00001541593d4000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00001541cbcd9000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00001541591ab000)
/lib64/ld-linux-x86-64.so.2 (0x00001541f765b000)
libnvJitLink.so.12 => /scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/lib/../../nvidia/cusparse/lib/../../nvjitlink/lib/libnvJitLink.so.12 (0x0000154156000000)
libgomp-a34b3233.so.1 => /scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/lib/libgomp-a34b3233.so.1 (0x0000154155c00000)
libcupti.so.12 => /scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/lib/../../nvidia/cuda_cupti/lib/libcupti.so.12 (0x0000154155200000)
libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00001541cbcd2000)
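Reading long `ldd` listings like the one above by eye is error-prone. A minimal sketch that parses the output and reports where `libnccl.so.2` resolved; the function names are made up for this example, and it assumes the usual `name => path (0xaddr)` line format:

```python
import re

# Matches lines such as "libnccl.so.2 => /path/to/libnccl.so.2 (0x0000154198e00000)".
LDD_LINE = re.compile(r"^\s*(\S+)\s+=>\s+(\S+)\s+\(0x[0-9a-fA-F]+\)\s*$")


def parse_ldd(output):
    """Map each library name in ldd output to the path it resolved to."""
    resolved = {}
    for line in output.splitlines():
        match = LDD_LINE.match(line)
        if match:
            resolved[match.group(1)] = match.group(2)
    return resolved


def nccl_is_bundled(output):
    """True if libnccl.so.2 resolves to the pip wheel under site-packages/nvidia/nccl."""
    path = parse_ldd(output).get("libnccl.so.2", "")
    return "/nvidia/nccl/lib/" in path
```

Note that `ldd` reflects resolution in the shell's environment only. In the listing above NCCL resolves to the bundled wheel, so a clean result here does not by itself rule out a different NCCL already being loaded in the Python process.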
This is highly problematic, as NVIDIA provides NCCL 2.19 rpms for RHEL8 only for CUDA 12.2 and above, while PyTorch binaries are built for CUDA 12.1: https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/
NCCL uses CUDA 12.2 to build its binaries and statically links the CUDART to them. This is a common approach and will not cause any incompatibilities. In PyTorch we are depending on the NCCL PyPI wheel using the same toolchain. Could you explain why it's highly problematic?
I'm building a container with PyTorch, and I've always kept the CUDA rpms at the same version as the one the PyTorch binaries were linked against. I just assumed it would cause problems if PyTorch itself were linked against a different CUDA version.
In fact I had some problems after switching to CUDA 12.2, but now it turns out this was an unrelated thing. So maybe it will work...
This issue does not usually happen, because the system CUDA can generally also handle the CUDA version torch was linked against. With CUDA 12.2, though, some functions that torch uses may not be found.
Hello all, I had the same problem myself and am posting this to hopefully help anyone with a similar issue. For context, I'm running an Nvidia 4070 Ti Super GPU on my Windows workstation, which has CUDA 12.4; this is supposed to be the latest version. I'm using Ubuntu 22.04 under WSL2. I tried uninstalling and reinstalling PyTorch with pip, to no avail. Every time I tried to import PyTorch in Python, I got this error:
>>> import torch
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/user/.local/lib/python3.10/site-packages/torch/__init__.py", line 237, in <module>
from torch._C import * # noqa: F403
ImportError: /home/user/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so: undefined symbol: ncclCommRegister
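I am aware that PyTorch is currently built for CUDA 12.1, but I got it to work after some hours of troubleshooting. Here is what ultimately worked for me:
1. First, uninstall all the PyTorch packages using pip. Do this both with and without `sudo`:
sudo pip3 uninstall -y torch torchvision torchaudio
pip3 uninstall -y torch torchvision torchaudio
pip3 cache purge
2. Install NCCL (NVIDIA Collective Communications Library) for CUDA 12.4. This is NCCL 2.20.5, which was released on March 5th, 2024. You can find it on the Nvidia website: https://developer.nvidia.com/nccl/nccl-download. Run the commands for the Network Install.
3. Next, install Nvidia cuDNN. Even if you think you already have it, do the steps again. See [Nvidia's cuDNN download page](https://developer.nvidia.com/cudnn-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_network) for instructions.
4. Finally, the last but most important step is to reinstall PyTorch, this time using the nightly build to get the latest version:
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121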
At the time of writing, PyTorch is working for me on CUDA 12.4. Here's what it might look like:
import torch
import torchvision
import torchaudio
print(torch.__version__)
print(torchvision.__version__)
print(torchaudio.__version__)
print(torch.cuda.is_available())
Output:
2.4.0.dev20240326+cu121
0.19.0.dev20240327+cu121
2.2.0.dev20240327+cu121
True
Wishing everyone the best! And hopefully PyTorch would provide a stable version for CUDA 12.4 users. Happy coding.
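The `+cu121` suffix in the version strings above names the CUDA toolkit the wheel was built against, not the CUDA version the driver reports, which is why a cu121 build can run on a machine showing CUDA 12.4. A small sketch for pulling that tag apart; the helper names are made up for this example:

```python
def split_torch_version(version):
    """Split a torch version string into (release, build tag or None)."""
    release, _, tag = version.partition("+")
    return release, tag or None


def cuda_from_tag(tag):
    """Turn a tag like 'cu121' into (12, 1); return None for anything else."""
    if not tag or not tag.startswith("cu"):
        return None
    digits = tag[2:]
    if len(digits) < 2 or not digits.isdigit():
        return None
    # The last digit is the minor version, the rest is the major version.
    return int(digits[:-1]), int(digits[-1])


if __name__ == "__main__":
    release, tag = split_torch_version("2.4.0.dev20240326+cu121")
    print(release, cuda_from_tag(tag))
```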
I encountered the same problem and successfully used the method you provided. Thank you
Thanks for your contribution, it works
🐛 Describe the bug
When I upgrade to PyTorch 2.2 via pip, importing torch fails with an undefined symbol error (undefined symbol: ncclCommRegister).
Downgrading to Torch 2.1.2 fixed the problem. My best guess is that this is because I have MS-AMP installed (https://github.com/Azure/MS-AMP) which is pinned to an older version of NCCL (https://github.com/Azure/msccl-executor-nccl version 2.17.1), while PyTorch 2.2 depends on a newer version (NCCL 2.19.3).
Versions
Collecting environment information...
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.22.1
Libc version: glibc-2.35

Python version: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-91-generic-x86_64-with-glibc2.35
Is CUDA available: N/A
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB
GPU 4: NVIDIA A100-SXM4-80GB
GPU 5: NVIDIA A100-SXM4-80GB
GPU 6: NVIDIA A100-SXM4-80GB
GPU 7: NVIDIA A100-SXM4-80GB

Nvidia driver version: 545.23.08
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.3
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.3
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.3
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.3
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.3
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.3
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.3
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 96
On-line CPU(s) list: 0-95
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz
CPU family: 6
Model: 106
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 2
Stepping: 6
BogoMIPS: 4000.04
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid fsrm md_clear arch_capabilities
Virtualization: VT-x
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 3 MiB (96 instances)
L1i cache: 3 MiB (96 instances)
L2 cache: 192 MiB (48 instances)
L3 cache: 32 MiB (2 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-47
NUMA node1 CPU(s): 48-95
Vulnerability Gather data sampling: Unknown: Dependent on hypervisor status
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Mitigation; TSX disabled

Versions of relevant libraries:
[pip3] flake8==6.1.0
[pip3] numpy==1.24.4
[pip3] numpyro==0.9.2
[pip3] torch==2.2.0
[pip3] torchaudio==2.2.0
[pip3] torchvision==0.17.0
[pip3] triton==2.2.0
[conda] numpy 1.24.4 pypi_0 pypi
[conda] numpyro 0.9.2 pypi_0 pypi
[conda] torch 2.2.0 pypi_0 pypi
[conda] torchaudio 2.2.0 pypi_0 pypi
[conda] torchvision 0.17.0 pypi_0 pypi
[conda] triton 2.2.0 pypi_0 pypi
cc @seemethere @malfet @osalpekar @atalman