tingyu66 opened 5 months ago
To reproduce:
```
docker run --gpus all --rm -it --network=host rapidsai/citestwheel:cuda11.8.0-ubuntu20.04-py3.9 bash
pip install cupy-cuda11x
pip install torch==2.3.0 --index-url https://download.pytorch.org/whl/cu118
python -c "import cupy; import torch"
```
I feel something is wrong in your container in a way that I haven't fully understood. CuPy lazy-loads all CUDA libraries, so `import cupy` does not trigger the loading of libnccl. Something else does (but I can't tell why).

The only CuPy module that links to libnccl is `cupy_backends.cuda.libs.nccl`, as can be confirmed as follows:
```
root@marie:/# for f in $(find / -type f,l -regex '/pyenv/**/.*.so'); do readelf -d $f | grep "nccl.so"; if [[ $? -eq 0 ]]; then echo $f; fi; done
 0x0000000000000001 (NEEDED)  Shared library: [libnccl.so.2]
/pyenv/versions/3.9.19/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so
 0x0000000000000001 (NEEDED)  Shared library: [libnccl.so.2]
/pyenv/versions/3.9.19/lib/python3.9/site-packages/cupy_backends/cuda/libs/nccl.cpython-39-x86_64-linux-gnu.so
```
Now, if you monitor the loaded DSOs, you'll see this module (`nccl.cpython-39-x86_64-linux-gnu.so`) is actually not loaded (by design), but libnccl still gets loaded:
```
root@marie:/# LD_DEBUG=libs python -c "import cupy" 2>&1 | grep nccl
6505: find library=libnccl.so.2.16.2 [0]; searching
6505: trying file=/pyenv/versions/3.9.19/lib/libnccl.so.2.16.2
6505: trying file=/lib/x86_64-linux-gnu/libnccl.so.2.16.2
6505: calling init: /lib/x86_64-linux-gnu/libnccl.so.2.16.2
6505: calling fini: /lib/x86_64-linux-gnu/libnccl.so.2.16.2 [0]
```
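The same behavior can also be observed from inside the process without `LD_DEBUG`. Below is a small sketch of my own (not a command from this thread), assuming a Linux environment with the wheels installed as above:

```python
# Sketch (illustrative): `import cupy` does not import the NCCL extension
# module, yet a libnccl DSO can already be mapped into the process.
# On Linux, /proc/self/maps lists every mapped shared object.
import sys

import cupy  # noqa: F401

print("nccl extension imported:",
      "cupy_backends.cuda.libs.nccl" in sys.modules)

with open("/proc/self/maps") as f:
    nccl_paths = {line.split()[-1] for line in f
                  if "libnccl" in line and line.split()[-1].startswith("/")}
print("mapped libnccl files:", nccl_paths or "none")
```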
What's worse, when the import order is swapped, two distinct copies of libnccl are loaded: one from the system (as shown above) and the other from the NCCL wheel:
```
root@marie:/# LD_DEBUG=libs python -c "import torch; import cupy" 2>&1 | grep "calling init:.*nccl"
6787: calling init: /pyenv/versions/3.9.19/lib/python3.9/site-packages/torch/lib/../../nvidia/nccl/lib/libnccl.so.2
6787: calling init: /lib/x86_64-linux-gnu/libnccl.so.2.16.2
```
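With the swapped import order, a quick look at `/proc/self/maps` (again my own sketch, not from the thread) should likewise show both copies mapped at once:

```python
# Sketch: with `import torch` before `import cupy`, the process can end up
# with two distinct libnccl copies mapped (the wheel's and the system's).
import torch  # noqa: F401  -- loads libnccl.so.2 from the nvidia-nccl wheel
import cupy   # noqa: F401  -- CuPy's preload then dlopens libnccl.so.2.16.2

with open("/proc/self/maps") as f:
    paths = sorted({line.split()[-1] for line in f
                    if "libnccl" in line and line.split()[-1].startswith("/")})
print("\n".join(paths))  # in this container, two different libnccl paths
```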
So I am not sure what I'm looking at 🤷
Question for @tingyu66: If this container is owned/controlled by RAPIDS, can't you just remove the system NCCL?
@leofang I just took another look and I think I found the culprit: https://github.com/cupy/cupy/blob/a54b7abfed668e52de7f3eee7b3fe8ccaef34874/cupy/_environment.py#L270-L274

For wheel builds, `cupy._environment` loads specific CUDA library versions defined in the `.data/_wheel.json` file.
```
root@1cc5aab-lcedt:~# cat /pyenv/versions/3.9.19/lib/python3.9/site-packages/cupy/.data/_wheel.json
{"cuda": "11.x", "packaging": "pip", "cutensor": {"version": "2.0.1", "filenames": ["libcutensor.so.2.0.1"]}, "nccl": {"version": "2.16.2", "filenames": ["libnccl.so.2.16.2"]}, "cudnn": {"version": "8.8.1", "filenames": ["libcudnn.so.8.8.1", "libcudnn_ops_infer.so.8.8.1", "libcudnn_ops_train.so.8.8.1", "libcudnn_cnn_infer.so.8.8.1", "libcudnn_cnn_train.so.8.8.1", "libcudnn_adv_infer.so.8.8.1", "libcudnn_adv_train.so.8.8.1"]}}
```
That explains why the runtime linker was trying to find the exact version 2.16.2 during import and was not satisfied with any other libnccl, even when RPATH and LD_LIBRARY_PATH are tweaked:
```
root@marie:/# LD_DEBUG=libs python -c "import cupy" 2>&1 | grep nccl
6505: find library=libnccl.so.2.16.2 [0]; searching
6505: trying file=/pyenv/versions/3.9.19/lib/libnccl.so.2.16.2
6505: trying file=/lib/x86_64-linux-gnu/libnccl.so.2.16.2
6505: calling init: /lib/x86_64-linux-gnu/libnccl.so.2.16.2
6505: calling fini: /lib/x86_64-linux-gnu/libnccl.so.2.16.2 [0]
```
Changing "nccl": {"version": "2.16.2", "filenames": ["libnccl.so.2.16.2"]}
to {"version": "2.20.5", "filenames": ["libnccl.so.2"]}
in _wheel.json to match PyT's requirement and update LD_LIBRARY_PATH:
```
root@1cc5aab-lcedt:~# LD_DEBUG=libs python -c "import cupy; import torch" 2>&1 | grep "calling init:.*nccl"
4041: calling init: /pyenv/versions/3.9.19/lib/python3.9/site-packages/torch/lib/../../nvidia/nccl/lib/libnccl.so.2
root@1cc5aab-lcedt:~# LD_DEBUG=libs python -c "import torch; import cupy" 2>&1 | grep "calling init:.*nccl"
4182: calling init: /pyenv/versions/3.9.19/lib/python3.9/site-packages/torch/lib/../../nvidia/nccl/lib/libnccl.so.2
```
we finally have only one NCCL loaded. :upside_down_face:
Ah, good find, I forgot about the preload logic...
Could you file a bug in CuPy's issue tracker? I think the preload logic needs to happen as part of the lazy loading, not before it.
Now you have two ways to hack around this in your CI workflow :D
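For anyone hitting this before the fix lands, the `_wheel.json` hack described above can be scripted at CI setup time along these lines (a sketch; the version strings are the ones from this container and are an assumption anywhere else):

```python
# Sketch of a CI-time hack: relax CuPy's pinned NCCL entry so the preload
# accepts the soname shipped by the nvidia-nccl wheel that PyTorch uses.
# This edits installed package data and is not an officially supported API.
import importlib.util
import json
import os

pkg_dir = os.path.dirname(importlib.util.find_spec("cupy").origin)
wheel_json = os.path.join(pkg_dir, ".data", "_wheel.json")

with open(wheel_json) as f:
    config = json.load(f)

# Values taken from this issue's container; adjust to whichever
# nvidia-nccl-cu11 wheel is actually installed (assumption).
config["nccl"] = {"version": "2.20.5", "filenames": ["libnccl.so.2"]}

with open(wheel_json, "w") as f:
    json.dump(config, f)
print("patched", wheel_json)
```

As noted above, `LD_LIBRARY_PATH` still needs to include the directory that contains the wheel's `libnccl.so.2`.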
Update: The fix from CuPy is expected to be released in version 13.2.0 sometime this week.
We have experienced the same issue several times in CI wheel-tests workflows when using `cupy` and `torch>=2.2` together.

The root cause is that `cupy` points to the built-in `libnccl.so` (2.16.2) in the container, while `pytorch` links to the `libnccl.so` (2.20.5) from the `nvidia-nccl-cu11` wheel. The older NCCL version is often incompatible with the latest PyTorch releases, which causes problems when coupling with other PyTorch-derived libraries. When `cupy` is imported before `torch`, the older NCCL in the system path shadows the version needed by PyTorch, resulting in the undefined symbol error mentioned above.

One less-than-ideal solution is to always `import torch` first, but this approach is rather error-prone for users. We'd like to hear your suggestions, @leofang, as a core CuPy dev, on potential workarounds. For example, is there a way to modify environment variables so that CuPy loads NCCL from a non-system path? Thank you!

CC: @alexbarghi-nv @VibhuJawa @naimnv
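For context, the import-order workaround amounts to funneling GPU imports through a tiny helper so that `torch` is always imported before `cupy`; a sketch (the `gpu_imports` module name is made up for illustration):

```python
# gpu_imports.py (hypothetical helper module): the less-than-ideal
# "import torch first" workaround. Importing torch first lets it resolve
# its NCCL symbols against the newer libnccl.so.2 from the nvidia-nccl-cu11
# wheel, avoiding the undefined symbol error; note that CuPy's preload may
# still map the older system copy as a separate DSO.
import torch  # noqa: F401  -- must come before cupy
import cupy   # noqa: F401

__all__ = ["torch", "cupy"]
```

Test code would then do `from gpu_imports import cupy, torch` instead of importing either library directly, which is exactly the kind of per-caller discipline that makes this approach error-prone.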