rapidsai / cugraph

cuGraph - RAPIDS Graph Analytics Library
https://docs.rapids.ai/api/cugraph/stable/
Apache License 2.0

`cupy` wheel and `torch` wheel link to different NCCL shared libraries in RAPIDS CI containers #4465

Open tingyu66 opened 5 months ago

tingyu66 commented 5 months ago

We have experienced the same issue several times in CI wheel-tests workflows when using cupy and torch>=2.2 together:

    torch = import_optional("torch")
/pyenv/versions/3.9.19/lib/python3.9/site-packages/cugraph/utilities/utils.py:455: in import_optional
    return importlib.import_module(mod)
/pyenv/versions/3.9.19/lib/python3.9/site-packages/torch/__init__.py:237: in <module>
    from torch._C import *  # noqa: F403
E   ImportError: /pyenv/versions/3.9.19/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so: undefined symbol: ncclCommRegister

The root cause is that cupy points to the built-in libnccl.so (2.16.2) in the container, while pytorch links to libnccl.so (2.20.5) from the nvidia-nccl-cu11 wheel. The older NCCL version is often incompatible with the latest PyTorch releases, which causes problems when combining it with other PyTorch-based libraries. When cupy is imported before torch, the older NCCL in the system path shadows the version needed by PyTorch, resulting in the undefined symbol error shown above.
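For anyone triaging this, one way to see the two copies side by side (a diagnostic sketch, not part of the original report; torch is imported first here so the process survives long enough to inspect itself) is to list the libnccl mappings of the current process:

    import torch  # imported first, so the libnccl.so.2 from the nvidia-nccl-cu11 wheel is loaded
    import cupy   # a bare cupy import then pulls in the system libnccl.so.2.16.2 as well

    # Every distinct libnccl path in /proc/self/maps is a separately loaded copy.
    with open("/proc/self/maps") as maps:
        nccl_paths = sorted({line.split()[-1] for line in maps if "libnccl" in line})

    for path in nccl_paths:
        print(path)  # two different paths means two NCCL copies coexist in one process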

One less-than-ideal solution is to always import torch first, but this approach is rather error-prone for users. We'd like to hear your suggestions, @leofang, as a core CuPy dev, on potential workarounds. For example, is there a way to modify environment variables so that CuPy loads NCCL from a non-system path? Thank you!
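For reference, a minimal sketch of the "import torch first" workaround (assuming the wheel tests run under pytest, so a root-level conftest.py gives one place to control import order; the file and its placement are hypothetical):

    # conftest.py -- hypothetical workaround sketch, not part of cugraph
    # Importing torch before anything that pulls in cupy means the libnccl.so.2
    # shipped in the nvidia-nccl-cu11 wheel is loaded before the older system
    # copy can shadow it.  Fragile: every entry point must remember this order.
    import torch  # noqa: F401
    import cupy   # noqa: F401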

CC: @alexbarghi-nv @VibhuJawa @naimnv

tingyu66 commented 5 months ago

To reproduce:

    docker run --gpus all --rm -it --network=host rapidsai/citestwheel:cuda11.8.0-ubuntu20.04-py3.9 bash
    pip install cupy-cuda11x
    pip install torch==2.3.0 --index-url https://download.pytorch.org/whl/cu118
    python -c "import cupy; import torch"

leofang commented 5 months ago

I feel something is wrong in your container in a way that I haven't fully understood. CuPy lazy-loads all CUDA libraries so import cupy does not trigger the loading of libnccl. Something else does (but I can't tell why).

The only CuPy module that links to libnccl is cupy_backends.cuda.libs.nccl, as can be confirmed as follows:

root@marie:/# for f in $(find / -type f,l -regex '/pyenv/**/.*.so'); do readelf -d $f | grep "nccl.so"; if [[ $? -eq 0 ]]; then echo $f; fi; done
 0x0000000000000001 (NEEDED)             Shared library: [libnccl.so.2]
/pyenv/versions/3.9.19/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so
 0x0000000000000001 (NEEDED)             Shared library: [libnccl.so.2]
/pyenv/versions/3.9.19/lib/python3.9/site-packages/cupy_backends/cuda/libs/nccl.cpython-39-x86_64-linux-gnu.so

Now, if you monitor the loaded DSOs, you'll see that this module (nccl.cpython-39-x86_64-linux-gnu.so) is actually not loaded (by design), but libnccl still gets loaded:

root@marie:/# LD_DEBUG=libs python -c "import cupy" 2>&1 | grep nccl
      6505: find library=libnccl.so.2.16.2 [0]; searching
      6505:   trying file=/pyenv/versions/3.9.19/lib/libnccl.so.2.16.2
      6505:   trying file=/lib/x86_64-linux-gnu/libnccl.so.2.16.2
      6505: calling init: /lib/x86_64-linux-gnu/libnccl.so.2.16.2
      6505: calling fini: /lib/x86_64-linux-gnu/libnccl.so.2.16.2 [0]
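The same thing can be checked from the Python side (a small sketch, assuming the container above): per the LD_DEBUG output, the NCCL extension module itself is never imported.

    import sys
    import cupy  # noqa: F401

    # The extension that actually links against libnccl is only imported on demand,
    # so a bare "import cupy" should leave it out of sys.modules, even though
    # LD_DEBUG shows libnccl itself being dlopen'ed.
    print("cupy_backends.cuda.libs.nccl" in sys.modules)  # expected: False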

What's worse, when the import order is swapped, two distinct copies of libnccl are loaded, one from the system (as shown above) and the other from the NCCL wheel:

root@marie:/# LD_DEBUG=libs python -c "import torch; import cupy" 2>&1 | grep "calling init:.*nccl"
      6787: calling init: /pyenv/versions/3.9.19/lib/python3.9/site-packages/torch/lib/../../nvidia/nccl/lib/libnccl.so.2
      6787: calling init: /lib/x86_64-linux-gnu/libnccl.so.2.16.2

So I am not sure what I'm looking at 🤷

Question for @tingyu66: If this container is owned/controlled by RAPIDS, can't you just remove the system NCCL?

tingyu66 commented 5 months ago

@leofang I just took another look and think I found the culprit. https://github.com/cupy/cupy/blob/a54b7abfed668e52de7f3eee7b3fe8ccaef34874/cupy/_environment.py#L270-L274

For wheel builds, cupy._environment preloads specific CUDA library versions defined in the .data/_wheel.json file.

root@1cc5aab-lcedt:~# cat /pyenv/versions/3.9.19/lib/python3.9/site-packages/cupy/.data/_wheel.json

{"cuda": "11.x", "packaging": "pip", "cutensor": {"version": "2.0.1", "filenames": ["libcutensor.so.2.0.1"]}, "nccl": {"version": "2.16.2", "filenames": ["libnccl.so.2.16.2"]}, "cudnn": {"version": "8.8.1", "filenames": ["libcudnn.so.8.8.1", "libcudnn_ops_infer.so.8.8.1", "libcudnn_ops_train.so.8.8.1", "libcudnn_cnn_infer.so.8.8.1", "libcudnn_cnn_train.so.8.8.1", "libcudnn_adv_infer.so.8.8.1", "libcudnn_adv_train.so.8.8.1"]}}root@1cc5aab-lcedt:/pyenv/versions/3.9.19/lib/python3.9/site-packages/cupy/.data# 

That explains why the runtime linker was trying to find the exact version 2.16.2 during import and was not satisfied with any other libnccl, even when RPATH and LD_LIBRARY_PATH were tweaked.

root@marie:/# LD_DEBUG=libs python -c "import cupy" 2>&1 | grep nccl
      6505: find library=libnccl.so.2.16.2 [0]; searching
      6505:   trying file=/pyenv/versions/3.9.19/lib/libnccl.so.2.16.2
      6505:   trying file=/lib/x86_64-linux-gnu/libnccl.so.2.16.2
      6505: calling init: /lib/x86_64-linux-gnu/libnccl.so.2.16.2
      6505: calling fini: /lib/x86_64-linux-gnu/libnccl.so.2.16.2 [0]
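In other words, the preload amounts to a dlopen of that literal filename, which the dynamic loader resolves against its default search path. A minimal illustration (not CuPy's actual code):

    import ctypes

    # Request the exact soname listed in _wheel.json; in this container the loader
    # resolves it to /lib/x86_64-linux-gnu/libnccl.so.2.16.2, i.e. the system copy.
    nccl = ctypes.CDLL("libnccl.so.2.16.2")
    print(nccl)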

Changing "nccl": {"version": "2.16.2", "filenames": ["libnccl.so.2.16.2"]} to {"version": "2.20.5", "filenames": ["libnccl.so.2"]} in _wheel.json to match PyT's requirement and update LD_LIBRARY_PATH:

root@1cc5aab-lcedt:~# LD_DEBUG=libs python -c "import cupy; import torch" 2>&1 | grep "calling init:.*nccl"
      4041: calling init: /pyenv/versions/3.9.19/lib/python3.9/site-packages/torch/lib/../../nvidia/nccl/lib/libnccl.so.2
root@1cc5aab-lcedt:~# LD_DEBUG=libs python -c "import torch; import cupy" 2>&1 | grep "calling init:.*nccl"
      4182: calling init: /pyenv/versions/3.9.19/lib/python3.9/site-packages/torch/lib/../../nvidia/nccl/lib/libnccl.so.2

we finally have a single copy of NCCL loaded, regardless of import order. :upside_down_face:
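A scriptable version of that edit, for anyone needing the same CI hack (a sketch only; the file location and keys come from the cat output above, and this is not a supported CuPy interface):

    import json
    import os

    import cupy

    # The wheel's preload metadata sits next to the installed cupy package.
    wheel_json = os.path.join(os.path.dirname(cupy.__file__), ".data", "_wheel.json")

    with open(wheel_json) as f:
        meta = json.load(f)

    # Point the NCCL preload entry at the generic soname shipped by the
    # nvidia-nccl-cu11 wheel instead of the pinned libnccl.so.2.16.2 filename.
    meta["nccl"] = {"version": "2.20.5", "filenames": ["libnccl.so.2"]}

    with open(wheel_json, "w") as f:
        json.dump(meta, f)

As noted above, LD_LIBRARY_PATH still has to include the wheel's nvidia/nccl/lib directory so that libnccl.so.2 can actually be found.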

leofang commented 5 months ago

Ah, good finding, I forgot there's the preload logic...

Could you file a bug in CuPy's issue tracker? I think the preload logic needs to happen as part of the lazy loading, not before it.
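Purely as an illustration of that idea (a sketch, not CuPy code): defer the dlopen until NCCL functionality is first requested, so an unrelated import cupy never touches libnccl.

    import ctypes

    _nccl = None

    def _load_nccl():
        # Hypothetical deferred preload: nothing is dlopen'ed at import time;
        # the library is resolved only on first use.
        global _nccl
        if _nccl is None:
            _nccl = ctypes.CDLL("libnccl.so.2")
        return _nccl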

Now you have two ways to hack around it in your CI workflow :D

tingyu66 commented 5 months ago

Update: The fix from CuPy is expected to be released in version 13.2.0 sometime this week.