Open albanD opened 1 month ago
I don't think those dependencies come in any way from FindCUDA, but I feel fine if we just remove it and use the stock one from CMake.
Note we may need to bump the minimum CMake version in order to use the stock one.
@Skylion007 good point, in that case perhaps it's still better to vendor the preferred version from Modules.
From a quick check, the stock version seemed to work with CMake 3.18.
tl;dr: the version of FindCUDAToolkit that we use here is old enough that we are missing quite a few CUDA libraries from it, and the dependencies between CUDA libraries are not accurate. Loosely related, our script that updates the RPATH (the relative path used to find the .so files this library depends on) within our .so files, so that they always look for the CUDA libraries installed within the pip packages being shipped, is broken and does not contain the appropriate entries for the newly added nvjitlink library.
This is the root cause of several user issues like https://github.com/pytorch/pytorch/issues/134929 and https://github.com/pytorch/pytorch/issues/131312 as far as I can tell. I can also observe it locally: running `ldd` on the libtorch_cuda.so that is shipped with the PyTorch 2.5 binary on PyPI shows entries resolving to the wrong libraries.

These mismatched libraries being picked up lead to arbitrary issues. The most common case is when the binary built for CUDA 12.4 is installed on a machine whose global CUDA install is older than 12.4: nvjitlink has been added as a dependency of libcusparse but not to our RPATH, leading to the newer libcusparse being loaded with the global (old) nvjitlink.
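A minimal sketch of the kind of `ldd` mismatch described above. The sample output, venv location, and global CUDA path are all hypothetical; the point is that libcusparse resolves to the pip package while libnvJitLink leaks in from a global install:

```python
import re

# Hypothetical ldd-style output; real entries from `ldd libtorch_cuda.so`
# will differ on any given machine.
LDD_OUTPUT = """\
libcusparse.so.12 => /venv/lib/python3.10/site-packages/nvidia/cusparse/lib/libcusparse.so.12 (0x00007f0000000000)
libnvJitLink.so.12 => /usr/local/cuda-12.1/lib64/libnvJitLink.so.12 (0x00007f0000001000)
"""

def outside_wheel(ldd_output: str, wheel_root: str) -> list:
    """Return resolved library paths that do not live under wheel_root."""
    bad = []
    for line in ldd_output.splitlines():
        m = re.match(r"\s*\S+ => (\S+)", line)
        if m and not m.group(1).startswith(wheel_root):
            bad.append(m.group(1))
    return bad

# The global (old) libnvJitLink is flagged as a mismatch.
print(outside_wheel(LDD_OUTPUT, "/venv/lib/python3.10/site-packages"))
```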
How to fix this?
The most important fix and checks are:
- Fix the RPATH-updating script so it contains entries for all the shipped CUDA libraries, including the newly added nvjitlink (this can be verified with `readelf`).
- Add a check that runs `ldd` on the generated .so to ensure each library is loaded from the right place. We could even make the global CUDA install a fixed old version to reflect a lot of user setups.

cc @seemethere @malfet @osalpekar @atalman @ptrblck @msaroufim
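As an illustration of the RPATH side of the fix, here is a sketch of building `$ORIGIN`-relative entries so that nvjitlink cannot be forgotten. The `$ORIGIN/../../nvidia/<pkg>/lib` layout and the package names are assumptions about the pip wheel layout, not taken from the actual script:

```python
# Assumed layout: torch/lib/libtorch_cuda.so sits two directories above
# the nvidia/<pkg>/lib directories that the CUDA pip packages install.
CUDA_DEPS = ["cublas", "cusparse", "cudnn", "nvjitlink"]

def build_rpath(deps: list) -> str:
    """Build a $ORIGIN-relative RPATH so the dynamic loader searches the
    pip-installed CUDA packages before any global CUDA install."""
    return ":".join("$ORIGIN/../../nvidia/%s/lib" % dep for dep in deps)

# Every shipped CUDA dependency, including nvjitlink, gets an entry.
print(build_rpath(CUDA_DEPS))
```

The resulting string would then be written into the .so (e.g. with a tool like patchelf), so that the bundled libcusparse and the bundled nvjitlink are always picked up together.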