pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

CUDA Binary dependency chain is wrong, leading to bad binary packaging #138460

Open albanD opened 1 month ago

albanD commented 1 month ago

tl;dr: the version of FindCUDAToolkit that we use here is old enough that it is missing quite a few CUDA libraries, and the dependencies it declares between CUDA libraries are not accurate. Loosely related, the script that updates the RPATH (the run-time search path used to locate the .so files a library depends on) within our .so files, so that they always look for the CUDA libraries installed within the pip package being shipped, is broken: it does not contain the appropriate entries for the newly added nvjitlink library.

This is the root cause of several user issues like https://github.com/pytorch/pytorch/issues/134929 and https://github.com/pytorch/pytorch/issues/131312 as far as I can tell. I can also observe it locally: running ldd on the libtorch_cuda.so that is shipped with the PyTorch 2.5 binary on PyPI, I get entries like:

    libcurand.so.10 => /usr/local/cuda/lib64/libcurand.so.10
    libcublas.so.12 => /home/albandes/local/pytorch/3.10_release_binary_env/lib/python3.10/site-packages/torch/lib/../../nvidia/cublas/lib/libcublas.so.12
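
To make this kind of mismatch easier to spot, here is a minimal sketch that splits ldd output into libraries resolved from inside the pip environment and libraries resolved from a system-wide location. The function name `classify_ldd_lines`, the `site-packages` marker, and the library name prefixes are assumptions chosen for illustration; they are not part of PyTorch or of any existing tooling.

```python
def classify_ldd_lines(ldd_output, bundled_marker="site-packages",
                       prefixes=("libcu", "libnv", "libnccl")):
    """Split CUDA-looking libraries from ldd output into bundled vs. system.

    bundled_marker and prefixes are heuristics (assumptions), not an
    official list of CUDA libraries.
    """
    bundled, system = {}, {}
    for line in ldd_output.splitlines():
        if "=>" not in line:
            continue  # e.g. linux-vdso or the dynamic loader itself
        name, _, rest = line.partition("=>")
        name = name.strip()
        parts = rest.split()
        if not parts or parts[0] == "not":
            continue  # "not found" entries have no resolved path
        if not name.startswith(prefixes):
            continue  # ignore libc, libm, and other non-CUDA libraries
        path = parts[0]  # any trailing "(0x...)" load address is dropped
        (bundled if bundled_marker in path else system)[name] = path
    return bundled, system


# The two entries quoted above, used as sample input.
sample = """
    libcurand.so.10 => /usr/local/cuda/lib64/libcurand.so.10
    libcublas.so.12 => /home/albandes/local/pytorch/3.10_release_binary_env/lib/python3.10/site-packages/torch/lib/../../nvidia/cublas/lib/libcublas.so.12
"""
bundled, system = classify_ldd_lines(sample)
print("bundled:", sorted(bundled))
print("system: ", sorted(system))
```

Any name that ends up in `system` is a CUDA library that the loader resolved outside the pip-installed packages, which is the situation described in this issue.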

These mismatched libraries being picked up lead to arbitrary issues. The most common case is when the binary built for CUDA 12.4 is installed alongside a global CUDA install older than 12.4: nvjitlink has been added as a dependency of libcusparse but not to our RPATH, so the newer libcusparse ends up being loaded with the global (old) nvjitlink.
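
A rough way to check for a missing RPATH entry is to parse the dynamic section printed by `readelf -d` and look for the expected package directories. The sketch below is only illustrative: the helper names are hypothetical, the sample dynamic-section line is made up, and the `nvidia/<package>/lib` layout (including the `nvjitlink` directory name) is an assumption inferred from the libcublas path in the ldd output above.

```python
import re

# Matches the RPATH / RUNPATH lines in `readelf -d` output.
_RPATH_RE = re.compile(
    r"\((?:RPATH|RUNPATH)\)\s+Library (?:rpath|runpath): \[([^\]]*)\]"
)


def rpath_entries(readelf_output):
    """Return the individual search-path entries from `readelf -d` text."""
    entries = []
    for match in _RPATH_RE.finditer(readelf_output):
        entries.extend(p for p in match.group(1).split(":") if p)
    return entries


def missing_packages(entries, required):
    """List required nvidia/<package> directories absent from the entries."""
    return [
        pkg for pkg in required
        if not any(f"/nvidia/{pkg}/" in entry + "/" for entry in entries)
    ]


# Made-up sample, NOT the real dynamic section of any shipped library.
sample = (
    " 0x000000000000000f (RPATH)              Library rpath: "
    "[$ORIGIN/../../nvidia/cublas/lib:$ORIGIN/../../nvidia/cusparse/lib:$ORIGIN]\n"
)
entries = rpath_entries(sample)
missing = missing_packages(entries, ["cublas", "cusparse", "nvjitlink"])
print("missing from RPATH:", missing)
```

On a real system the input would come from running `readelf -d` on the shared library in question; a non-empty result means the loader may fall back to a system-wide copy of that library.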

How to fix this?

The most important fixes and checks are:

cc @seemethere @malfet @osalpekar @atalman @ptrblck @msaroufim

malfet commented 1 month ago

I don't think those dependencies come in any way from FindCUDA, but I feel fine if we just remove it and use the stock one from CMake.

Skylion007 commented 1 month ago

Note we may need to bump the minimum CMake in order to use the stock version.

malfet commented 1 month ago

@Skylion007 good point, in that case perhaps it's still better to vend the preferred version from Modules.

albanD commented 1 month ago

From a quick check, the stock version seemed to work with CMake 3.18.