Open albanD opened 1 month ago
I don't think those dependencies come in any way from FindCUDA, but I feel fine if we just remove it and use the stock one from CMake.
Note we may need to bump the minimum CMake version in order to use the stock one.
@Skylion007 good point, in that case perhaps it's still better to vendor the preferred version from Modules.
From a quick check, the stock version seemed to work with CMake 3.18.
tl;dr: the version of FindCUDAToolkit that we use here is old enough that we are missing quite a few CUDA libraries from it, and the dependencies between CUDA libraries are not accurate. Loosely related, our script that updates the RPATH (the relative path used to find the .so files this library depends on) within our .so files, so that they always look for the CUDA libraries installed within the pip packages being shipped, is broken and does not contain the appropriate entries for the newly added nvjitlink library.
This is the root cause of several user issues like https://github.com/pytorch/pytorch/issues/134929 and https://github.com/pytorch/pytorch/issues/131312 as far as I can tell. I can also observe it locally: running `ldd` on the libtorch_cuda.so that is shipped with the PyTorch 2.5 binary on PyPI shows entries resolving to the wrong libraries.

These mismatched libraries being picked up lead to arbitrary issues. The most common case is when the binary built for CUDA 12.4 is installed on a machine whose global CUDA install is older than 12.4: nvjitlink has been added as a dependency of libcusparse but not to our RPATH, leading to the newer libcusparse being loaded with the global (old) nvjitlink.
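A minimal sketch of the kind of `ldd` mismatch described above. The sample output, venv location, and global CUDA path are all hypothetical; the point is that libcusparse resolves to the pip package while libnvJitLink leaks in from a global install:

```python
import re

# Hypothetical ldd-style output; real entries from `ldd libtorch_cuda.so`
# will differ on any given machine.
LDD_OUTPUT = """\
libcusparse.so.12 => /venv/lib/python3.10/site-packages/nvidia/cusparse/lib/libcusparse.so.12 (0x00007f0000000000)
libnvJitLink.so.12 => /usr/local/cuda-12.1/lib64/libnvJitLink.so.12 (0x00007f0000001000)
"""

def outside_wheel(ldd_output: str, wheel_root: str) -> list:
    """Return resolved library paths that do not live under wheel_root."""
    bad = []
    for line in ldd_output.splitlines():
        m = re.match(r"\s*\S+ => (\S+)", line)
        if m and not m.group(1).startswith(wheel_root):
            bad.append(m.group(1))
    return bad

# The global (old) libnvJitLink is flagged as a mismatch.
print(outside_wheel(LDD_OUTPUT, "/venv/lib/python3.10/site-packages"))
```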
How to fix this?
The most important fix and checks are:
- Fix the RPATH-updating script so it contains entries for all the shipped CUDA libraries, including the newly added nvjitlink (this can be verified with `readelf`).
- Add a check that runs `ldd` on the generated .so to ensure each library is loaded from the right place. We could even make the global CUDA install a fixed old version to reflect a lot of user setups.

cc @seemethere @malfet @osalpekar @atalman @ptrblck @msaroufim
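As an illustration of the RPATH side of the fix, here is a sketch of building `$ORIGIN`-relative entries so that nvjitlink cannot be forgotten. The `$ORIGIN/../../nvidia/<pkg>/lib` layout and the package names are assumptions about the pip wheel layout, not taken from the actual script:

```python
# Assumed layout: torch/lib/libtorch_cuda.so sits two directories above
# the nvidia/<pkg>/lib directories that the CUDA pip packages install.
CUDA_DEPS = ["cublas", "cusparse", "cudnn", "nvjitlink"]

def build_rpath(deps: list) -> str:
    """Build a $ORIGIN-relative RPATH so the dynamic loader searches the
    pip-installed CUDA packages before any global CUDA install."""
    return ":".join("$ORIGIN/../../nvidia/%s/lib" % dep for dep in deps)

# Every shipped CUDA dependency, including nvjitlink, gets an entry.
print(build_rpath(CUDA_DEPS))
```

The resulting string would then be written into the .so (e.g. with a tool like patchelf), so that the bundled libcusparse and the bundled nvjitlink are always picked up together.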