pdm-project / pdm

A modern Python package and dependency manager supporting the latest PEP standards
https://pdm-project.org

Cannot install PyTorch 1.13.x with PDM #1732

Closed yukw777 closed 7 months ago

yukw777 commented 1 year ago

Make sure you run commands with -v flag before pasting the output.

Steps to reproduce

  1. Install PyTorch 1.13.x by running pdm add torch (1.13.1 is the latest version currently.)
  2. Try to import torch in the interpreter: python -c 'import torch'.

Expected behavior

PyTorch should be imported without any errors.

Actual behavior

❯ python -c 'import torch'
Traceback (most recent call last):
  File ".../.venv/lib/python3.10/site-packages/torch/__init__.py", line 172, in _load_global_deps
    ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
  File "/usr/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcublas.so.11: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File ".../.venv/lib/python3.10/site-packages/torch/__init__.py", line 217, in <module>
    _load_global_deps()
  File ".../.venv/lib/python3.10/site-packages/torch/__init__.py", line 178, in _load_global_deps
    _preload_cuda_deps()
  File ".../.venv/lib/python3.10/site-packages/torch/__init__.py", line 158, in _preload_cuda_deps
    ctypes.CDLL(cublas_path)
  File "/usr/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: .../.venv/lib/python3.10/site-packages/nvidia/cublas/lib/libcublas.so.11: cannot open shared object file: No such file or directory

Environment Information

PDM version:
  2.4.6
Python Interpreter:
  .../.venv/bin/python (3.10)
Project Root:
  ...
Project Packages:
  None
{
  "implementation_name": "cpython",
  "implementation_version": "3.10.10",
  "os_name": "posix",
  "platform_machine": "x86_64",
  "platform_release": "5.4.0-121-generic",
  "platform_system": "Linux",
  "platform_version": "#137-Ubuntu SMP Wed Jun 15 13:33:07 UTC 2022",
  "python_full_version": "3.10.10",
  "platform_python_implementation": "CPython",
  "python_version": "3.10",
  "sys_platform": "linux"
}

I "think" this is related to the fact that PyTorch 1.13.x introduced a new set of dependencies around cuda (https://github.com/pytorch/pytorch/pull/85097). Poetry had issues b/c of this (https://github.com/pytorch/pytorch/issues/88049) but it's since been resolved, but not for pdm. My guess is that it might be b/c pdm installs the cuda dependencies separately from pytorch and b/c of that the pytorch installation doesn't know about them. It's a bummer, b/c I wanted to give pdm a spin for a new project, for now I'm going to have to stick to poetry. :/

xiaojinwhu commented 1 year ago

If you use CUDA, add the following to pyproject.toml:

[[tool.pdm.source]]
url = "https://download.pytorch.org/whl/cu116"
verify_ssl = true
name = "torch"

yukw777 commented 1 year ago

[Screenshot: Screen Shot 2023-02-24 at 2 56 01 PM]

@xiaojinwhu If you use CUDA 11.7, you actually don't need to add an extra index, as you can see in the screenshot above. That's the problem: it should work without adding that extra index. It does with pip and poetry.

frostming commented 1 year ago

I am working on a Mac M1 and torch 1.13.1 installs successfully, without CUDA, so I am afraid I am not able to reproduce this. You can try to investigate yourself, or perhaps someone else can help. For example, try to find out why the .so files are missing here while other installers (such as pip) don't have this problem, and what the differences in the installed files are.

michaelze commented 1 year ago

I'm having a similar (probably even the same) problem, and I suspect the install.cache setting is the culprit here (I assume @yukw777 also has it set to true).

I discovered the following issue with the nvidia libraries (nvidia_cublas_cu11, nvidia_cuda_nvrtc_cu11, etc.):

With install.cache turned off, the directory structure is as follows:

nvidia
├ __init__.py
├ cublas
├ cuda_nvrtc
├ cuda_runtime
└ cudnn
nvidia_cublas_cu11-11.10.3.66.dist-info
nvidia_cuda_nvrtc_cu11-11.7.99.dist-info
nvidia_cuda_runtime_cu11-11.7.99.dist-info
nvidia_cudnn_cu11-8.5.0.96.dist-info

As soon as you activate install.cache, the directory structure changes:

nvidia -> /root/.cache/pdm/packages/nvidia_cudnn_cu11-8.5.0.96-2-py3-none-manylinux1_x86_64/lib/nvidia
nvidia_cublas_cu11-11.10.3.66.dist-info
nvidia_cuda_nvrtc_cu11-11.7.99.dist-info
nvidia_cuda_runtime_cu11-11.7.99.dist-info
nvidia_cudnn_cu11-8.5.0.96.dist-info

The content of /root/.cache/pdm/packages/nvidia_cudnn_cu11-8.5.0.96-2-py3-none-manylinux1_x86_64/lib/nvidia is obviously only

__init__.py
cudnn

I hope this issue can be fixed somehow (I don't know how standards-compliant it is for several packages to install into a common package folder), because the nvidia packages are the primary reason I activated install.cache in the first place.
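
A quick way to check which of the two layouts you ended up with (a minimal sketch; the path below is illustrative and should be adjusted to your environment):

from pathlib import Path

# Illustrative path; adjust to your interpreter version and venv location.
nvidia = Path(".venv/lib/python3.10/site-packages/nvidia")

# With install.cache on, the whole directory is a single symlink into one
# cached wheel; with it off, it is a real directory merged from all the
# nvidia_* wheels.
print(nvidia.is_symlink(), nvidia.resolve())
print(sorted(p.name for p in nvidia.iterdir()))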

frostming commented 1 year ago

@michaelze Thanks for the investigation, but the cached wheel behind /root/.cache/pdm/packages/nvidia_cudnn_cu11-8.5.0.96-2-py3-none-manylinux1_x86_64/lib/nvidia only contains cudnn itself: https://pypi-browser.org/package/nvidia-cudnn-cu11/nvidia_cudnn_cu11-8.5.0.96-2-py3-none-manylinux1_x86_64.whl

So there might be other packages installing the cuda_* folders into the nvidia package, which can't work with the cache mechanism, since the cache key is the wheel name, as you can see.


Ah, yes, you listed the packages. The problem is that although they share the nvidia namespace, they don't use PEP 420 style implicit namespace packages. When creating symlinks, PDM thinks they are different packages and won't create symlinks recursively.


Try setting pdm config install.cache_method pth to see if another method works.
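
As a side note, one way to confirm how Python classifies the nvidia package (a small sketch; run it inside the affected environment with the nvidia packages installed):

import importlib.util

spec = importlib.util.find_spec("nvidia")
# For a PEP 420 namespace package, spec.origin is None and
# submodule_search_locations lists every directory contributing to it.
# Because the nvidia wheels ship a blank __init__.py, origin points at
# that file instead, so it is a regular package.
print(spec.origin)
print(list(spec.submodule_search_locations or []))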

michaelze commented 1 year ago

I tested your suggestion but the problem still persists.

Looking at the PyTorch source code (https://github.com/pytorch/pytorch/blob/v1.13.1/torch/__init__.py#L144) reveals the underlying problem:

So the problem here is, I think, twofold:

  1. Nvidia distributes several packages that install into the same subfolder. This makes creating a symlink at the top level impossible; symlinking would have to treat the nvidia packages in a special way (create the parent folder, then create symlinks for the subfolders).
  2. PyTorch uses knowledge about that directory structure directly to load the libraries. This also rules out install.cache_method pth. If it resolved the path to the libraries for each package while iterating sys.path, it could work...

From looking at the code, PyTorch 2.0.0 might actually work with PDM and install.cache_method pth, as the code that loads the CUDA libraries iterates over all elements of sys.path and looks for the nvidia subfolder and the library in each entry individually.
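
For illustration, the 2.0-style lookup described above amounts to roughly the following (a simplified sketch, not the actual torch source; the function name is made up):

import ctypes
import glob
import os
import sys

def preload_cuda_lib(lib_folder: str, lib_name: str) -> None:
    # Walk sys.path and look for nvidia/<lib_folder>/lib/<lib_name> in each
    # entry individually, so a .pth-based layout can still be resolved.
    for entry in sys.path:
        matches = glob.glob(os.path.join(entry, "nvidia", lib_folder, "lib", lib_name))
        if matches:
            ctypes.CDLL(matches[0], mode=ctypes.RTLD_GLOBAL)
            return
    raise OSError(f"could not find {lib_name} on sys.path")

# e.g. preload_cuda_lib("cublas", "libcublas.so.*")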

frostming commented 1 year ago
  1. Nvidia distributing several packages that install into the same subfolder This makes creating a symlink at the top level impossible, symlinking would have to treat the nvidia packages in a special way (create the parent folder, create symlinks for the subfolders)

That special treatment does exist, but for PEP 420 namespace packages (packages without an __init__.py), not for one special package, and that is not going to change. The best way to fix this is to remove nvidia/__init__.py from the nvidia distributions.

JesseFarebro commented 12 months ago

@michaelze It seems to work for me with pdm config install.cache_method pth but symlink fails as mentioned above.

ocss884 commented 11 months ago

If you install torch from PyPI, the full version name is 1.13.x+cu117, and the CUDA dependencies (the packages under the nvidia folder) are shipped as separate PyPI packages alongside torch.

See this function: https://github.com/pytorch/pytorch/blob/v1.13.1/torch/__init__.py#L163. When importing torch==1.13.x (for >=2.0.0 the loading mechanism is different), the logic is as follows (a simplified sketch follows the list):

  1. The program first loads dependencies via libtorch_global_deps.so; the following CUDA dependencies are checked during this step: libcublas.so.11, libcudnn.so.8 and libnvToolsExt.so.1.
  2. If an OSError occurs, it checks whether the error was caused by a missing libcublas.so.11; if so, it searches for libcublas and libcudnn by walking sys.path, otherwise it re-raises the error.
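
A minimal sketch of that fallback (not the actual torch source; _preload_cuda_deps below is a simplified stand-in):

import ctypes
import glob
import os
import sys

def _preload_cuda_deps() -> None:
    # Simplified stand-in: look for cublas and cudnn under a nvidia/ folder
    # on sys.path and load them eagerly.
    for entry in sys.path:
        for pattern in ("nvidia/cublas/lib/libcublas.so.*",
                        "nvidia/cudnn/lib/libcudnn.so.*"):
            for match in glob.glob(os.path.join(entry, pattern)):
                ctypes.CDLL(match)

def _load_global_deps(lib_path: str) -> None:
    try:
        ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
    except OSError as err:
        # Only a missing libcublas triggers the fallback search; if cudnn is
        # the missing one, the original error is re-raised instead, which
        # matches step 2 above.
        if "libcublas.so.11" not in str(err):
            raise
        _preload_cuda_deps()
        ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)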

If you have a local CUDA toolkit 11.7 installation (11.8 may also work, as long as libcublas.so.11 can be found) and have configured LD_LIBRARY_PATH correctly, all of these CUDA dependencies can in fact be found except cudnn. So regardless of your install.cache_method setting, the first three libraries can always be found. But because of the logic above, when the OSError is caused by a missing cudnn while libcublas is present, the fallback never searches for cudnn.

If you don't have a local toolkit installation, libcublas is missing in the first place as well, so the pth config can help the program find all the CUDA dependencies.

It seems the torch wheels from the PyTorch website always ship the necessary libcudnn.so.* files under the torch/lib directory, which is not the case for the wheels from PyPI. If you have a local CUDA installation, try downloading from their website, e.g.:

pdm add https://download.pytorch.org/whl/cu117/torch-1.13.1%2Bcu117-cp310-cp310-linux_x86_64.whl

The disadvantage is that torch cannot be cached this way. On the other hand, if you are using torch from PyPI without a local toolkit installation, I'm not sure all of torch's functionality can be used anyway.

Ttayu commented 9 months ago

It doesn't work with the latest pdm or pytorch. Even if the root cause is actually on nvidia's side, PyTorch users will be happier if there is some way to compromise.

I'm compelled to write a script along these lines that copies the libraries directly into the cache:

cp -r /home/user/.cache/pdm/packages/nvidia_nccl_cu12-2.18.1-py3-none-manylinux1_x86_64/lib/nvidia/nccl /home/user/.cache/pdm/packages/nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64/lib/nvidia/
cp -r /home/user/.cache/pdm/packages/nvidia_nvtx_cu12-12.1.105-py3-none-manylinux1_x86_64/lib/nvidia/nvtx /home/user/.cache/pdm/packages/nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64/lib/nvidia/
cp -r /home/user/.cache/pdm/packages/nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64/lib/nvidia/cufft /home/user/.cache/pdm/packages/nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64/lib/nvidia/

...

It works, but users shouldn't have to resort to this.

For example, would it be possible to have a workaround that installs the nvidia libraries (explicitly named, e.g. in pdm.toml) by copying them directly, instead of symlinking them (the cache_method)?

By the way, in my environment, pdm config install.cache_method pth did not work.

frostming commented 7 months ago

Can anyone in this thread check if the issue still exists on the latest PDM? Much appreciated for that.

Ttayu commented 7 months ago

Yes, this occurs even with the latest PDM (2.10.3).

frostming commented 7 months ago

Fine, I'll paste the code comment to give more insight on why it happens: https://github.com/pdm-project/pdm/blob/837e7d076502c227adcf0b5a4c44836cbab333bd/src/pdm/installers/installers.py#L75-L82

PDM only looks at children if the parent dir is a namespace package. And PDM detects a namespace based on these rules:

https://github.com/pdm-project/pdm/blob/837e7d076502c227adcf0b5a4c44836cbab333bd/src/pdm/installers/installers.py#L49-L60

So if a package breaks that assumption, PDM doesn't know how to create the symlinks properly. I don't think it's something PDM can fix; you need to disable install.cache for such packages.
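
Restated as a rough paraphrase (not PDM's actual implementation), the rule boils down to something like:

from pathlib import Path

def treated_as_namespace_package(directory: Path) -> bool:
    # Simplified paraphrase of the rule above: only a directory without its
    # own __init__.py is treated as a PEP 420 namespace package, and only
    # then does PDM recurse into it and link its children individually.
    return directory.is_dir() and not (directory / "__init__.py").exists()

# nvidia/ ships a blank __init__.py, so the whole directory becomes one
# symlink into a single cached wheel, and the files contributed by the
# other nvidia_* wheels never show up.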

Ttayu commented 7 months ago

Yes, I understand that PDM is NOT the main cause. However, as a PyTorch user, what I would like is for torch and nvidia to be downloaded or copied directly into __pypackages__ (cp -r ~/.cache/pdm/packages/somelib/somelib), while other libraries are placed under __pypackages__ using symlinks. Is that difficult to realize? (If it isn't important enough, feel free to close this.)

The former has the problem that _C.cpython-3x-x86_64-linux-gnu.so under a symlinked __pypackages__/3.x/lib/torch cannot find the .so libraries it depends on in __pypackages__/3.x/lib/nvidia, and the latter has the problem with the nvidia folder structure; both are caused by symlinking.

frostming commented 7 months ago

The main cause is that nvidia is a normal package with a blank __init__.py, in which case PDM creates a single symlink for the whole directory. Maybe we can implement a different link strategy that forces PDM to create a symlink for each individual file.
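
Such a per-file strategy could look roughly like this (a sketch only, under the assumption of a flat per-wheel cache layout; it is not an actual PDM implementation):

import os
from pathlib import Path

def link_tree_per_file(cached_wheel_root: Path, site_packages: Path) -> None:
    # Recreate the directory tree and symlink each file individually, so
    # several cached wheels can contribute files to the same package
    # directory (e.g. nvidia/) without clobbering each other.
    for src in cached_wheel_root.rglob("*"):
        dest = site_packages / src.relative_to(cached_wheel_root)
        if src.is_dir():
            dest.mkdir(parents=True, exist_ok=True)
        elif not dest.exists():
            dest.parent.mkdir(parents=True, exist_ok=True)
            os.symlink(src, dest)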

Ttayu commented 7 months ago

The PyTorch-side problem is clearly a separate issue.

It occurs when lib/torch is symlinked, regardless of whether lib/nvidia is a real directory or a symlink. It might be better to open a separate issue for it.

The workaround is to copy everything (without using the cache method), but I would like to keep taking advantage of the great feature of linking from the cache.

ae9is commented 6 months ago

For anyone coming here off search engines... I wiped my lock file and .venv, and the following worked for me (thanks to #2425!):

pdm config --local install.cache_method symlink_individual

fancyerii commented 5 months ago

For anyone coming here off search engines... I wiped my lock file and .venv, and the following worked for me (thanks to #2425!):

pdm config --local install.cache_method symlink_individual

Still doesn't work with pytorch 2.2.0 and the latest pdm. I tried symlink_individual, hardlink and pth (I can't find pth in the documentation; maybe it was removed in a newer version of pdm?) and none of them worked.

fancyerii commented 5 months ago

For example, would it be possible to have a workaround that installs the nvidia libraries (explicitly named, e.g. in pdm.toml) by copying them directly, instead of symlinking them (the cache_method)?

The issue still exists with pytorch 2.2 and pdm 2.12.3; see https://github.com/pdm-project/pdm/issues/2614