If you use CUDA, add the following to pyproject.toml:

```toml
[[tool.pdm.source]]
url = "https://download.pytorch.org/whl/cu116"
verify_ssl = true
name = "torch"
```
@xiaojinwhu If you use cuda 11.7, you actually don't need to add an extra index, as you can see above. That's the problem: it should work without adding that extra index. This works with pip and poetry.
I am working on a Mac M1 and torch 1.13.1 is installed successfully, without CUDA, so I am afraid I am not able to reproduce it. You can try to research it yourself, or maybe someone else can help. For example, try to find out why it misses .so files while other installers (such as pip) don't, and what the differences in the installed files are.
I'm having a similar (probably even the same) problem, and I suspect the install.cache setting is the culprit here (I assume @yukw777 also has this set to true).
I discovered the following issue with the nvidia libraries (nvidia_cublas_cu11, nvidia_cuda_nvrtc_cu11, etc.):
With install.cache turned off, the directory structure is as follows:

```
nvidia
├ __init__.py
├ cublas
├ cuda_nvrtc
├ cuda_runtime
└ cudnn
nvidia_cublas_cu11-11.10.3.66.dist-info
nvidia_cuda_nvrtc_cu11-11.7.99.dist-info
nvidia_cuda_runtime_cu11-11.7.99.dist-info
nvidia_cudnn_cu11-8.5.0.96.dist-info
```
As soon as you activate install.cache, the directory structure changes:

```
nvidia -> /root/.cache/pdm/packages/nvidia_cudnn_cu11-8.5.0.96-2-py3-none-manylinux1_x86_64/lib/nvidia
nvidia_cublas_cu11-11.10.3.66.dist-info
nvidia_cuda_nvrtc_cu11-11.7.99.dist-info
nvidia_cuda_runtime_cu11-11.7.99.dist-info
nvidia_cudnn_cu11-8.5.0.96.dist-info
```
The content of /root/.cache/pdm/packages/nvidia_cudnn_cu11-8.5.0.96-2-py3-none-manylinux1_x86_64/lib/nvidia is obviously only:

```
__init__.py
cudnn
```
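A quick way to verify which cached wheel the symlinked nvidia package actually resolves to (illustrative snippet, not from the original report):

```python
import os
import nvidia  # the shared package installed by the nvidia_* wheels

# With install.cache on, this prints a single wheel's cache entry, e.g.
# /root/.cache/pdm/packages/nvidia_cudnn_cu11-.../lib/nvidia
print(os.path.realpath(os.path.dirname(nvidia.__file__)))
```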
I hope that this issue can be fixed somehow (I don't know how standards-compliant it is for several packages to install into a common package folder) because the nvidia packages are the primary reason I activated install.cache in the first place.
@michaelze Thanks for the investigation, but the wheel behind /root/.cache/pdm/packages/nvidia_cudnn_cu11-8.5.0.96-2-py3-none-manylinux1_x86_64/lib/nvidia only contains cudnn itself:
https://pypi-browser.org/package/nvidia-cudnn-cu11/nvidia_cudnn_cu11-8.5.0.96-2-py3-none-manylinux1_x86_64.whl
So there might be some other packages that install cuda_* folders into the nvidia package, which can't work with the cache mechanism, where the cache key is the wheel name, as you can see.
Ah, yes, you list the packages below. The problem is that although they share the nvidia namespace, they don't use PEP 420-style implicit namespace packages. When creating symlinks, PDM thinks they are different packages and won't create symlinks recursively.
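To make the distinction concrete, here is a minimal check (illustrative, not PDM's actual code; the __pypackages__ path is just an example) for whether a directory qualifies as a PEP 420 implicit namespace package:

```python
import os

def is_pep420_namespace(pkg_dir: str) -> bool:
    # PEP 420 implicit namespace packages must NOT contain an __init__.py;
    # the nvidia/ directory ships one, so it fails this test and is treated
    # as a single regular package.
    return os.path.isdir(pkg_dir) and not os.path.exists(
        os.path.join(pkg_dir, "__init__.py")
    )

print(is_pep420_namespace("__pypackages__/3.10/lib/nvidia"))  # False
```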
Try setting pdm config install.cache_method pth to see if another method works.
I tested your suggestion but the problem still persists.
Looking at the PyTorch source code (https://github.com/pytorch/pytorch/blob/v1.13.1/torch/__init__.py#L144) reveals the underlying problem: torch 1.13 searches for an nvidia folder in all elements of sys.path, but resolves every CUDA library relative to the first match. So the problem here is, I think, that install.cache_method pth is also not a possibility. If it resolved the path to the libraries for each package while iterating sys.path, it could work... From looking at the code, PyTorch 2.0.0 might actually work with PDM and install.cache_method pth, as the code that loads the CUDA libraries iterates all elements of sys.path and looks for the nvidia subfolder and the library in each element individually.
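To make that concrete, here is a hedged sketch of the 1.13-style lookup (simplified, not the verbatim torch source): the first sys.path entry containing an nvidia/ folder is assumed to hold all of the CUDA libraries, which breaks when the pth cache method spreads them across several path entries.

```python
import os
import sys

def find_cuda_libs():
    for entry in sys.path:
        nvidia_dir = os.path.join(entry, "nvidia")
        if not os.path.exists(nvidia_dir):
            continue
        # Both libraries are resolved relative to the FIRST matching entry
        # and the search stops, even if cudnn lives under a later entry.
        return (
            os.path.join(nvidia_dir, "cublas", "lib", "libcublas.so.11"),
            os.path.join(nvidia_dir, "cudnn", "lib", "libcudnn.so.8"),
        )
    return None
```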
- Nvidia distributes several packages that install into the same subfolder. This makes creating a symlink at the top level impossible; symlinking would have to treat the nvidia packages in a special way (create the parent folder, then create symlinks for the subfolders).
That special treatment does exist, but only for PEP 420 namespace packages (packages without an __init__.py), not for one special package, and that is not going to change. The best way to fix it is to remove nvidia/__init__.py from the nvidia distribution.
@michaelze It seems to work for me with pdm config install.cache_method pth, but symlink fails as mentioned above.
If you install torch via PyPI, the full version name is 1.13.x+cu117 and the following CUDA dependencies will be shipped together with the torch installation (those under the nvidia folder):
- cublas
- cuda_nvrtc
- cuda_runtime
- cudnn
See this function https://github.com/pytorch/pytorch/blob/v1.13.1/torch/__init__.py#L163.
When importing torch==1.13.x (for >=2.0.0 the loading mechanism is different), the logic is:

1. Load libtorch_global_deps.so; the following CUDA dependencies are checked during this procedure: libcublas.so.11, libcudnn.so.8 and libnvToolsExt.so.1.
2. If an OSError occurs, check whether it is caused by a missing libcublas.so.11; if so, search for libcublas and libcudnn by exploiting sys.path, otherwise raise the error.

If you have a local CUDA toolkit 11.7 installation (it may also work for 11.8, as long as libcublas.so.11 can be found) and have configured LD_LIBRARY_PATH correctly, all of these CUDA dependencies can in fact be found except cudnn. So regardless of your install.cache_method setting, the first three (cublas, cuda_nvrtc, cuda_runtime) can always be found. But, magically, because of the above logic, when the OSError is caused by the missing cudnn while libcublas is present, it will not search for cudnn at all.
If you don't have a local toolkit installation, libcublas is also missing in the first place, and then the pth config can help the program find all the CUDA dependencies.
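For illustration, a hedged sketch of that gating logic (paraphrased from the description above, not verbatim torch code; preload_fallback stands in for the sys.path search sketched earlier):

```python
import ctypes

def load_global_deps(preload_fallback):
    try:
        # Loading this library pulls in libcublas.so.11, libcudnn.so.8 and
        # libnvToolsExt.so.1 via the dynamic linker.
        ctypes.CDLL("libtorch_global_deps.so", mode=ctypes.RTLD_GLOBAL)
    except OSError as err:
        # The fallback only fires when libcublas itself is the missing
        # library; a missing libcudnn while libcublas is present re-raises
        # instead -- exactly the "magic" described above.
        if "libcublas.so.11" in str(err):
            preload_fallback()
        else:
            raise
```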
It seems the torch wheel installed from the PyTorch website always has the necessary cudnn.so.* file under the torch/lib directory, which doesn't exist if downloaded from PyPI. If you have a local CUDA installation, try downloading from their website, e.g.:

```
pdm add https://download.pytorch.org/whl/cu117/torch-1.13.1%2Bcu117-cp310-cp310-linux_x86_64.whl
```
The disadvantage is that torch cannot be cached this way, but if you are using torch from PyPI without a local toolkit installation, I'm not sure whether all of torch's functionality can be used.
It doesn't work with the latest pdm or pytorch. If there really is a problem on the nvidia side, pytorch users will be happier if there is some way to compromise.
I'm compelled to create a script like this that copies the libraries directly into the cache:
```bash
cp -r /home/user/.cache/pdm/packages/nvidia_nccl_cu12-2.18.1-py3-none-manylinux1_x86_64/lib/nvidia/nccl /home/user/.cache/pdm/packages/nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64/lib/nvidia/
cp -r /home/user/.cache/pdm/packages/nvidia_nvtx_cu12-12.1.105-py3-none-manylinux1_x86_64/lib/nvidia/nvtx /home/user/.cache/pdm/packages/nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64/lib/nvidia/
cp -r /home/user/.cache/pdm/packages/nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64/lib/nvidia/cufft /home/user/.cache/pdm/packages/nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64/lib/nvidia/
...
```
It works, but the user shouldn't have to ask for this. For example, would it be possible to add a workaround that downloads only the libraries from nvidia (explicitly named, e.g. in pdm.toml) directly, instead of using a symlink (cache_method)?
By the way, in my environment, pdm config install.cache_method pth did not work.
Can anyone in this thread check if the issue still exists on the latest PDM? Much appreciated.
Yes, this occurs even in the latest PDM (2.10.3).
Fine, I'll paste the code comment to give more insight into why it happens: https://github.com/pdm-project/pdm/blob/837e7d076502c227adcf0b5a4c44836cbab333bd/src/pdm/installers/installers.py#L75-L82
PDM only looks at children if the parent directory is a namespace package, and PDM essentially detects a namespace package by the absence of an __init__.py (PEP 420; see the linked comment for the exact rules). So if a package breaks that assumption, PDM doesn't know how to create symlinks properly, and I don't think it's something PDM can fix; you need to disable install.cache for it.
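As an illustration of that rule (a hypothetical sketch, not the actual PDM code linked above): a regular package gets one top-level symlink, while a namespace directory is recursed into so that children from different cached wheels can coexist.

```python
import os

def link_package(cache_dir: str, target_dir: str) -> None:
    if os.path.exists(os.path.join(cache_dir, "__init__.py")):
        # nvidia/ ships an __init__.py, so it takes this branch and the
        # whole directory points at a single wheel's cache entry.
        os.symlink(cache_dir, target_dir)
    else:
        # PEP 420 namespace: create a real directory and link each child.
        os.makedirs(target_dir, exist_ok=True)
        for child in os.listdir(cache_dir):
            os.symlink(os.path.join(cache_dir, child),
                       os.path.join(target_dir, child))
```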
Yes, I understand that PDM is NOT the main cause.
However, as a PyTorch user: could torch and nvidia be downloaded or copied directly into __pypackages__ (cp -r ~/.cache/pdm/packages/somelib/somelib), while the other libraries are placed under __pypackages__ using symlinks? Is that difficult to realize? (If it isn't considered important enough, feel free to close this.)
The former has the problem that _C.cpython-3x-x86_64-linux-gnu.so under the symlinked __pypackages__/3.x/lib/torch cannot find the .so libraries it depends on in __pypackages__/3.x/lib/nvidia; the latter has the nvidia folder structure problem described above. Both are caused by the symlinks.
The main cause is that nvidia is a normal package with a blank __init__.py, in which case PDM creates a single symlink for the whole directory. Maybe we can implement a different link strategy that forces PDM to create a symlink for each individual file.
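A rough sketch of such a per-file strategy (hypothetical code, not PDM's implementation; this idea later shipped as install.cache_method = symlink_individual, mentioned below): directories are created for real and every file is linked separately, so several wheels can populate nvidia/ together.

```python
import os

def link_individual(cache_dir: str, target_dir: str) -> None:
    for root, _dirs, files in os.walk(cache_dir):
        rel = os.path.relpath(root, cache_dir)
        dest_root = os.path.normpath(os.path.join(target_dir, rel))
        # Directories are real, so wheels sharing nvidia/ no longer collide.
        os.makedirs(dest_root, exist_ok=True)
        for name in files:
            dest = os.path.join(dest_root, name)
            if not os.path.exists(dest):
                os.symlink(os.path.join(root, name), dest)
```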
The PyTorch-side problem is clearly a different issue.
This occurs when lib/torch is symlinked, regardless of whether lib/nvidia is a real directory or a symlink.
It might be better to open a separate new issue for this.
The solution is to copy everything (without using the cache method), but I think I would like to take advantage of the wonderful feature of linking from the cache.
For anyone coming here off search engines... I wiped my lock file and .venv, and the following worked for me (thanks to #2425!):

```
pdm config --local install.cache_method symlink_individual
```
Still doesn't work for pytorch 2.2.0 and the latest pdm. I tried symlink_individual, hardlink and pth (I can't find it in the documentation; maybe it was removed in a newer version of pdm?) and none of them worked.
Still exists with pytorch 2.2 and pdm 2.12.3; see https://github.com/pdm-project/pdm/issues/2614.
Steps to reproduce

1. pdm add torch (1.13.1 is the latest version currently).
2. python -c 'import torch'

Actual behavior

Importing torch fails because the bundled CUDA shared libraries cannot be found.

Expected behavior

PyTorch should be imported without any errors.

Environment Information
I "think" this is related to the fact that PyTorch 1.13.x introduced a new set of dependencies around cuda (https://github.com/pytorch/pytorch/pull/85097). Poetry had issues b/c of this (https://github.com/pytorch/pytorch/issues/88049) but it's since been resolved, but not for pdm. My guess is that it might be b/c pdm installs the cuda dependencies separately from pytorch and b/c of that the pytorch installation doesn't know about them. It's a bummer, b/c I wanted to give pdm a spin for a new project, for now I'm going to have to stick to poetry. :/