Open Linardos opened 1 week ago
Were you able to install PyTorch?
Yes
@szmazurek: can you think of anything for this? I am unable to replicate it on 3 machines (Windows, Ubuntu, Mint). I have put together a small script to get some debugging information from the environment here. Can you think of anything else to add?
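For readers following along, here is a minimal sketch of what an environment-debugging script of this kind could collect. It is not the script linked above; the function name `collect_env_info` and the choice of fields are assumptions made for illustration, based on what turns out to matter later in this thread (nvcc, CUDA_HOME, and the PyTorch build).

```python
import os
import platform
import shutil
import sys


def collect_env_info():
    """Gather a few facts that are relevant to CUDA-related build failures."""
    info = {
        "platform": platform.platform(),
        "python": sys.version.split()[0],
        # Full path to nvcc if it is on PATH, otherwise None.
        "nvcc_on_path": shutil.which("nvcc"),
        "CUDA_HOME": os.environ.get("CUDA_HOME"),
        "CUDA_PATH": os.environ.get("CUDA_PATH"),
        "torch": None,
        "torch_cuda": None,
    }
    try:
        import torch  # optional: only reported if it is installed

        info["torch"] = torch.__version__
        info["torch_cuda"] = torch.version.cuda
    except ImportError:
        pass
    return info


if __name__ == "__main__":
    for key, value in collect_env_info().items():
        print(f"{key}: {value}")
```

Running the file prints one `key: value` line per field, which can be pasted straight into an issue.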
Yeah, so with PyPI I can imagine that; afaik we have not built and uploaded the package there. Regarding the installation from source, it seems that you are missing the Nvidia compiler (nvcc), which is apparently needed by the deepspeed dependency. Can you check if nvcc is installed, @Linardos? If not, perhaps installing it would do the trick. The next thing to check is the PATH setting: ensure that all Nvidia-related binaries are accessible.
If NVCC is needed, perhaps it might make sense to include it in the documentation. I believe installing one of the following (based on the user's system) should be fine:
Thanks for helping us catch this, @Linardos! I am guessing that since all of my (and Szymon's) machines are set up for development, nvcc is automatically found and we don't encounter this.
Relevant issue from DeepSpeed: https://github.com/microsoft/DeepSpeed/issues/2772
EDIT: I also found a cuda-python package on pip but I think that's only for CUDA12.
Yeah, this would indeed be needed. @Linardos, can you confirm whether the issue opened by @sarthakpati (#25) will address that?
I just installed it through pip, but that doesn't seem to solve it. I have CUDA 12.4 on my machine:
(gsynth) locolinux2@IN-OTA-232347:~/GaNDLF-Synth$ pip install nvidia-cuda-nvcc-cu12
Requirement already satisfied: nvidia-cuda-nvcc-cu12 in /home/locolinux2/miniconda3/envs/gsynth/lib/python3.9/site-packages (12.6.77)
(gsynth) locolinux2@IN-OTA-232347:~/GaNDLF-Synth$ pip install .
Processing /home/locolinux2/GaNDLF-Synth
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... done
Collecting GANDLF@ git+https://github.com/mlcommons/GandLF.git@master (from gandlf_synth==0.0.1.dev0)
Cloning https://github.com/mlcommons/GandLF.git (to revision master) to /tmp/pip-install-kuttpdr9/gandlf_8583bbd08e34436b9794e4167f37ac38
Running command git clone --filter=blob:none --quiet https://github.com/mlcommons/GandLF.git /tmp/pip-install-kuttpdr9/gandlf_8583bbd08e34436b9794e4167f37ac38
Resolved https://github.com/mlcommons/GandLF.git to commit 709f6ab59e57782f0b1937b24a1d8a85cd222c42
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... done
Collecting black==23.11.0 (from gandlf_synth==0.0.1.dev0)
Using cached black-23.11.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (66 kB)
Collecting lightning==2.4.0 (from gandlf_synth==0.0.1.dev0)
Using cached lightning-2.4.0-py3-none-any.whl.metadata (38 kB)
Collecting monai-generative==0.2.3 (from gandlf_synth==0.0.1.dev0)
Using cached monai_generative-0.2.3-py3-none-any.whl.metadata (4.6 kB)
Collecting deepspeed==0.15.1 (from gandlf_synth==0.0.1.dev0)
Using cached deepspeed-0.15.1.tar.gz (1.4 MB)
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error
× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [8 lines of output]
Traceback (most recent call last):
File "<string>", line 2, in <module>
File "<pip-setuptools-caller>", line 34, in <module>
File "/tmp/pip-install-kuttpdr9/deepspeed_fd7409a5b2dd41cda27dd8d978d665d2/setup.py", line 108, in <module>
cuda_major_ver, cuda_minor_ver = installed_cuda_version()
File "/tmp/pip-install-kuttpdr9/deepspeed_fd7409a5b2dd41cda27dd8d978d665d2/op_builder/builder.py", line 51, in installed_cuda_version
raise MissingCUDAException("CUDA_HOME does not exist, unable to compile CUDA op(s)")
op_builder.builder.MissingCUDAException: CUDA_HOME does not exist, unable to compile CUDA op(s)
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
× Encountered error while generating package metadata.
╰─> See above for output.
note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
However, I installed nvcc through sudo apt install nvidia-cuda-toolkit instead and that worked:
(gsynth) locolinux2@IN-OTA-232347:~/GaNDLF-Synth$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0
(gsynth) locolinux2@IN-OTA-232347:~/GaNDLF-Synth$ pip install .
Processing /home/locolinux2/GaNDLF-Synth
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... done
Collecting GANDLF@ git+https://github.com/mlcommons/GandLF.git@master (from gandlf_synth==0.0.1.dev0)
Cloning https://github.com/mlcommons/GandLF.git (to revision master) to /tmp/pip-install-1letf4zy/gandlf_9ec66c16bdc542989ba33faa0c893907
Running command git clone --filter=blob:none --quiet https://github.com/mlcommons/GandLF.git /tmp/pip-install-1letf4zy/gandlf_9ec66c16bdc542989ba33faa0c893907
Resolved https://github.com/mlcommons/GandLF.git to commit 709f6ab59e57782f0b1937b24a1d8a85cd222c42
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... done
Collecting black==23.11.0 (from gandlf_synth==0.0.1.dev0)
Using cached black-23.11.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (66 kB)
Collecting lightning==2.4.0 (from gandlf_synth==0.0.1.dev0)
Using cached lightning-2.4.0-py3-none-any.whl.metadata (38 kB)
Collecting monai-generative==0.2.3 (from gandlf_synth==0.0.1.dev0)
Using cached monai_generative-0.2.3-py3-none-any.whl.metadata (4.6 kB)
Collecting deepspeed==0.15.1 (from gandlf_synth==0.0.1.dev0)
Using cached deepspeed-0.15.1.tar.gz (1.4 MB)
Preparing metadata (setup.py) ... done
Collecting click>=8.0.0 (from black==23.11.0->gandlf_synth==0.0.1.dev0)
Using cached click-8.1.7-py3-none-any.whl.metadata (3.0 kB)
...
it seems to have been installed successfully.
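The two logs above differ only in whether a CUDA root could be located. As a rough sketch, the lookup below mirrors the order in which, as far as I understand it, PyTorch resolves `CUDA_HOME` (which the DeepSpeed setup step then reads): environment variables first, then an `nvcc` found on PATH, then a default location. The function name `guess_cuda_home` and its parameters are hypothetical, and the actual library code may differ between versions.

```python
import os
import shutil


def guess_cuda_home(environ=None, which=shutil.which, default="/usr/local/cuda"):
    """Return a likely CUDA root directory, or None if nothing is found."""
    environ = os.environ if environ is None else environ

    # 1. An explicit environment variable wins.
    home = environ.get("CUDA_HOME") or environ.get("CUDA_PATH")
    if home:
        return home

    # 2. Otherwise derive the root from an nvcc on PATH: <root>/bin/nvcc -> <root>.
    nvcc = which("nvcc")
    if nvcc:
        return os.path.dirname(os.path.dirname(nvcc))

    # 3. Finally fall back to the conventional Linux location, if it exists.
    if os.path.exists(default):
        return default
    return None


if __name__ == "__main__":
    print("Guessed CUDA home:", guess_cuda_home())
```

If this prints `None`, a source build that needs the CUDA compiler would be expected to fail in the same way as the first log; putting `nvcc` on PATH or exporting `CUDA_HOME` are the two levers this sketch suggests.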
I think the "solution" of doing sudo install anything is inherently problematic (security issues, and not all folks might have root level access). Is there any way we can check if this would work using conda instead?
This one should work, so maybe add that step to the README (I didn't test it, but these seem to be the standard steps with conda):
conda install -c nvidia cudatoolkit
Verify your installation with
nvcc --version
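The verification step can also be done programmatically, for example to compare the nvcc release against the CUDA version of the driver or the PyTorch build (earlier in this thread the machine reported CUDA 12.4 while the apt-installed nvcc reported release 11.5). This is a minimal sketch; the helper names `parse_nvcc_release` and `installed_nvcc_release` are hypothetical, and the regular expression assumes the "release X.Y" wording seen in the nvcc output pasted above.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(version_text):
    """Extract (major, minor) from the text printed by `nvcc --version`."""
    match = re.search(r"release (\d+)\.(\d+)", version_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def installed_nvcc_release():
    """Run nvcc if it is on PATH and return its release, else None."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run(
        [nvcc, "--version"], capture_output=True, text=True, check=False
    )
    return parse_nvcc_release(result.stdout)


if __name__ == "__main__":
    print("nvcc release:", installed_nvcc_release())
```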
Cool. In this case, we need to have an explicit dependency on conda.
I do not think that requiring nvcc as an underlying requirement is problematic from the user's perspective; it is basically something you need alongside the CUDA drivers for this package. Falling back to conda is one solution, but I would not push it as the only go-to option, rather as a workaround (it can also be included in the container).
Since it is on the user-level, I think conda should be the primary solution. Anything that is system-level (i.e., sudo install or equivalent) should be the fallback.
I followed the installation steps exactly as described, but sadly none of the options work:
The package is in neither pip nor conda:
But it doesn't work through cloning and installing directly either: