matspunt opened this issue 1 year ago (Open)
Hi @matspunt,
The issue you mentioned is not caused by which nvcc is installed, but by the fact that the yml file does not pin the pytorch version to look for. As of now conda will try to fetch pytorch=2.1.1, which is not compatible with pytorch-cuda=11.7 (as listed in the yml file). A quick fix is to change pytorch-cuda=11.7 to pytorch-cuda=11.8, which is supported by the current pytorch release. Your nvcc installation does not need to exactly match the pytorch-cuda version; in fact, nvcc does not even need to be in your environment variables for it to work. You can use the instructions on pytorch.org to set it up properly.
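For example, a minimal sketch of the quick fix (the yml filename and the sed approach are assumptions; you can just as well edit the file by hand):

```bash
# Minimal sketch (untested): bump the pytorch-cuda pin in the environment yml.
# Adjust the filename to the yml shipped with your copy of the repo.
sed -i 's/pytorch-cuda=11.7/pytorch-cuda=11.8/' RF2-linux.yml
# Alternative: keep pytorch-cuda=11.7 and instead pin pytorch explicitly to a
# release that still has cu117 builds, e.g. add "- pytorch=2.0.1" (pin is an assumption).
```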
Hope this clarifies the issue.
Debadutta
This was very helpful to me in my HPC environment (RIS WUSTL)!
Wrong. The current yml does not work with the actual RF2 code in this repo.
Hey @stianale, as I mentioned in my comment, you need to change pytorch-cuda=11.7 to pytorch-cuda=11.8 in the yml file, or explicitly pin pytorch to the last version that supports pytorch-cuda=11.7. For an easier time you can copy the yml from @lloydtripp's RosettaFold2 repository.
For me that yields the following errors:
Downloading and Extracting Packages:
Preparing transaction: done
Verifying transaction: \
SafetyError: The package for pytorch located at /home/stian/miniconda3/pkgs/pytorch-2.1.1-py3.10_cuda11.8_cudnn8.7.0_0
appears to be corrupted. The path 'lib/python3.10/site-packages/torch/cuda/memory.py'
has an incorrect size.
reported size: 34961 bytes
actual size: 34955 bytes
ClobberError: This transaction has incompatible packages due to a shared path.
packages: nvidia/linux-64::cuda-cupti-11.8.87-0, nvidia/linux-64::cuda-nvtx-11.8.86-0
path: 'LICENSE'
ClobberError: This transaction has incompatible packages due to a shared path.
packages: defaults/linux-64::intel-openmp-2023.1.0-hdb19cb5_46306, defaults/linux-64::llvm-openmp-14.0.6-h9e868ea_0
path: 'lib/libiomp5.so'
ClobberError: This transaction has incompatible packages due to a shared path.
packages: defaults/linux-64::intel-openmp-2023.1.0-hdb19cb5_46306, defaults/linux-64::llvm-openmp-14.0.6-h9e868ea_0
path: 'lib/libomptarget.so'
I got around those errors, but now the same errors that appeared with the old yml file still arise with the new one:
Running on CPU
Traceback (most recent call last):
File "/media/stian/hgst6tb/OneDrive/DUS/PhD/All_Neis/Representative_genomes/RoseTTAFold2/network/predict.py", line 493, in
The RoseTTAFold repos are train wrecks as of now, with recipes that are nowhere near working with the code provided... Similar, although not identical, issues are faced with the RF2NA software, and it feels as if it is up to the users themselves to figure a way out of the incompatibilities.
@stianale, I thought I would add to this thread. I was able to get RF2 to install today, August 7th, 2024. I am using a WSL CUDA install of cuda_11.8.r11.8/compiler.31833905_0.
First, I edited @lloydtripp's yml file to read:
name: RF2
channels:
- pytorch
- nvidia
- defaults
- conda-forge
dependencies:
- python=3.10
- pip
- cudatoolkit=11.8
- pytorch=2.1.1
- pytorch-cuda=11.8
- dglteam/label/cu117::dgl
- pyg::pyg
- bioconda::hhsuite
- pandas=2.2.0
That way, we can have the needed cudatoolkit already installed before we reinstall pytorch. To my understanding, the pytorch error with respect to cuda arises mainly because cuda does not appear available to the pytorch build that the yml-directed install pulls in. Additionally, parts of pytorch that were used to make RF2 functional are already deprecated, so installing RF2 will likely continue to be a serious difficulty. I recommend looking into old forums/GitHub posts, or even looking at the backend of Google Colab notebooks; those notebooks have to perform fresh installs of the software on every instance, which may provide some clues.
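With the yml above edited, the environment itself is presumably created (or recreated) from it before anything else; a minimal sketch, assuming the edited file was saved as RF2.yml:

```bash
# Create the conda env from the edited yml (the filename is an assumption),
# then activate it; the env name "RF2" comes from the yml above.
conda env create -f RF2.yml
conda activate RF2
```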
Then, I did the following steps:
STEP 1. conda install ipython
STEP 2. conda uninstall pytorch
STEP 3. conda uninstall pytorch-cuda
STEP 4. Get the correct pytorch install command from the pytorch website:
/PATH/TO/miniconda3/envs/RF2/bin/pip install torch==2.1.1 torchvision==0.16.1 torchaudio==2.1.1 --index-url https://download.pytorch.org/whl/cu118
STEP 5. Test that cuda is available in ipython:
import torch
print(torch.cuda.is_available()) # should read "True"
STEP 6. /PATH/TO/miniconda3/envs/RF2/bin/pip install torchdata
STEP 7. /PATH/TO/miniconda3/envs/RF2/bin/pip install pydantic
STEP 8. Download the correct dgl pip whl corresponding to pytorch 2.1.1 from https://data.dgl.ai/wheels/cu118/repo.html. In the future, this may need to be changed, so you can just look at https://data.dgl.ai/wheels/repo.html. Then, with the whl downloaded: /PATH/TO/miniconda3/envs/RF2/bin/pip install dgl-2.1.0+cu118-cp310-cp310-manylinux1_x86_64.whl
STEP 9. Now you need to create this file: /PATH/TO/miniconda3/envs/RF2/lib/python3.10/site-packages/torch/utils/_import_utils.py. You can copy the code from this repo [LINK].
STEP 10. Now install the transformer from the RoseTTAFold2/SE3Transformer directory:
/PATH/TO/miniconda3/envs/RF2/bin/pip install --no-cache-dir -r requirements.txt
python setup.py install
Lastly, when actually running the predictions I had to export MKL_NUM_THREADS=1 in the bash script that executes predict.py. Alternatively, you could set this in Python using import mkl and mkl.set_num_threads(1), as in the sketch below.
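A minimal sketch of both options (the predict.py arguments are placeholders for whatever your run needs):

```bash
# Option A: cap MKL threads via the environment before invoking predict.py
export MKL_NUM_THREADS=1
python network/predict.py  # ...plus your usual arguments

# Option B (alternative, from inside Python, before the heavy imports):
#   import mkl
#   mkl.set_num_threads(1)
```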
I can confirm that this has worked for me. Again, while I pose a solution, there may be underlying difficulties that vary with your computing environment. But overall, the main issue is that the pytorch installation directed by the yml file does not natively see your cuda library. This thread has done a good job identifying the specific cuda and pytorch versions that are needed, but there may (and likely will) come a time when the default pulls grab the wrong dependencies and break everything. Here I went directly to the pytorch website for the installation command, and then recreated the deprecated files that are imported by RF2's ./network/predict.py script.
I think we can close this ticket. Hope this helps, Austin
Hi,
To users: if RF2 defaults to CPU and running torch.cuda.is_available() returns False, read below.

Be careful when building your conda environment that the CUDA version found in the RF2 conda environment (which nvcc) is compatible with the pytorch-cuda version in the environment. I.e. if system CUDA is used, it cannot be greater than 11.7 (see nvidia-smi). If a Python CUDA package is used, ensure the cudatoolkit version in your environment matches 11.7. The default behaviour for conda is to install the latest version, cudatoolkit-12.2, which leads to the PyTorch issue.

To developers: perhaps a dependency on cudatoolkit=11.7 or cudatoolkit-dev=11.7 can be added to the environment?

Note: I have used CUDA 12.0 successfully (with an upgraded pytorch-cuda) and saw no difference in the performance or output of RoseTTAFold2, but I can't comment in detail on that. 11.7 works fine too.

Cheers,
Mats
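P.S. A quick set of sanity checks for the above (a minimal sketch; adapt it to your own setup), run inside the activated RF2 env:

```bash
# CUDA toolkit visible inside the env (compare against the pytorch-cuda pin)
which nvcc && nvcc --version
# Driver-side CUDA ceiling reported by the NVIDIA driver
nvidia-smi
# What pytorch was built against and whether it can actually see the GPU
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
```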