uw-ipd / RoseTTAFold2


Packaged RF2-linux.yml pins pytorch-cuda=11.7, may lead to issues with CUDA version #25

Open matspunt opened 1 year ago

matspunt commented 1 year ago

Hi,

To users: if RF2 falls back to CPU and torch.cuda.is_available() returns False, read below.

When building your conda environment, make sure the CUDA version found in the RF2 environment (check with which nvcc) is compatible with the pytorch-cuda version in that environment. I.e. if the system CUDA is used, it cannot be greater than 11.7 (check nvidia-smi). If a Python CUDA package is used instead, ensure the cudatoolkit version in your environment matches 11.7. Conda's default behaviour is to install the latest version, cudatoolkit-12.2, which leads to the PyTorch issue.
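A quick way to check this from inside the RF2 environment (just a diagnostic, run in python; it assumes torch at least imports):

import torch

print(torch.__version__)          # installed PyTorch build
print(torch.version.cuda)         # CUDA version this PyTorch was compiled against
print(torch.cuda.is_available())  # False means the CUDA stack and PyTorch don't line up
# compare torch.version.cuda with the "CUDA Version" reported by nvidia-smi;
# the driver must support at least the version PyTorch was built for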

To developers: perhaps a dependency on cudatoolkit=11.7 or cudatoolkit-dev=11.7 can be added to the environment?

Note: I have used CUDA 12.0 successfully (with upgraded pytorch-cuda) and saw no difference in the performance or output of RoseTTAFold2, but I can't comment in detail on that. 11.7 works fine too.

Cheers,

Mats

debadutta-patra commented 1 year ago

Hi @matspunt, The issue you mentioned is not caused by which nvcc is installed, but by the fact that the yml file doesn't pin the pytorch version to look for. As of now conda will try to fetch pytorch=2.1.1, which is not compatible with pytorch-cuda=11.7 (as listed in the yml file). A quick fix is to change pytorch-cuda=11.7 to pytorch-cuda=11.8, which is supported by the current release of pytorch. Your nvcc installation does not need to exactly match the pytorch-cuda version; in fact, nvcc doesn't even need to be in your environment variables for it to work. You can use the instructions on pytorch.org to set it up properly.

Hope this clarifies the issue.

Debadutta

lloydtripp commented 9 months ago

This was very helpful to me in my HPC environment (RIS WUSTL)!

stianale commented 9 months ago

> Hi @matspunt, The issue you mentioned is not caused by which nvcc is installed, but by the fact that the yml file doesn't pin the pytorch version to look for. [...] You can use the instructions on pytorch.org to set it up properly. Hope this clarifies the issue. Debadutta

Wrong. The current yml does not work with the actual RF2 code in this repo.

debadutta-patra commented 9 months ago

> Wrong. The current yml does not work with the actual RF2 code in this repo.

Hey @stianale, as I mentioned in my comment, you need to change pytorch-cuda=11.7 to pytorch-cuda=11.8 in the yml file, or explicitly pin pytorch to the last version that supports pytorch-cuda=11.7. For an easier time you can copy the yml from @lloydtripp's RoseTTAFold2 repository.

stianale commented 9 months ago

> Hey @stianale, as I mentioned in my comment, you need to change pytorch-cuda=11.7 to pytorch-cuda=11.8 in the yml file, or explicitly pin pytorch to the last version that supports pytorch-cuda=11.7. For an easier time you can copy the yml from @lloydtripp's RoseTTAFold2 repository.

For me that yields the following errors:

Downloading and Extracting Packages:

Preparing transaction: done
Verifying transaction: \ 
SafetyError: The package for pytorch located at /home/stian/miniconda3/pkgs/pytorch-2.1.1-py3.10_cuda11.8_cudnn8.7.0_0
appears to be corrupted. The path 'lib/python3.10/site-packages/torch/cuda/memory.py'
has an incorrect size.
  reported size: 34961 bytes
  actual size: 34955 bytes

ClobberError: This transaction has incompatible packages due to a shared path.
  packages: nvidia/linux-64::cuda-cupti-11.8.87-0, nvidia/linux-64::cuda-nvtx-11.8.86-0
  path: 'LICENSE'

ClobberError: This transaction has incompatible packages due to a shared path.
  packages: defaults/linux-64::intel-openmp-2023.1.0-hdb19cb5_46306, defaults/linux-64::llvm-openmp-14.0.6-h9e868ea_0
  path: 'lib/libiomp5.so'

ClobberError: This transaction has incompatible packages due to a shared path.
  packages: defaults/linux-64::intel-openmp-2023.1.0-hdb19cb5_46306, defaults/linux-64::llvm-openmp-14.0.6-h9e868ea_0
  path: 'lib/libomptarget.so'

stianale commented 9 months ago

I got around those errors, but now the same errors that appeared with the old yml file still arise with the new one:

Running on CPU
Traceback (most recent call last):
  File "/media/stian/hgst6tb/OneDrive/DUS/PhD/All_Neis/Representative_genomes/RoseTTAFold2/network/predict.py", line 493, in <module>
    pred.predict(
  File "/media/stian/hgst6tb/OneDrive/DUS/PhD/All_Neis/Representative_genomes/RoseTTAFold2/network/predict.py", line 316, in predict
    torch.cuda.reset_peak_memory_stats()
  File "/home/stian/miniconda3/envs/RF2/lib/python3.10/site-packages/torch/cuda/memory.py", line 307, in reset_peak_memory_stats
    return torch._C._cuda_resetPeakMemoryStats(device)
RuntimeError: invalid argument to reset_peak_memory_stats
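For what it's worth, that crash is predict.py calling torch.cuda.reset_peak_memory_stats() after it has already fallen back to CPU, so the call gets an invalid device. A minimal local guard (my own sketch, not an upstream fix) would be:

import torch

# only reset CUDA memory statistics when a CUDA device is actually available;
# on a CPU-only fallback this call raises "invalid argument to reset_peak_memory_stats"
if torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats()

That only hides the symptom, though; the real problem is that this torch build still cannot see the GPU.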

stianale commented 9 months ago

The RoseTTAFold repos are train wrecks as of now, with recipes not coming close to working with the code provided... Similar, although not identical, issues come up with the RF2NA software, and it feels as if it is up to the users themselves to figure a way out of the incompatibilities.

austinweigle commented 3 months ago

@stianale, I thought I would add to this thread. I was able to get RF2 to install today, August 7th, 2024. I am using a WSL CUDA install of cuda_11.8.r11.8/compiler.31833905_0.

First, I edited @lloydtripp's yml file to read:

name: RF2
channels:
  - pytorch
  - nvidia
  - defaults
  - conda-forge
dependencies:
  - python=3.10
  - pip
  - cudatoolkit=11.8
  - pytorch=2.1.1
  - pytorch-cuda=11.8
  - dglteam/label/cu117::dgl
  - pyg::pyg
  - bioconda::hhsuite
  - pandas=2.2.0

That way, we can have the needed cudatoolkit already installed before we reinstall pytorch. To my understanding, the pytorch error with respect to cuda arises mainly because cuda does not appear available to the pytorch build that the yml installs. Additionally, parts of pytorch that were used to make RF2 functional are already deprecated. Installing RF2 will likely continue to be a serious difficulty. I recommend looking into old forums/github posts, or even looking at the backend of Google colab notebooks; those notebooks have to perform fresh installs of software upon every callable instance, which may provide some clues. Anyway, here is what worked for me.

Then, I did the following steps:

STEP 1. conda install ipython
STEP 2. conda uninstall pytorch
STEP 3. conda uninstall pytorch-cuda
STEP 4. Get the correct pytorch install command from the pytorch website:

/PATH/TO/miniconda3/envs/RF2/bin/pip install torch==2.1.1 torchvision==0.16.1 torchaudio==2.1.1 --index-url https://download.pytorch.org/whl/cu118

STEP 5. Test that cuda is available in ipython:

import torch 
print(torch.cuda.is_available()) # should read "True"
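If that prints True, a couple of optional extra lines (my own sanity check, not part of the original steps) confirm which CUDA build and GPU this torch actually sees:

print(torch.version.cuda)             # CUDA version this torch wheel was built against (expect 11.8 here)
print(torch.cuda.get_device_name(0))  # name of the GPU torch will use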

STEP 6. /PATH/TO/miniconda3/envs/RF2/bin/pip install torchdata
STEP 7. /PATH/TO/miniconda3/envs/RF2/bin/pip install pydantic
STEP 8. Download the correct dgl pip whl corresponding to pytorch 2.1.1 from https://data.dgl.ai/wheels/cu118/repo.html. In the future, this may need to be changed, so you can just look at https://data.dgl.ai/wheels/repo.html. Then, with the whl downloaded:

/PATH/TO/miniconda3/envs/RF2/bin/pip install dgl-2.1.0+cu118-cp310-cp310-manylinux1_x86_64.whl

STEP 9. Now you need to create this file: /PATH/TO/miniconda3/envs/RF2/lib/python3.10/site-packages/torch/utils/_import_utils.py. You can copy the code from this repo [LINK]
STEP 10. Now install the transformer from the RoseTTAFold2/SE3Transformer directory:

/PATH/TO/miniconda3/envs/RF2/bin/pip install --no-cache-dir -r requirements.txt
python setup.py install

Lastly, when actually running the predictions I had to export MKL_NUM_THREADS=1 in the bash script that executes predict.py. Alternatively, you can set this in Python using import mkl and mkl.set_num_threads(1).
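For reference, either approach looks like the following (the mkl call assumes the mkl-service package is installed in the RF2 environment):

# set the environment variable before torch/numpy load MKL
import os
os.environ["MKL_NUM_THREADS"] = "1"

# or, via the mkl-service package, if it is installed
import mkl
mkl.set_num_threads(1)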

I can confirm that this has worked for me. Again, while I propose a solution, there may be some underlying difficulties that vary based on your computing environment. But overall, the main issue is that the pytorch installation directed by the yml file does not natively detect your cuda library. This thread has done a good job of identifying the specific cuda and pytorch versions that are needed. But there may (and likely will) come a time when the default pulls for software grab the wrong dependencies and mess everything up. Here I went directly to the pytorch website for the installation command, and then recreated the deprecated files that are imported by RF2's ./network/predict.py script.

I think we can close this ticket. Hope this helps, Austin