Is it possible to run the Docker method if my remote cluster only has cuda 12.1?

piercelab / tcrmodel2

Apache License 2.0

33 stars 6 forks source link

Is it possible to run the Docker method if my remote cluster only has cuda 12.1? #14

Closed nickrrose closed 9 months ago

nickrrose commented 9 months ago

I have been trying to do exactly that for a while now, I have limited docker experience, but the issue I keep getting is the following: Could not load library libcudnn_ops_infer.so.8. Error: libcublas.so.11: cannot open shared object file: No such file or directory

I edited the alphafold docker to use 11.2, but I'm not sure if I need to have our administrator download cuda 11.2 to make this work.

rui-yin commented 9 months ago

Hi @nickrrose ,

Are you trying to install the dependencies for TCRmodel using AlphaFold's docker image? If you are trying to set up the environment, it possible to not use Docker. You can install each packages individually using conda, pip, wget (see Option 2: Step-by-Step Installation in the readme file). That way you would be able to have more flexibility, and be able to work with a different cuda version.

Best, Rui

nickrrose commented 9 months ago

Hello! Unfortunately, I am running TCRmodel2 on nodes that can only be accessed through slurm (the only nodes on my remote cluster with GPU) so I don't think downloading the dependancies in a local conda environment will work (unless I am mistaken). As of now, I am currently just trying modify the alphafold docker to run on CUDA 12.1 (which is what is known to work on our nodes) but I am still having a bit of trouble. Thanks for the help though!

If relevant: my specific issue is with the cuda and cudnn libraries not being accessed correctly Could not load library libcudnn_ops_infer.so.8. Error: libcublas.so.11: cannot open shared object file: No such file or directory

rui-yin commented 9 months ago

Understood. Based on what you described, the simplest approach would be to reach out to your cluster administrator for assistance with the CUDA version. They have direct access to the cluster's configuration and are best equipped to offer efficient help in addressing the specific library issues you're encountering. I'm going to close the ticket for now, but please feel free to reopen if you have any additional questions!