Closed MarkusRainerSchmidt closed 1 year ago
Could you try running with --enable_gpu_relax=false? This forces the relaxation step (which is failing for you) to run on CPU instead of GPU.
Alternatively, you could try turning the relaxation step off completely using --models_to_relax=none.
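For anyone scripting their runs, the two workarounds above could be kept side by side with a small helper like the one below. This is only a sketch: the two flags are the ones quoted in this reply, while the helper name, the dictionary keys, and the placeholder base command are hypothetical and would need to be replaced by whatever command and flags you normally use.

```python
import shlex

# The two flags come from the reply above; the key names are made up for this sketch.
RELAX_WORKAROUNDS = {
    "cpu_relax": ["--enable_gpu_relax=false"],  # run relaxation on CPU instead of GPU
    "skip_relax": ["--models_to_relax=none"],   # turn the relaxation step off completely
}


def with_relax_workaround(base_cmd, workaround):
    """Return a copy of base_cmd with the flags for the chosen workaround appended."""
    if workaround not in RELAX_WORKAROUNDS:
        raise ValueError(f"unknown workaround: {workaround!r}")
    return list(base_cmd) + RELAX_WORKAROUNDS[workaround]


if __name__ == "__main__":
    # Placeholder command: a real run needs your usual input and database flags too.
    base = ["python", "run_alphafold.py"]
    print(shlex.join(with_relax_workaround(base, "cpu_relax")))
```

The helper only builds the argument list; it does not launch anything, so it is safe to try before committing to a long prediction run.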
Thanks for the advice!!
The prediction finished successfully with --enable_gpu_relax=false.
However this seems like an unsatisfactory workaround...
Do you think the error is a CUDA issue or is it rather the RTX A5000 GPU?
It looks like an OpenMM issue (which we use for the last relaxation step), since the model runs fine and uses your GPU. Could you try raising this issue with OpenMM developers? Another possible workaround might be updating your OpenMM installation to 7.7.0 -- maybe that will help.
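One way to check whether OpenMM itself can see the GPU, independently of the model, is to list the platforms it manages to load. The sketch below assumes OpenMM is installed in the same Python environment that runs the relaxation step; the function name is made up for illustration. If "CUDA" is missing from the list, that would be consistent with the relaxation step failing to find a compatible CUDA device. OpenMM also ships its own installation check (python -m openmm.testInstallation in recent releases), which may give more detail.

```python
def list_openmm_platforms():
    """Return the names of the platforms OpenMM can load, or [] if OpenMM is not importable."""
    try:
        import openmm  # package name used by OpenMM 7.6 and later
    except ImportError:
        try:
            from simtk import openmm  # package name used by older releases
        except ImportError:
            return []
    return [
        openmm.Platform.getPlatform(i).getName()
        for i in range(openmm.Platform.getNumPlatforms())
    ]


if __name__ == "__main__":
    platforms = list_openmm_platforms()
    print("OpenMM platforms:", platforms or "none (OpenMM not importable)")
    if "CUDA" not in platforms:
        print("OpenMM did not load a CUDA platform in this environment.")
```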
We got help from the OpenMM devs: https://github.com/openmm/openmm/issues/3950 Now everything is running on GPU.
Great to hear you solved it! Kudos to OpenMM devs.
Hi,
We are trying to install alphafold on a cluster, where we do not have sudo privileges nor are we allowed to use docker. Hence, we are following the setup that can be found here: https://github.com/kalininalab/alphafold_non_docker
AlphaFold runs until it reaches the Restraining step, where it can't find a compatible CUDA device. This happens even though the GPU was found (and, as far as we can tell, used) in an earlier step.
Our GPU is an NVIDIA RTX A5000 with computeCapability: 8.6. We tried both with CUDA 11.3 and 12.0.
55 This issue might suggest that we need CUDA 11.1 instead.
However, the error they observe differs from ours.
Even if this is not a standard installation, do you know if the combination of CUDA 11.3/12.0 with an NVIDIA RTX A5000 GPU causes an issue with alphafold? Do you think this can be solved by installing CUDA 11.1? Would a docker installation solve this?
Here is the error log:
last line above repeats in a similar fashion
last 2 lines above repeat in a similar fashion
last line above repeats in a similar fashion
last 3 lines above repeat till the 37th attempt
Here seems to be the root cause of the error?
last 3 lines above repeat till the 100th attempt