google-deepmind / alphafold

Open source code for AlphaFold.
Apache License 2.0

Alphafold runs will not find the GPU #1029

Open tuttlelm opened 1 week ago

tuttlelm commented 1 week ago

Sometime in the past several months, my AlphaFold install stopped being able to find and use the GPU (NVIDIA RTX A4500, NVIDIA-SMI 535.183.06, driver version 535.183.06, CUDA version 12.2).

I have attempted a fresh install, but still no luck.

I am able to have docker find the GPU using the following command:

docker run --rm --gpus all nvidia/cuda:12.2.2-cudnn8-runtime-ubuntu20.04 nvidia-smi

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.06             Driver Version: 535.183.06   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A4500               On  | 00000000:01:00.0  On |                  Off |
| 30%   34C    P8              23W / 200W |    818MiB / 20470MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
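Note that the nvidia-smi check above only shows that Docker can pass the GPU through to a container. It does not exercise the CUDA library loading that JAX performs at startup. A minimal sketch of a JAX-level check follows, meant to be run with the Python interpreter inside the AlphaFold image (assuming JAX is importable there). The helper name `describe_jax_backend` is hypothetical and is not part of AlphaFold.

```python
def describe_jax_backend():
    """Return (backend_name, device_strings).

    Returns (None, [reason]) when JAX cannot be imported or cannot
    initialize any backend at all.
    """
    try:
        import jax  # assumption: JAX is installed in this environment
    except ImportError as exc:
        return None, [f"jax not importable: {exc}"]
    try:
        devices = jax.devices()  # devices of the default backend
    except RuntimeError as exc:
        return None, [f"jax could not initialize a backend: {exc}"]
    return jax.default_backend(), [str(d) for d in devices]


if __name__ == "__main__":
    backend, devices = describe_jax_backend()
    print("default backend:", backend)
    for device in devices:
        print("  ", device)
```

If this reports `cpu` as the default backend while nvidia-smi works in the same container, the problem is more likely on the JAX/CUDA library side than in the Docker GPU passthrough.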

During the install I had to apply the NVIDIA Docker cgroup issue fix referenced in the README (https://github.com/NVIDIA/nvidia-docker/issues/1447#issuecomment-801479573) and modify the Dockerfile according to another issue (https://github.com/google-deepmind/alphafold/issues/945).

When I submit a run, I get the errors below. The run proceeds, but only on the CPU, so it takes forever.


I1014 09:03:18.529073 128453379199424 run_docker.py:258] /bin/bash: /opt/conda/lib/libtinfo.so.6: no version information available (required by /bin/bash)
I1014 09:03:23.667894 128453379199424 run_docker.py:258] I1014 16:03:23.667354 129322417205888 xla_bridge.py:863] Unable to initialize backend 'cuda': jaxlib/cuda/versions_helpers.cc:98: operation cuInit(0) failed: Unknown CUDA error 303; cuGetErrorName failed. This probably means that JAX was unable to load the CUDA libraries.
I1014 09:03:23.668071 128453379199424 run_docker.py:258] I1014 16:03:23.667572 129322417205888 xla_bridge.py:863] Unable to initialize backend 'rocm': NOT_FOUND: Could not find registered platform with name: "rocm". Available platform names are: CUDA
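Of the two backend messages above, the 'rocm' one looks expected on an NVIDIA setup, since the log itself lists CUDA as the only available platform. The 'cuda' failure (cuInit(0), error 303) is the one that forces the CPU fallback. As a rough sketch, the failures can be pulled out of a longer run_docker.py log like this. The names `backend_failures` and `SAMPLE_LOG` are hypothetical, and the sample messages are copied from the log above.

```python
import re

# Matches the JAX message "Unable to initialize backend '<name>': <reason>".
BACKEND_FAILURE = re.compile(
    r"Unable to initialize backend '(?P<backend>[^']+)': (?P<reason>.*)"
)


def backend_failures(log_text):
    """Map each backend that failed to initialize to its reported reason."""
    failures = {}
    for line in log_text.splitlines():
        match = BACKEND_FAILURE.search(line)
        if match:
            failures[match.group("backend")] = match.group("reason").strip()
    return failures


# Message portions taken from the log lines in this issue.
SAMPLE_LOG = "\n".join([
    "xla_bridge.py:863] Unable to initialize backend 'cuda': "
    "jaxlib/cuda/versions_helpers.cc:98: operation cuInit(0) failed: "
    "Unknown CUDA error 303; cuGetErrorName failed. This probably means "
    "that JAX was unable to load the CUDA libraries.",
    "xla_bridge.py:863] Unable to initialize backend 'rocm': NOT_FOUND: "
    "Could not find registered platform with name: \"rocm\". "
    "Available platform names are: CUDA",
])

if __name__ == "__main__":
    for backend, reason in backend_failures(SAMPLE_LOG).items():
        print(f"{backend}: {reason}")
```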

Any recommendations are welcome. Thanks!

tiburonpiwi commented 13 hours ago

Hi, I get the same error with CUDA 12.6, driver 560.35.03, and 4 NVIDIA L40S GPUs. The outputs of nvidia-smi and of docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi are both OK. Any help is welcome. Thanks.