Sometime in the past several months, my AlphaFold install stopped being able to find and use the GPU (NVIDIA RTX A4500, NVIDIA-SMI 535.183.06, Driver Version: 535.183.06, CUDA Version: 12.2).
I have attempted a fresh install, still with no luck.
Docker itself can find the GPU when I run the following command:
docker run --rm --gpus all nvidia/cuda:12.2.2-cudnn8-runtime-ubuntu20.04 nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.06             Driver Version: 535.183.06   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A4500               On  | 00000000:01:00.0  On |                  Off |
| 30%   34C    P8              23W / 200W |    818MiB / 20470MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
When I submit a run, I get the errors below. The run does proceed, but only on the CPU, so it takes forever.
I1014 09:03:18.529073 128453379199424 run_docker.py:258] /bin/bash: /opt/conda/lib/libtinfo.so.6: no version information available (required by /bin/bash)
I1014 09:03:23.667894 128453379199424 run_docker.py:258] I1014 16:03:23.667354 129322417205888 xla_bridge.py:863] Unable to initialize backend 'cuda': jaxlib/cuda/versions_helpers.cc:98: operation cuInit(0) failed: Unknown CUDA error 303; cuGetErrorName failed. This probably means that JAX was unable to load the CUDA libraries.
I1014 09:03:23.668071 128453379199424 run_docker.py:258] I1014 16:03:23.667572 129322417205888 xla_bridge.py:863] Unable to initialize backend 'rocm': NOT_FOUND: Could not find registered platform with name: "rocm". Available platform names are: CUDA
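The failing call in the log is cuInit(0), so one way to narrow this down may be to repeat that call directly, without JAX, from inside the AlphaFold container. The sketch below does that with ctypes. It assumes the driver library is exposed under the usual name libcuda.so.1, and probe_cuda_driver is a hypothetical helper name, not part of AlphaFold or JAX. In the CUDA driver API headers, result code 303 is listed as CUDA_ERROR_SHARED_OBJECT_INIT_FAILED, which would point at a library loading problem inside the container rather than at the GPU itself, but treat that as a lead rather than a diagnosis.

```python
import ctypes


def probe_cuda_driver(lib_name="libcuda.so.1"):
    """Load the CUDA driver library and call cuInit(0).

    Returns (status, detail). status is the integer CUresult returned by
    cuInit, or None if the library could not be loaded at all.
    lib_name is an assumption: the usual soname of the NVIDIA driver library.
    """
    try:
        libcuda = ctypes.CDLL(lib_name)
    except OSError as exc:
        # The dynamic loader could not find or open the library.
        return None, f"could not load {lib_name}: {exc}"

    # CUresult cuInit(unsigned int Flags)
    libcuda.cuInit.argtypes = [ctypes.c_uint]
    libcuda.cuInit.restype = ctypes.c_int
    status = libcuda.cuInit(0)

    if status == 0:
        return status, "cuInit succeeded"
    return status, f"cuInit failed with CUresult {status}"


if __name__ == "__main__":
    status, detail = probe_cuda_driver()
    print(detail)
```

If this reports the same 303 inside the AlphaFold image but succeeds inside the plain nvidia/cuda image, the difference is likely in the image's libraries or environment rather than in the host driver.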
Hi,
I get the same error with CUDA 12.6, driver 560.35.03, and four NVIDIA L40S GPUs. The output of nvidia-smi on the host and of docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi both look fine.
Any help is welcome.
Thanks
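Since nvidia-smi works in a generic container in both reports, it may help to check what JAX itself sees when run with the Python interpreter inside the AlphaFold image. This is a minimal sketch, not an official diagnostic: describe_jax_backend is a hypothetical helper name, and including LD_LIBRARY_PATH is only a guess prompted by the /opt/conda/lib/libtinfo.so.6 warning in the log.

```python
import os


def describe_jax_backend():
    """Summarize which backend and devices JAX picks up in this environment."""
    try:
        import jax
    except ImportError:
        # JAX is not installed in this interpreter, so there is nothing to query.
        return {"jax_installed": False}

    return {
        "jax_installed": True,
        "jax_version": jax.__version__,
        # "cpu" here would match the CPU-only behaviour described in this thread.
        "default_backend": jax.default_backend(),
        "devices": [str(d) for d in jax.devices()],
        # Assumption: the library search path may matter for loading CUDA libraries.
        "ld_library_path": os.environ.get("LD_LIBRARY_PATH", ""),
    }


if __name__ == "__main__":
    for key, value in describe_jax_backend().items():
        print(f"{key}: {value}")
```

A default_backend of "cpu" with no GPU entries in devices would confirm that the problem is inside the container's JAX/CUDA setup rather than in Docker's GPU passthrough.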
During the install I had to use the NVIDIA Docker cgroup issue fix referenced in the README (https://github.com/NVIDIA/nvidia-docker/issues/1447#issuecomment-801479573) and modify the Dockerfile according to another issue (https://github.com/google-deepmind/alphafold/issues/945)
Any recommendations are welcome. Thanks!