google-deepmind / alphafold

Open source code for AlphaFold 2.

Alphafold runs will not find the GPU #1029

Closed tuttlelm closed 4 weeks ago

tuttlelm commented 1 month ago

Sometime in the past several months, my AlphaFold install stopped being able to find and use the GPU (NVIDIA RTX A4500, NVIDIA driver 535.183.06, CUDA 12.2).

I have tried a fresh install, and still no luck.

Docker itself is able to find the GPU using the following command:

docker run --rm --gpus all nvidia/cuda:12.2.2-cudnn8-runtime-ubuntu20.04 nvidia-smi

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.06             Driver Version: 535.183.06   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A4500               On  | 00000000:01:00.0  On |                  Off |
| 30%   34C    P8              23W / 200W |    818MiB / 20470MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

During the install I had to apply the NVIDIA Docker cgroup fix referenced in the README (https://github.com/NVIDIA/nvidia-docker/issues/1447#issuecomment-801479573) and modify the Dockerfile according to another issue (https://github.com/google-deepmind/alphafold/issues/945).

When I submit a run I get the errors below. It still runs, but only on the CPU, so it takes forever.

I1014 09:03:18.529073 128453379199424 run_docker.py:258] /bin/bash: /opt/conda/lib/libtinfo.so.6: no version information available (required by /bin/bash)
I1014 09:03:23.667894 128453379199424 run_docker.py:258] I1014 16:03:23.667354 129322417205888 xla_bridge.py:863] Unable to initialize backend 'cuda': jaxlib/cuda/versions_helpers.cc:98: operation cuInit(0) failed: Unknown CUDA error 303; cuGetErrorName failed. This probably means that JAX was unable to load the CUDA libraries.
I1014 09:03:23.668071 128453379199424 run_docker.py:258] I1014 16:03:23.667572 129322417205888 xla_bridge.py:863] Unable to initialize backend 'rocm': NOT_FOUND: Could not find registered platform with name: "rocm". Available platform names are: CUDA

Any recommendations are welcome. Thanks!

tiburonpiwi commented 1 month ago

Hi, I get the same error with CUDA 12.6, driver 560.35.03, and 4 NVIDIA L40S GPUs. The outputs of both nvidia-smi and docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi look fine. Any help is welcome. Thanks!

jung-geun commented 1 month ago

There might be an issue with the AlphaFold execution script.

First, verify that the container can access the GPU properly:

docker run --rm -it --gpus all --entrypoint /bin/bash alphafold

Inside the container, check that nvidia-smi works and that the JAX library can see the GPU:

nvidia-smi

python -c "import jax; nmp = jax.numpy.ones((20000, 20000)); print('Device:', nmp.device()); result = jax.numpy.dot(nmp, nmp); print('Done')"
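
If nmp.device() errors out (that accessor has changed across JAX releases), a slightly longer check along the same lines can be saved as a script and run inside the container; jax.default_backend() and jax.devices() are standard JAX calls, and the rest is just an illustrative sketch:

import jax
import jax.numpy as jnp

# With a working CUDA backend this should report "gpu" and list at least one
# GPU device instead of only CPU devices.
print("Backend:", jax.default_backend())
print("Devices:", jax.devices())

# A small matmul forces XLA to actually compile and execute on the default device.
x = jnp.ones((4000, 4000))
jnp.dot(x, x).block_until_ready()
print("Done")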

If these work normally, the issue might be with the docker-py library. You can verify this by running the following test:

import unittest
import docker

class TestDocker(unittest.TestCase):
    def test_docker(self):
        client = docker.from_env()
        # Same GPU request that run_docker.py builds (no explicit device count).
        device_requests = [
            docker.types.DeviceRequest(
                driver="nvidia",
                capabilities=[["gpu"]],
            )
        ]

        # Run nvidia-smi in a CUDA runtime container and print its output.
        logs = client.containers.run(
            "nvidia/cuda:12.2.2-runtime-ubuntu20.04",
            "nvidia-smi",
            runtime="nvidia",
            device_requests=device_requests,
            remove=True,
        )

        print(logs.decode("utf-8"))

if __name__ == "__main__":
    unittest.main()
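
Note that this test talks to the Docker daemon the same way run_docker.py does, so it is meant to be run on the host rather than inside the container: save it there as e.g. test_docker.py (the name is arbitrary), make sure the docker Python package is installed (pip install docker), and run it with python test_docker.py.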

If this test runs successfully and shows nvidia-smi output, look for other potential issues.

If the test fails, the issue is likely with docker-py's GPU device recognition. You can fix this by modifying the AlphaFold script:

# alphafold/docker/run_docker.py

# Original code - line 232
client = docker.from_env()
device_requests = [
    docker.types.DeviceRequest(driver='nvidia', capabilities=[['gpu']])
] if FLAGS.use_gpu else None

# Modified code
client = docker.from_env()
device_requests = [
    docker.types.DeviceRequest(driver='nvidia', capabilities=[['gpu']], count=-1)
] if FLAGS.use_gpu else None

I encountered this issue when using docker-py==5.0.0 with the latest system Docker version. The exact cause is unclear, but it appears to be related to GPU device recognition between docker-py and the Docker daemon.

The issue can be resolved by adding the count=-1 parameter to the DeviceRequest, which explicitly tells docker-py to use all available GPUs. This seems to be a compatibility issue between specific versions of docker-py and the Docker daemon's GPU handling.
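
If you want to confirm the effect of count=-1 outside AlphaFold first, the same request can be made standalone; this is just a minimal sketch of the test above with the extra parameter, using the same CUDA runtime image:

import docker

client = docker.from_env()

# count=-1 explicitly requests all available GPUs, mirroring `docker run --gpus all`.
device_requests = [
    docker.types.DeviceRequest(driver="nvidia", capabilities=[["gpu"]], count=-1)
]

logs = client.containers.run(
    "nvidia/cuda:12.2.2-runtime-ubuntu20.04",
    "nvidia-smi",
    runtime="nvidia",
    device_requests=device_requests,
    remove=True,
)
print(logs.decode("utf-8"))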

If you're experiencing similar issues, try the run_docker.py modification shown above.

I hope this solution works for your case.

If not, please let me know and we can explore other potential solutions. This issue appears to be version-specific between docker-py and Docker daemon, so there might be alternative approaches worth investigating.

tuttlelm commented 1 month ago

Thanks so much for the response. Modifying the run_docker.py script with count=-1 seems to have done the trick. I no longer get the initial Unknown CUDA error 303 and the GPU is being used for the runs.

I ran the recommended tests inside the container and those passed. I was not sure how to create the docker-py library test script within the container, so I could not run it there; running it outside the container just gave a bunch of errors.

Update: I do still have an issue with the energy minimization (relaxation) step: Error loading CUDA module: CUDA_ERROR_UNSUPPORTED_PTX_VERSION (222)

But as others have noted, the GPU isn't really necessary for the relaxation step, so using --enable_gpu_relax=false has everything running nicely again. Thank goodness!
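
For reference, the run command with relaxation forced onto the CPU looks roughly like the following (the paths are placeholders for your own FASTA file and database directory, and the other flags are the usual ones from the README):

python3 docker/run_docker.py --fasta_paths=/path/to/target.fasta --max_template_date=2022-01-01 --data_dir=/path/to/alphafold_databases --enable_gpu_relax=false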