Closed by tuttlelm 4 weeks ago
Hi, same error for me with CUDA 12.6, driver 560.35.03, and 4 NVIDIA L40S GPUs. The outputs of nvidia-smi and docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi are OK. Any help is welcome. Thanks.
There might be an issue with the AlphaFold execution script.
First, verify that the container can access the GPU by opening a shell in it:
docker run --rm -it --gpus all --entrypoint /bin/bash alphafold
Inside the container, check that nvidia-smi works and that JAX can see and use the GPU:
nvidia-smi
python -c "import jax; print('Devices:', jax.devices()); nmp = jax.numpy.ones((20000, 20000)); result = jax.numpy.dot(nmp, nmp); print('Done')"
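If the one-liner is awkward to paste, the same check can be written as a small script. This is only a sketch: the helper names classify_platforms and check_jax are my own, and I am assuming JAX reports a CUDA device with a platform name such as "gpu" or "cuda".

```python
"""Check whether JAX can see a GPU (meant to be run inside the container)."""


def classify_platforms(platforms):
    """Return 'gpu' if any platform name looks like a GPU backend, else 'cpu-only'."""
    # Assumption: GPU backends are reported under one of these names.
    gpu_names = {"gpu", "cuda", "rocm"}
    return "gpu" if any(p.lower() in gpu_names for p in platforms) else "cpu-only"


def check_jax():
    """Print the devices JAX sees and return a short status string."""
    try:
        import jax
    except ImportError:
        return "jax-not-installed"
    devices = jax.devices()
    print("JAX devices:", devices)
    return classify_platforms([d.platform for d in devices])


if __name__ == "__main__":
    print("Status:", check_jax())
```

A status of cpu-only here means JAX fell back to the CPU even though nvidia-smi may still work in the same container.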
If both checks succeed, the issue might be with the docker-py library. You can verify this by running the following test on the host, where docker-py is installed:
import unittest

import docker


class TestDocker(unittest.TestCase):
    def test_docker(self):
        client = docker.from_env()
        device_requests = [
            docker.types.DeviceRequest(
                driver="nvidia",
                capabilities=[["gpu"]],
            )
        ]
        logs = client.containers.run(
            "nvidia/cuda:12.2.2-runtime-ubuntu20.04",
            "nvidia-smi",
            runtime="nvidia",
            device_requests=device_requests,
            remove=True,
        )
        print(logs.decode("utf-8"))


if __name__ == "__main__":
    unittest.main()
If this test runs successfully and prints the nvidia-smi output, docker-py is passing the GPU through correctly and the cause lies elsewhere.
If the test fails, the issue is likely with how docker-py requests GPU devices. You can fix this by modifying the AlphaFold script:
# alphafold/docker/run_docker.py
# Original code - line 232
client = docker.from_env()
device_requests = [
    docker.types.DeviceRequest(driver='nvidia', capabilities=[['gpu']])
] if FLAGS.use_gpu else None

# Modified code: count=-1 requests all available GPUs
client = docker.from_env()
device_requests = (
    [docker.types.DeviceRequest(driver="nvidia", capabilities=[["gpu"]], count=-1)]
    if FLAGS.use_gpu
    else None
)
I encountered this issue when using docker-py==5.0.0 with the latest system Docker version. The exact cause is unclear, but it appears to be related to GPU device recognition between docker-py and the Docker daemon.
The issue can be resolved by adding the count=-1 parameter to the DeviceRequest, which explicitly tells docker-py to use all available GPUs. This seems to be a compatibility issue between specific versions of docker-py and the Docker daemon's GPU handling.
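To see what the two variants of the request actually contain, you can print them on the host. This is a rough sketch, not a confirmed diagnosis: describe_request is a helper I made up, and I am assuming DeviceRequest behaves like a dict that uses the Docker Engine API field names Count and DeviceIDs. In the Engine API a Count of -1 means all GPUs, so one plausible explanation is that omitting count leaves the request asking for zero devices.

```python
def describe_request(request):
    """Summarise how many GPUs a DeviceRequest-like mapping asks for."""
    count = request.get("Count", 0)
    device_ids = request.get("DeviceIDs") or []
    if count == -1:
        return "all GPUs"
    if device_ids:
        return "specific GPUs: " + ", ".join(device_ids)
    return f"{count} GPU(s)"


def main():
    try:
        import docker
    except ImportError:
        print("docker-py is not installed on this host")
        return
    # Same request as the original run_docker.py code (no count given).
    without_count = docker.types.DeviceRequest(
        driver="nvidia", capabilities=[["gpu"]]
    )
    # Same request as the modified code.
    with_count = docker.types.DeviceRequest(
        driver="nvidia", capabilities=[["gpu"]], count=-1
    )
    print("without count:", describe_request(without_count))
    print("with count=-1:", describe_request(with_count))


if __name__ == "__main__":
    main()
```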
If you're experiencing similar issues, try the modification shown in the code above.
I hope this solution works for your case.
If not, please let me know and we can explore other potential solutions. Since the issue appears to depend on the specific docker-py and Docker daemon versions, there may be alternative approaches worth investigating.
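Since the problem looks version-dependent, it helps to include both versions when reporting back. A minimal sketch for collecting them on the host follows; format_versions is a hypothetical helper, and I am assuming the daemon's version info carries Version and ApiVersion keys.

```python
def format_versions(docker_py_version, daemon_info):
    """Build a one-line summary from the docker-py version and the daemon's version dict."""
    engine = daemon_info.get("Version", "unknown")
    api = daemon_info.get("ApiVersion", "unknown")
    return f"docker-py {docker_py_version} | Docker Engine {engine} (API {api})"


def main():
    try:
        import docker
    except ImportError:
        print("docker-py is not installed on this host")
        return
    # from_env() reads DOCKER_HOST and related variables; it needs a reachable daemon.
    client = docker.from_env()
    print(format_versions(docker.__version__, client.version()))


if __name__ == "__main__":
    main()
```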
Thanks so much for the response. Modifying the run_docker.py script with count=-1 seems to have done the trick. I no longer get the initial Unknown CUDA error 303, and the GPU is being used for the runs.
I ran the recommended tests inside the container and those passed. I was not sure how to create the docker-py library test script within the container, so I could not run it there. Running it outside the container just gave a bunch of errors.
Update: I do still have an issue with the minimization portion.
Error loading CUDA module: CUDA_ERROR_UNSUPPORTED_PTX_VERSION (222)
But as others have noted, the GPU isn't really necessary for the relaxation steps, so using --enable_gpu_relax=false has everything running nicely again. Thank goodness!
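For reference, here is a sketch of how that flag fits into a run_docker.py invocation. The build_run_command helper and all paths and dates below are placeholders I made up; only the flag names come from run_docker.py.

```python
import shlex


def build_run_command(fasta_path, data_dir, max_template_date, gpu_relax=False):
    """Return the argument list for run_docker.py with GPU relaxation on or off."""
    return [
        "python3",
        "docker/run_docker.py",
        f"--fasta_paths={fasta_path}",
        f"--max_template_date={max_template_date}",
        f"--data_dir={data_dir}",
        f"--enable_gpu_relax={'true' if gpu_relax else 'false'}",
    ]


if __name__ == "__main__":
    # Placeholder inputs; replace with your own paths and date.
    cmd = build_run_command("/path/to/target.fasta", "/path/to/databases", "2022-01-01")
    print(" ".join(shlex.quote(part) for part in cmd))
```

The script only prints the command so you can inspect it before running it from the AlphaFold repository root.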
Sometime in the past several months, my AlphaFold install stopped being able to find and use the GPU (NVIDIA RTX A4500, NVIDIA-SMI 535.183.06, Driver Version: 535.183.06, CUDA Version: 12.2).
I have been attempting a fresh install, and still no luck.
I am able to have docker find the GPU using the following command:
docker run --rm --gpus all nvidia/cuda:12.2.2-cudnn8-runtime-ubuntu20.04 nvidia-smi
During the install, I had to use the NVIDIA Docker cgroup issue fix referenced in the README (https://github.com/NVIDIA/nvidia-docker/issues/1447#issuecomment-801479573) and modify the Dockerfile according to another issue (https://github.com/google-deepmind/alphafold/issues/945).
When I submit a run I get the errors below. It will run, but only using the CPU, so it takes forever.
Any recommendations are welcome. Thanks!