Ahmedalaraby20 opened 1 year ago
I'm getting these same errors ("Unable to initialize backend 'tpu_driver'"), and the system hard crashes shortly afterwards.
Same here...
What is your CUDA version in nvidia-smi? Mine is 12.1.
Hi @kbrunnerLXG, that's me:
Tue Jun 13 07:59:13 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.116.04 Driver Version: 525.116.04 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 Off | N/A |
| N/A 42C P3 10W / 55W | 1233MiB / 8188MiB | 14% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2927 G /usr/lib/xorg/Xorg 87MiB |
| 0 N/A N/A 10153 C ...esources/app/bin/rsession 1142MiB |
+-----------------------------------------------------------------------------+
Same here. I am getting the following errors:
I0718 14:58:31.190772 139904529979200 run_docker.py:258] I0718 12:58:31.189951 140070948725248 xla_bridge.py:353] Unable to initialize backend 'tpu_driver': NOT_FOUND: Unable to find driver in registry given worker:
I0718 14:58:31.404395 139904529979200 run_docker.py:258] I0718 12:58:31.403756 140070948725248 xla_bridge.py:353] Unable to initialize backend 'rocm': NOT_FOUND: Could not find registered platform with name: "rocm". Available platform names are: Host CUDA Interpreter
I0718 14:58:31.404671 139904529979200 run_docker.py:258] I0718 12:58:31.404166 140070948725248 xla_bridge.py:353] Unable to initialize backend 'tpu': module 'jaxlib.xla_extension' has no attribute 'get_tpu_client'
The prediction does not crash, but it is extremely slow (half an hour for a 500-residue protein).
nvidia-smi tells me there is a python process running, but I am having a hard time believing that the AlphaFold subprocesses are correctly using the GPU. The Docker image was built with the default CUDA 11.0 from the installation instructions, but I do have CUDA 12.0 on my system...
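One way to check which backend JAX actually picked is to ask it directly. A hedged sketch, assuming jax is installed where you run it (e.g. inside the AlphaFold container):

```python
# Sanity check: ask JAX which backend it selected. If CUDA is usable,
# default_backend() reports "gpu" (or "cuda" in newer JAX releases);
# a silent fallback to CPU shows up here as "cpu".
try:
    import jax
    backend = jax.default_backend()
    devices = jax.devices()
except ImportError:
    backend, devices = None, []

print("JAX backend:", backend)
print("JAX devices:", devices)
```

If this prints "cpu" when run inside the container, the CUDA/jaxlib pairing baked into the image is the likely culprit rather than the driver on the host.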
Exactly the same issue here: same errors, and while it runs, things are breathtakingly slow.
My nvidia-smi output is below; pid 272923 is the python /app/alphafold/run_alphafold.py process.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.116.04 Driver Version: 525.116.04 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX A4500 On | 00000000:01:00.0 Off | Off |
| 30% 29C P8 24W / 200W | 1982MiB / 20470MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA RTX A4500 On | 00000000:2C:00.0 Off | Off |
| 30% 36C P8 19W / 200W | 190MiB / 20470MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA RTX A4500 On | 00000000:41:00.0 Off | Off |
| 30% 36C P8 18W / 200W | 190MiB / 20470MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA RTX A4500 On | 00000000:61:00.0 Off | Off |
| 30% 33C P8 15W / 200W | 200MiB / 20470MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2916 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 272923 C python 182MiB |
| 1 N/A N/A 2916 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 272923 C python 182MiB |
| 2 N/A N/A 2916 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 272923 C python 182MiB |
| 3 N/A N/A 2916 G /usr/lib/xorg/Xorg 10MiB |
| 3 N/A N/A 3093 G /usr/bin/gnome-shell 4MiB |
| 3 N/A N/A 272923 C python 182MiB |
+-----------------------------------------------------------------------------+
Same here. I have no idea whether this is due to ROCm or to JAX.
I will escalate this issue too. Does anyone have a fix?
I can't answer the AlphaFold question specifically, but those "Unable to initialize backend" messages aren't errors. They're only informational logs emitted as XLA iterates through candidate backends until it finds one that works. If you set JAX_PLATFORMS=cuda, you should find that those messages disappear.
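The suggestion above can be sketched like this. An illustrative snippet, assuming jax is installed; the key detail is that JAX_PLATFORMS must be set before jax is first imported, or it has no effect:

```python
import os

# Pin XLA to the CUDA backend so it stops probing tpu_driver/rocm/tpu
# and logging "Unable to initialize backend ..." for each miss.
os.environ["JAX_PLATFORMS"] = "cuda"

try:
    import jax
    devices = jax.devices()
    print("CUDA devices:", devices)
except Exception as exc:  # ImportError, or RuntimeError if no CUDA backend exists here
    devices = None
    print("could not query CUDA devices:", exc)
```

When launching through run_docker.py, the equivalent is exporting the variable into the container's environment; either way, the point is that the backend-probe messages are cosmetic, not the cause of the slowdown.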
I don't think AlphaFold is using my GPU either. I see a "--cpu 8 -N 1" line in the output, and when I execute nvidia-smi it says "No running processes found". It's taking more than 35 minutes now to run a 235-residue monomer (GFP). I'm using CUDA 12.4 and Ubuntu 22.04.
Hey guys, I get this when I run AlphaFold. I am not sure whether AlphaFold is running on my GPU or my CPU. This is what I get when I run nvidia-smi. Thanks a lot.