Ahmedalaraby20 opened 1 year ago
I'm getting these same errors ("Unable to initialize backend 'tpu_driver'"), and the system hard crashes shortly afterwards.
Same here...
What is your CUDA version in nvidia-smi? Mine is 12.1.
Hi @kbrunnerLXG, that's me:
Tue Jun 13 07:59:13 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.116.04 Driver Version: 525.116.04 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 Off | N/A |
| N/A 42C P3 10W / 55W | 1233MiB / 8188MiB | 14% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2927 G /usr/lib/xorg/Xorg 87MiB |
| 0 N/A N/A 10153 C ...esources/app/bin/rsession 1142MiB |
+-----------------------------------------------------------------------------+
Same here. I am getting the following errors:
I0718 14:58:31.190772 139904529979200 run_docker.py:258] I0718 12:58:31.189951 140070948725248 xla_bridge.py:353] Unable to initialize backend 'tpu_driver': NOT_FOUND: Unable to find driver in registry given worker:
I0718 14:58:31.404395 139904529979200 run_docker.py:258] I0718 12:58:31.403756 140070948725248 xla_bridge.py:353] Unable to initialize backend 'rocm': NOT_FOUND: Could not find registered platform with name: "rocm". Available platform names are: Host CUDA Interpreter
I0718 14:58:31.404671 139904529979200 run_docker.py:258] I0718 12:58:31.404166 140070948725248 xla_bridge.py:353] Unable to initialize backend 'tpu': module 'jaxlib.xla_extension' has no attribute 'get_tpu_client'
The prediction does not crash, but it is extremely slow (half an hour for a 500-residue protein).
nvidia-smi tells me there is a python process running, but I am having a hard time believing that the AlphaFold subprocesses are correctly using the GPU. The Docker image was built with the default CUDA 11.0 from the installation instructions, but I do have CUDA 12.0 on my system...
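One way to check which backend JAX actually picked is to ask it directly. A hedged sketch, assuming jax is installed where you run it (e.g. inside the AlphaFold container):

```python
# Sanity check: ask JAX which backend it selected. If CUDA is usable,
# default_backend() reports "gpu" (or "cuda" in newer JAX releases);
# a silent fallback to CPU shows up here as "cpu".
try:
    import jax
    backend = jax.default_backend()
    devices = jax.devices()
except ImportError:
    backend, devices = None, []

print("JAX backend:", backend)
print("JAX devices:", devices)
```

If this prints "cpu" when run inside the container, the CUDA/jaxlib pairing baked into the image is the likely culprit rather than the driver on the host.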
Exactly the same issue here: same errors, and while it runs, things are breathtakingly slow.
My nvidia-smi output is below; pid 272923 is the python /app/alphafold/run_alphafold.py process.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.116.04 Driver Version: 525.116.04 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX A4500 On | 00000000:01:00.0 Off | Off |
| 30% 29C P8 24W / 200W | 1982MiB / 20470MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA RTX A4500 On | 00000000:2C:00.0 Off | Off |
| 30% 36C P8 19W / 200W | 190MiB / 20470MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA RTX A4500 On | 00000000:41:00.0 Off | Off |
| 30% 36C P8 18W / 200W | 190MiB / 20470MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA RTX A4500 On | 00000000:61:00.0 Off | Off |
| 30% 33C P8 15W / 200W | 200MiB / 20470MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2916 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 272923 C python 182MiB |
| 1 N/A N/A 2916 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 272923 C python 182MiB |
| 2 N/A N/A 2916 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 272923 C python 182MiB |
| 3 N/A N/A 2916 G /usr/lib/xorg/Xorg 10MiB |
| 3 N/A N/A 3093 G /usr/bin/gnome-shell 4MiB |
| 3 N/A N/A 272923 C python 182MiB |
+-----------------------------------------------------------------------------+
Same here. I have no idea whether this is due to ROCm or to JAX.
I will escalate this issue too. Does anyone have a fix?
I can't answer the AlphaFold question specifically, but those "Unable to initialize backend" messages aren't errors. They're only informational logs emitted as XLA iterates through candidate backends until it finds one that works. If you set JAX_PLATFORMS=cuda, you should find that those messages disappear.
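The suggestion above can be sketched like this. An illustrative snippet, assuming jax is installed; the key detail is that JAX_PLATFORMS must be set before jax is first imported, or it has no effect:

```python
import os

# Pin XLA to the CUDA backend so it stops probing tpu_driver/rocm/tpu
# and logging "Unable to initialize backend ..." for each miss.
os.environ["JAX_PLATFORMS"] = "cuda"

try:
    import jax
    devices = jax.devices()
    print("CUDA devices:", devices)
except Exception as exc:  # ImportError, or RuntimeError if no CUDA backend exists here
    devices = None
    print("could not query CUDA devices:", exc)
```

When launching through run_docker.py, the equivalent is exporting the variable into the container's environment; either way, the point is that the backend-probe messages are cosmetic, not the cause of the slowdown.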
I don't think AlphaFold is using my GPU either. I see a "--cpu 8 -N 1" line in the output, and when I execute nvidia-smi it says "No running processes found". It's taking more than 35 minutes now to run a 235-residue monomer (GFP). I'm using CUDA 12.4 and Ubuntu 22.04.
Hey guys, I get this when I run AlphaFold. I am not sure whether AlphaFold is running on my GPU or my CPU. This is what I get when I run nvidia-smi. Thanks a lot.