kalininalab / alphafold_non_docker

AlphaFold2 non-docker setup

GPU Utilization verification #63

Open rocketman8080 opened 1 year ago

rocketman8080 commented 1 year ago

Hello, how would I be able to tell whether the GPU is used during the computation? There are some warnings during initialization in the output below that I am unable to interpret.

I see the CPU hitting full throttle, but the GPU stats (at least the ones I know of, shown below) suggest the GPU is not engaged. Do only certain phases of the run use the GPU, or should the GPU be busy uniformly throughout the process?

Also, roughly how long should T1050 take to complete with / without a GPU on an AWS DLAMI instance?

Thank you for your help!

Output observed during a sample run:

bash run_alphafold.sh -d /datavol/af_download_data/ -o /datavol/output/ -f /datavol/input/T1050.fasta -t 2020-05-14
I0613 00:58:20.870254 139965401540416 templates.py:857] Using precomputed obsolete pdbs /datavol/af_download_data//pdb_mmcif/obsolete.dat.
I0613 00:58:23.896280 139965401540416 xla_bridge.py:353] Unable to initialize backend 'tpu_driver': NOT_FOUND: Unable to find driver in registry given worker: 
I0613 00:58:24.890674 139965401540416 xla_bridge.py:353] Unable to initialize backend 'rocm': NOT_FOUND: Could not find registered platform with name: "rocm". Available platform names are: CUDA Host Interpreter
I0613 00:58:24.891157 139965401540416 xla_bridge.py:353] Unable to initialize backend 'tpu': module 'jaxlib.xla_extension' has no attribute 'get_tpu_client'
I0613 00:58:24.891273 139965401540416 xla_bridge.py:353] Unable to initialize backend 'plugin': xla_extension has no attributes named get_plugin_device_client. Compile TensorFlow with //tensorflow/compiler/xla/python:enable_plugin_device set to true (defaults to false) to enable this.
I0613 00:58:40.890466 139965401540416 run_alphafold.py:386] Have 5 models: ['model_1_pred_0', 'model_2_pred_0', 'model_3_pred_0', 'model_4_pred_0', 'model_5_pred_0']
I0613 00:58:40.890670 139965401540416 run_alphafold.py:403] Using random seed 1833629380694816286 for the data pipeline
I0613 00:58:40.890937 139965401540416 run_alphafold.py:161] Predicting T1050
I0613 00:58:40.907537 139965401540416 jackhmmer.py:133] Launching subprocess "/home/ubuntu/miniconda3/envs/alphafold/bin/jackhmmer -o /dev/null -A /tmp/tmpklyon7kf/output.sto --noali --F1 0.0005 --F2 5e-05 --F3 5e-07 --incE 0.0001 -E 0.0001 --cpu 8 -N 1 /datavol/input/T1050.fasta /datavol/af_download_data//uniref90/uniref90.fasta"
I0613 00:58:40.965327 139965401540416 utils.py:36] Started Jackhmmer (uniref90.fasta) query
I0613 01:23:07.388309 139965401540416 utils.py:40] Finished Jackhmmer (uniref90.fasta) query in 1466.423 seconds
I0613 01:23:16.428411 139965401540416 jackhmmer.py:133] Launching subprocess "/home/ubuntu/miniconda3/envs/alphafold/bin/jackhmmer -o /dev/null -A /tmp/tmpqkxr2xv_/output.sto --noali --F1 0.0005 --F2 5e-05 --F3 5e-07 --incE 0.0001 -E 0.0001 --cpu 8 -N 1 /datavol/input/T1050.fasta /datavol/af_download_data//mgnify/mgy_clusters_2022_05.fa"
I0613 01:23:16.484976 139965401540416 utils.py:36] Started Jackhmmer (mgy_clusters_2022_05.fa) query

I0613 02:01:53.529423 139965401540416 utils.py:40] Finished Jackhmmer (mgy_clusters_2022_05.fa) query in 2317.044 seconds
I0613 02:02:32.089717 139965401540416 hhsearch.py:85] Launching subprocess "/home/ubuntu/miniconda3/envs/alphafold/bin/hhsearch -i /tmp/tmp_ldwhf2m/query.a3m -o /tmp/tmp_ldwhf2m/output.hhr -maxseq 1000000 -d /datavol/af_download_data//pdb70/pdb70"
I0613 02:02:32.152268 139965401540416 utils.py:36] Started HHsearch query
I0613 02:04:59.141074 139965401540416 utils.py:40] Finished HHsearch query in 146.988 seconds
I0613 02:05:05.967663 139965401540416 hhblits.py:128] Launching subprocess "/home/ubuntu/miniconda3/envs/alphafold/bin/hhblits -i /datavol/input/T1050.fasta -cpu 4 -oa3m /tmp/tmpl4__239q/output.a3m -o /dev/null -n 3 -e 0.001 -maxseq 1000000 -realign_max 100000 -maxfilt 100000 -min_prefilter_hits 1000 -d /datavol/af_download_data//bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt -d /datavol/af_download_data//uniref30/UniRef30_2021_03"
I0613 02:05:06.033180 139965401540416 utils.py:36] Started HHblits query

And some basic GPU stats below,

(base) ubuntu@alphafold20:/datavol$ nvidia-smi
Tue Jun 13 03:07:03 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   32C    P0    49W / 300W |   2102MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      5250      C   python                           308MiB  |
+-----------------------------------------------------------------------------+

nvidia-smi --format=csv --query-gpu=power.draw,utilization.gpu,fan.speed,temperature.gpu
power.draw [W], utilization.gpu [%], fan.speed [%], temperature.gpu
49.46 W, 0 %, [N/A], 32
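(In case it is useful: the snapshot above is a single point-in-time sample, so for a continuous view one could leave something like the following running in a second terminal. The 10-second interval and the log file name are only illustrative choices; all flags are standard nvidia-smi options.)

# Poll GPU utilization, memory, and power every 10 s and append to a CSV log,
# so it is easy to see when a later stage starts using the GPU.
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used,power.draw \
           --format=csv --loop=10 | tee -a gpu_usage_log.csv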

alonmillet commented 1 year ago

My understanding is that the GPU is not used during the MSA steps, only during structure generation and relaxation. The output you shared stops near the end of the MSA. Based on the first few lines of output it looks like your GPU driver spun up just fine, so your GPU utilization should go up later in the process.
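If you want to double-check that JAX registered the CUDA backend (rather than falling back to CPU), a quick sketch is the one-liner below, run with the alphafold conda environment active. Exact device class names vary between jaxlib versions, but you should see a GPU/CUDA device listed rather than only a CPU device:

# Ask JAX which backend and devices it will dispatch computation to.
python -c "import jax; print(jax.default_backend()); print(jax.devices())"

On a working CUDA setup this should report gpu (or cuda) plus one V100 device. During the jackhmmer/HHblits steps nvidia-smi will still read ~0 % utilization; once the log reaches the model inference stage (the lines mentioning model_1_pred_0 etc.) you should see utilization and GPU memory use jump.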