shenker opened this issue 1 year ago
Agree that we can get some basic GPU metrics from `nvidia-smi`, which should always be available in an environment with Nvidia GPUs, with the caveat that it is not a replacement for the Nvidia profiler.
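For reference, a one-shot query of that kind might look like the following (the fields are standard `nvidia-smi --query-gpu` properties; exactly which ones a built-in collector should record is an open question):

```
nvidia-smi --query-gpu=name,utilization.gpu,utilization.memory,memory.used,memory.total \
    --format=csv,noheader
```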
As @shenker said, it's really necessary to monitor and cap VRAM just like RAM to make better use of the GPU. Programs that use the GPU may only need part of its VRAM and compute power, so if we could set a VRAM budget for different processes, just as we do for RAM, there would be hope of running multiple tasks on the GPU simultaneously. Really looking forward to this feature being implemented.
Is it possible to split GPU memory across multiple tasks in an enforceable way? The only mechanism I know of is NVIDIA's Multi-Instance GPU (MIG), but that is configured by the sysadmin, and Nextflow would then just see multiple smaller GPUs that it could request as normal.
I don't think splitting one GPU into multiple MIG instances is necessary.
Improving the efficiency of GPU utilization on the 'local' platform may be simpler and could be achieved quickly. For local GPU usage, it would only require a "VRAM" directive for the process, similar to the "memory" directive. Based on the total VRAM size set by the user and each process's VRAM directive, Nextflow could then determine whether it can execute more processes.
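To make the idea concrete, here is a minimal sketch of what such a directive could look like. Note that `vram` and the matching executor setting are invented names for illustration only and do not exist in Nextflow today (`accelerator` and `memory` are real directives); the dorado command and `params.dorado_model` are just placeholders.

```nextflow
// Hypothetical sketch only -- 'vram' is NOT an existing Nextflow directive.

// nextflow.config: tell the local executor how much GPU memory it may hand out
executor {
    $local {
        vram = '24 GB'      // invented setting, mirroring executor.$local.memory
    }
}

// main.nf
process basecall {
    accelerator 1           // existing directive: request one GPU
    vram '15 GB'            // invented directive: GPU memory this task needs

    input:
    path pod5_dir

    script:
    """
    dorado basecaller ${params.dorado_model} $pod5_dir > calls.bam
    """
}
```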
On an HPC or cloud platform, however, it can be very challenging to place multiple tasks on the same node. In such cases, Nextflow might need to pack several processes together and submit them as one job, depending on the situation, which would increase the complexity of Nextflow's scheduling.
I haven't considered everything thoroughly, so please advise.
A directive for GPU memory isn't very useful because Nextflow has no way to enforce it. You might as well just use the `maxForks` directive based on how many processes you think you can fit onto your GPU at the same time. Even if a GPU process only uses, e.g., half the VRAM, it could still saturate the CUDA cores or the memory bandwidth, in which case you won't get any more speedup from running additional processes at the same time.
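For concreteness, a minimal example of that suggestion using only existing syntax (the dorado command line and `params.dorado_model` are illustrative stand-ins):

```nextflow
process basecall {
    maxForks 1      // never run more than one instance of this process at a time
    accelerator 1   // real directive; honored by some executors (e.g. AWS Batch, Google Batch, K8s)

    input:
    path pod5_dir

    output:
    path 'calls.bam'

    script:
    """
    dorado basecaller ${params.dorado_model} $pod5_dir > calls.bam
    """
}
```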
I think in the vast majority of cases it is better to have the GPU run one job at a time that is large enough to saturate it either in terms of compute or memory bandwidth (it's almost always the latter).
Based on my current understanding of Nextflow, if multiple processes use the GPU simultaneously with the executor set to `local`, the `maxForks` directive alone is not sufficient to meet the requirements.
For example, say a workflow has two GPU processes, A and B, both set with `maxForks = 1`. Process A uses 15 GB of VRAM, process B also uses 15 GB, and the total VRAM of the GPU is 24 GB. Now, if there are 10 files that need to be processed with the same workflow, Nextflow may well schedule process A and process B to run at the same time, and one of them can then fail due to insufficient VRAM.
Currently, I am using a Redis-based mutual-exclusion lock to work around this issue. However, I would still prefer a more elegant solution using only Nextflow. I would appreciate some advice.
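One Nextflow-only workaround, offered as a sketch rather than a real VRAM control: the local executor schedules tasks against the `cpus`/`memory` it is configured with, so that accounting can be (ab)used to keep the two 15 GB processes from overlapping. The numbers come from the example above, and the GPU commands are placeholders.

```nextflow
// nextflow.config (sketch): model the 24 GB of VRAM as the local executor's "memory"
executor {
    $local {
        memory = '24 GB'
    }
}

// main.nf
process A {
    maxForks 1
    memory '15 GB'      // stands in for 15 GB of VRAM; two such tasks cannot be co-scheduled

    script:
    """
    gpu_tool_a input_a      # placeholder command
    """
}

process B {
    maxForks 1
    memory '15 GB'

    script:
    """
    gpu_tool_b input_b      # placeholder command
    """
}
```

The obvious drawbacks are that the `memory` figures no longer describe real RAM (so CPU tasks that declare `memory` are charged against the same fake 24 GB) and that nothing stops a task from actually allocating more VRAM than it declared.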
New feature
It would be extremely useful if GPU usage metrics were recorded for GPU tasks.
Usage scenario
Using GPU resources efficiently on HPC is often a challenge. For example, basecalling Oxford Nanopore sequencing data using the dorado basecaller often takes quite a bit of tuning to get good performance on HPC, for the following reasons:
1. Duplex mode makes heavy use of random access over thousands of files, resulting in low GPU utilization if the shared filesystem cannot keep up. Being able to monitor GPU utilization would allow detecting and mitigating this issue.
2. Dorado exhibits widely different performance on different GPU hardware, and HPC nodes are often equipped with heterogeneous GPU hardware. When parallelizing dorado jobs, it would be useful to measure the relative performance gaps between different GPU models; this information could be used to fine-tune job GPU requirements. (In SLURM and other cluster managers it is usually possible to specify which GPU hardware you're willing to use for a job, e.g. to exclude very old Nvidia architectures that no longer offer acceptable performance for a particular task; see the sketch below.)
3. Dorado is a heavy user of GPU VRAM, and crashes when it runs out. Monitoring VRAM usage would help users tune dorado parameters to optimize the performance/VRAM trade-off and know which GPU hardware to request from the cluster manager.
These are very common issues when running dorado on HPC (there are tons of issues on dorado's bug tracker; see, e.g., https://github.com/nanoporetech/dorado/issues/68, https://github.com/nanoporetech/dorado/issues/336, https://github.com/nanoporetech/dorado/issues/306). This is just one particular example; I imagine the same basic GPU metrics would be useful for most users running GPU tasks with Nextflow.
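As a concrete illustration of the hardware-selection point (2) above, this is roughly how it is done today via `clusterOptions`; the `a100`/`h100` GRES and feature names are site-specific examples, not something Nextflow defines, and the dorado command and `params.dorado_model` are placeholders:

```nextflow
process basecall {
    executor 'slurm'
    clusterOptions '--gres=gpu:a100:1'      // GRES type name is site-specific
    // Some sites expose GPU models as node features instead:
    // clusterOptions '--gres=gpu:1 --constraint="a100|h100"'

    input:
    path pod5_dir

    script:
    """
    dorado basecaller ${params.dorado_model} $pod5_dir > calls.bam
    """
}
```

Recording which GPU model each task actually ran on would close the loop, making it possible to compare runtimes across hardware and tighten such constraints.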
Suggested implementation
An initial implementation could restrict itself to Nvidia GPUs, since those are overwhelmingly the most important for scientific computing.
Use `nvidia-settings` to record GPU metrics such as utilization and VRAM usage. (A quick Google search turned up this list of ways to programmatically grab GPU metrics: https://unix.stackexchange.com/questions/38560/gpu-usage-monitoring-cuda)
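Until something like this exists in the trace/report machinery, a per-task approximation (a sketch: the 5-second interval, the file name, and `params.dorado_model` are arbitrary placeholder choices) is to sample `nvidia-smi` in the background from the process script and keep the log as a task output:

```nextflow
process basecall {
    accelerator 1

    input:
    path pod5_dir

    output:
    path 'calls.bam'
    path 'gpu_metrics.csv'      // per-task GPU utilization / VRAM samples

    script:
    """
    # Sample GPU metrics every 5 s for the lifetime of the task
    nvidia-smi \\
        --query-gpu=timestamp,name,utilization.gpu,utilization.memory,memory.used,memory.total \\
        --format=csv,noheader \\
        -l 5 > gpu_metrics.csv &
    SMI_PID=\$!

    dorado basecaller ${params.dorado_model} $pod5_dir > calls.bam

    kill \$SMI_PID || true
    """
}
```

A post-processing step could then aggregate these per-task CSVs by GPU model.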
It would be especially useful if, in the report HTML, there were a way to look at all metrics broken down by GPU hardware: perhaps a checkbox list of GPU hardware names, so that as you select or deselect GPU models, the GPU utilization/VRAM plots update accordingly.