shenker opened this issue 1 year ago
Agree that we can get some basic GPU metrics from `nvidia-smi`, which should always be available in an environment with Nvidia GPUs, with the caveat that it is not a replacement for the Nvidia profiler.
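For reference, a one-shot query of that kind might look like the following (the fields are standard `nvidia-smi --query-gpu` properties; exactly which ones a built-in collector should record is an open question):

```
nvidia-smi --query-gpu=name,utilization.gpu,utilization.memory,memory.used,memory.total \
    --format=csv,noheader
```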
As @shenker said, it's really necessary to monitor and cap VRAM just like RAM to make better use of the GPU. Programs that use the GPU may only need part of its VRAM and compute power, so if we could set a VRAM budget for different processes, just as we do for RAM, there would be hope of running multiple tasks on the GPU simultaneously. Really looking forward to this feature being implemented.
Is it possible to split GPU memory across multiple tasks in an enforceable way? The only mechanism I know of is NVIDIA's Multi-Instance GPU (MIG), but that is configured by the sysadmin, and Nextflow would then just see multiple smaller GPUs that it could request as normal.
I don't think splitting one GPU into multiple MIG instances is necessary.
Improving the efficiency of GPU utilization on the 'local' platform may be simpler and could be achieved quickly. For local GPU usage, it would only require a "VRAM" directive for the process, similar to the "memory" directive. Based on the total VRAM size set by the user and each process's VRAM directive, Nextflow could then determine whether it can execute more processes.
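To make the idea concrete, here is a minimal sketch of what such a directive could look like. Note that `vram` and the matching executor setting are invented names for illustration only and do not exist in Nextflow today (`accelerator` and `memory` are real directives); the dorado command and `params.dorado_model` are just placeholders.

```nextflow
// Hypothetical sketch only -- 'vram' is NOT an existing Nextflow directive.

// nextflow.config: tell the local executor how much GPU memory it may hand out
executor {
    $local {
        vram = '24 GB'      // invented setting, mirroring executor.$local.memory
    }
}

// main.nf
process basecall {
    accelerator 1           // existing directive: request one GPU
    vram '15 GB'            // invented directive: GPU memory this task needs

    input:
    path pod5_dir

    script:
    """
    dorado basecaller ${params.dorado_model} $pod5_dir > calls.bam
    """
}
```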
On an HPC or cloud platform, however, it can be very challenging to place multiple tasks on the same node. In such cases, Nextflow might need to pack several processes together and submit them as one job, depending on the situation, which would increase the complexity of Nextflow's scheduling.
I haven't considered everything thoroughly, so please advise.
A directive for GPU memory isn't very useful because Nextflow has no way to enforce it. You might as well just use the `maxForks` directive based on how many processes you think you can fit onto your GPU at the same time. Even if a GPU process only uses, e.g., half the VRAM, it could still saturate the CUDA cores or the memory bandwidth, in which case you won't get any more speedup from running additional processes at the same time.
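For concreteness, a minimal example of that suggestion using only existing syntax (the dorado command line and `params.dorado_model` are illustrative stand-ins):

```nextflow
process basecall {
    maxForks 1      // never run more than one instance of this process at a time
    accelerator 1   // real directive; honored by some executors (e.g. AWS Batch, Google Batch, K8s)

    input:
    path pod5_dir

    output:
    path 'calls.bam'

    script:
    """
    dorado basecaller ${params.dorado_model} $pod5_dir > calls.bam
    """
}
```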
I think in the vast majority of cases it is better to have the GPU run one job at a time that is large enough to saturate it either in terms of compute or memory bandwidth (it's almost always the latter).
Based on my current understanding of Nextflow, if multiple processes use the GPU simultaneously with the executor set to `local`, the `maxForks` directive alone is not sufficient to meet the requirements.
For example, say a workflow has two GPU processes, A and B, both set with `maxForks = 1`. Process A uses 15 GB of VRAM, process B also uses 15 GB, and the total VRAM of the GPU is 24 GB. Now, if there are 10 files that need to be processed with the same workflow, Nextflow may well schedule process A and process B to run at the same time, and one of them can then fail due to insufficient VRAM.
Currently, I am using a Redis-based mutual-exclusion lock to work around this issue. However, I would still prefer a more elegant solution using only Nextflow. I would appreciate some advice.
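One Nextflow-only workaround, offered as a sketch rather than a real VRAM control: the local executor schedules tasks against the `cpus`/`memory` it is configured with, so that accounting can be (ab)used to keep the two 15 GB processes from overlapping. The numbers come from the example above, and the GPU commands are placeholders.

```nextflow
// nextflow.config (sketch): model the 24 GB of VRAM as the local executor's "memory"
executor {
    $local {
        memory = '24 GB'
    }
}

// main.nf
process A {
    maxForks 1
    memory '15 GB'      // stands in for 15 GB of VRAM; two such tasks cannot be co-scheduled

    script:
    """
    gpu_tool_a input_a      # placeholder command
    """
}

process B {
    maxForks 1
    memory '15 GB'

    script:
    """
    gpu_tool_b input_b      # placeholder command
    """
}
```

The obvious drawbacks are that the `memory` figures no longer describe real RAM (so CPU tasks that declare `memory` are charged against the same fake 24 GB) and that nothing stops a task from actually allocating more VRAM than it declared.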
New feature
It would be extremely useful if GPU usage metrics were recorded for GPU tasks.
Usage scenario
Using GPU resources efficiently on HPC is often a challenge. For example, basecalling Oxford Nanopore sequencing data using the dorado basecaller often takes quite a bit of tuning to get good performance on HPC, for the following reasons:
1. Duplex mode makes heavy use of random access over thousands of files, resulting in low GPU utilization if the shared filesystem cannot keep up. Being able to monitor GPU utilization would allow detecting and mitigating this issue.
2. Dorado exhibits widely different performance on different GPU hardware, and HPC nodes are often equipped with heterogeneous GPU hardware. When parallelizing dorado jobs, it would be useful to measure the relative performance gaps between different GPU models; this information could be used to fine-tune job GPU requirements. (In SLURM and other cluster managers it is usually possible to specify which GPU hardware you're willing to use for a job, e.g. to exclude very old Nvidia architectures that no longer offer acceptable performance for a particular task; see the sketch below.)
3. Dorado is a heavy user of GPU VRAM, and crashes when it runs out. Monitoring VRAM usage would help users tune dorado parameters to optimize the performance/VRAM trade-off and know which GPU hardware to request from the cluster manager.
These are very common issues when running dorado on HPC (there are tons of issues on dorado's bug tracker; see, e.g., https://github.com/nanoporetech/dorado/issues/68, https://github.com/nanoporetech/dorado/issues/336, https://github.com/nanoporetech/dorado/issues/306). This is just one particular example; I imagine the same basic GPU metrics would be useful for most users running GPU tasks with Nextflow.
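As a concrete illustration of the hardware-selection point (2) above, this is roughly how it is done today via `clusterOptions`; the `a100`/`h100` GRES and feature names are site-specific examples, not something Nextflow defines, and the dorado command and `params.dorado_model` are placeholders:

```nextflow
process basecall {
    executor 'slurm'
    clusterOptions '--gres=gpu:a100:1'      // GRES type name is site-specific
    // Some sites expose GPU models as node features instead:
    // clusterOptions '--gres=gpu:1 --constraint="a100|h100"'

    input:
    path pod5_dir

    script:
    """
    dorado basecaller ${params.dorado_model} $pod5_dir > calls.bam
    """
}
```

Recording which GPU model each task actually ran on would close the loop, making it possible to compare runtimes across hardware and tighten such constraints.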
Suggested implementation
An initial implementation could restrict itself to Nvidia GPUs, since those are overwhelmingly the most important for scientific computing.
Use `nvidia-settings` to record GPU metrics such as utilization and VRAM usage. (A quick Google search turned up this list of ways to programmatically grab GPU metrics: https://unix.stackexchange.com/questions/38560/gpu-usage-monitoring-cuda)
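Until something like this exists in the trace/report machinery, a per-task approximation (a sketch: the 5-second interval, the file name, and `params.dorado_model` are arbitrary placeholder choices) is to sample `nvidia-smi` in the background from the process script and keep the log as a task output:

```nextflow
process basecall {
    accelerator 1

    input:
    path pod5_dir

    output:
    path 'calls.bam'
    path 'gpu_metrics.csv'      // per-task GPU utilization / VRAM samples

    script:
    """
    # Sample GPU metrics every 5 s for the lifetime of the task
    nvidia-smi \\
        --query-gpu=timestamp,name,utilization.gpu,utilization.memory,memory.used,memory.total \\
        --format=csv,noheader \\
        -l 5 > gpu_metrics.csv &
    SMI_PID=\$!

    dorado basecaller ${params.dorado_model} $pod5_dir > calls.bam

    kill \$SMI_PID || true
    """
}
```

A post-processing step could then aggregate these per-task CSVs by GPU model.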
It would be especially useful if, in the report HTML, there were a way to look at all metrics broken down by GPU hardware: perhaps a checkbox list of GPU hardware names, so that as you select or deselect GPU models, the GPU utilization/VRAM plots update accordingly.