Can't use multi-instance GPU (MIG)

nanoporetech / dorado

Oxford Nanopore's Basecaller

https://nanoporetech.com/

Can't use multi-instance GPU (MIG) #812

Closed blanleung closed 1 month ago

blanleung commented 1 month ago

Hi all,

My HPC uses multi-instance GPU (MIG). Despite booking several MIG instances with Slurm (--gres=gpu:3), Dorado seems to use only 1. I have tried running dorado with --device cuda:0,1,2, but I get the following error message:

[error] Invalid CUDA device index "1" from device string ,1,2, there are 1 visible CUDA devices.

Similar issue observed in #634

Any way for Dorado to make use of several MIGs? Or is that a feature not implemented in Dorado?

Many thanks!

HalfPhoton commented 1 month ago

Hi @blanleung, having checked the source code around this error message, it is based entirely on the number of available devices that the NVIDIA CUDA drivers report in your instance.

This is probably not an issue with CUDA, Torch, or Dorado; it more likely lies with your Slurm configuration.

May I recommend investigating some Stack Overflow threads, such as this potentially useful example.

Some things I'd check: whether nvidia-smi correctly shows all N GPUs you're requesting, and whether the environment variable CUDA_VISIBLE_DEVICES is set and to what value.

I'd also suggest asking your Slurm cluster administrators whether they have any specific policies or cgroups around GPU allocations, or any experience of what could cause the CUDA drivers to report 1 GPU when N are requested.
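As a rough illustration of the two checks mentioned above, here is a minimal Python sketch to run inside the Slurm allocation. It prints the entries of CUDA_VISIBLE_DEVICES and the device list from `nvidia-smi -L`. The function names are hypothetical helpers made up for this example, and it assumes nvidia-smi is on the PATH of the GPU node.

```python
import os
import shutil
import subprocess


def parse_visible_devices(value):
    """Split a CUDA_VISIBLE_DEVICES string into its entries (empty list if unset)."""
    if not value:
        return []
    return [entry.strip() for entry in value.split(",") if entry.strip()]


def list_devices_with_nvidia_smi():
    """Return the lines printed by `nvidia-smi -L`, or None if the tool is not on PATH."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=False
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    # What the job's environment claims is visible to CUDA.
    visible = parse_visible_devices(os.environ.get("CUDA_VISIBLE_DEVICES"))
    print(f"CUDA_VISIBLE_DEVICES entries ({len(visible)}): {visible}")

    # What the driver tooling lists (physical GPUs and any MIG devices).
    lines = list_devices_with_nvidia_smi()
    if lines is None:
        print("nvidia-smi not found on PATH")
    else:
        for line in lines:
            print(line)
```

Comparing the two outputs should show whether the allocation exposes several physical GPUs or several MIG devices on one GPU.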

Kind regards, Rich

blanleung commented 1 month ago

This is on the GPU node with --gres=gpu:2. nvidia-smi correctly shows 1 GPU, 3 MIGs, but only 1 MIG is used.

I will check CUDA_VISIBLE_DEVICES

HalfPhoton commented 1 month ago

Reading the MIG CUDA Device Enumeration documentation:

[!NOTE] MIG supports running CUDA applications by specifying the CUDA device on which the application should be run. With CUDA 11/R450 and CUDA 12/R525, only enumeration of a single MIG instance is supported. In other words, regardless of how many MIG devices are created (or made available to a container), a single CUDA process can only enumerate a single MIG device.

I might be misinterpreting this, but I read it as: regardless of how many MIG devices are created or made available to a container (e.g. with --gres=gpu:3), a single CUDA process (dorado) can only enumerate a single MIG device.

CUDA is limited to use a single CI and will pick the first one available if several of them are visible
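If that reading is right, using several MIG devices means running one process per device, each pinned through CUDA_VISIBLE_DEVICES. Below is a minimal Python sketch of that idea. The helper names and MIG identifiers are hypothetical placeholders, and whether one job can see several MIG devices at all depends on how the cluster is configured.

```python
import os


def env_for_mig_device(mig_uuid, base_env=None):
    """Return a copy of the environment with CUDA_VISIBLE_DEVICES pinned to one MIG device."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = mig_uuid
    return env


def plan_processes(mig_uuids, commands, base_env=None):
    """Pair each command with an environment pinned to a single MIG device."""
    if len(commands) > len(mig_uuids):
        raise ValueError("more commands than MIG devices")
    return [
        (command, env_for_mig_device(uuid, base_env))
        for command, uuid in zip(commands, mig_uuids)
    ]


if __name__ == "__main__":
    # Placeholder identifiers; real ones would come from `nvidia-smi -L`.
    mig_uuids = ["MIG-aaaa", "MIG-bbbb", "MIG-cccc"]
    # Placeholder commands standing in for one basecalling run per batch.
    commands = [["echo", f"batch_{i}"] for i in range(3)]
    for command, env in plan_processes(mig_uuids, commands):
        print(command, "->", env["CUDA_VISIBLE_DEVICES"])
```

Each (command, env) pair could then be passed to `subprocess.Popen(command, env=env)` so that every process enumerates exactly one MIG device.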

blanleung commented 1 month ago

So the solution would be to split the data into batches (e.g. 3) and process them with dorado in parallel, requesting --gres=gpu:1 for each process?

HalfPhoton commented 1 month ago

That sounds like a sensible solution, yes. It could be set up with N directories of symbolic links. I believe that dorado now writes sorted BAM outputs, so they should be trivial to merge if you need a single monolithic final output.
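A minimal Python sketch of the "N directories of symbolic links" setup might look like the following. The function name, the batch_0, batch_1, ... directory names, and the *.pod5 file pattern are assumptions for illustration; adjust the pattern to whatever input files you actually have.

```python
from pathlib import Path


def split_into_symlink_dirs(input_dir, output_root, n_batches, pattern="*.pod5"):
    """Distribute files from input_dir round-robin into n_batches directories of symlinks."""
    if n_batches < 1:
        raise ValueError("n_batches must be at least 1")
    input_dir = Path(input_dir).resolve()
    output_root = Path(output_root)

    # Sort so that the assignment of files to batches is reproducible.
    files = sorted(input_dir.glob(pattern))

    batch_dirs = []
    for i in range(n_batches):
        batch_dir = output_root / f"batch_{i}"
        batch_dir.mkdir(parents=True, exist_ok=True)
        batch_dirs.append(batch_dir)

    for index, src in enumerate(files):
        link = batch_dirs[index % n_batches] / src.name
        # Skip links that already exist so the function can be re-run safely.
        if not (link.is_symlink() or link.exists()):
            link.symlink_to(src)
    return batch_dirs
```

Calling `split_into_symlink_dirs("pod5_dir", "batches", 3)` would create three directories, each of which can be given to its own job. Round-robin balances the file count, not the file sizes, so batches may still take different amounts of time.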

HalfPhoton commented 1 month ago

@blanleung - Looking back at the nvidia-smi output: you have been allocated 1 physical GPU split into 3 MIG devices. Your Slurm request --gres=gpu:3 is not giving you 3 GPUs.

I'm going to close this ticket as it's not a dorado issue.