Closed blanleung closed 1 month ago
Hi @blanleung, after checking the source code around this error message, it is entirely based on the NVIDIA CUDA drivers reporting the number of available devices in your instance.
This is probably not an issue with CUDA, Torch, or Dorado and instead likely lies with your Slurm configuration.
May I recommend investigating some Stack Overflow threads such as this potentially useful example.
Some things I'd check would be whether nvidia-smi correctly shows all N GPUs you're requesting, and whether the environment variable CUDA_VISIBLE_DEVICES is set and to what value.
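The two checks above can be scripted and run inside the Slurm allocation. This is a minimal sketch, not part of Dorado: the function name gpu_visibility_report is made up for illustration, and it assumes nvidia-smi is on the PATH of the GPU node (it returns an empty device list otherwise).

```python
import os
import shutil
import subprocess


def gpu_visibility_report():
    """Collect what the current job can see: the env var and the nvidia-smi device list."""
    report = {
        # None means the variable is unset, which is different from an empty string.
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "devices": [],
    }
    if shutil.which("nvidia-smi") is None:
        # Not on a GPU node, or the driver tools are not on PATH.
        return report
    # `nvidia-smi -L` lists each GPU; MIG devices appear on indented lines under their parent GPU.
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=False
    )
    report["devices"] = [
        line.strip() for line in result.stdout.splitlines() if line.strip()
    ]
    return report


if __name__ == "__main__":
    report = gpu_visibility_report()
    print("CUDA_VISIBLE_DEVICES =", report["CUDA_VISIBLE_DEVICES"])
    for line in report["devices"]:
        print(line)
```

Running it once on the login node and once inside the job should make it clear whether Slurm is restricting the devices that CUDA can see.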
I'd also suggest asking your Slurm cluster administrators whether they have any specific policies or cgroups around GPU allocations, or any experience with what could be causing the CUDA drivers to report 1 GPU when N are requested.
Kind regards, Rich
This is on the GPU node with --gres=gpu:2. nvidia-smi correctly shows 1 GPU, 3 MIGs, but only 1 MIG is used.
I will check CUDA_VISIBLE_DEVICES.
Reading the MIG CUDA Device Enumeration documentation:
[!NOTE] MIG supports running CUDA applications by specifying the CUDA device on which the application should be run. With CUDA 11/R450 and CUDA 12/R525, only enumeration of a single MIG instance is supported. In other words, regardless of how many MIG devices are created (or made available to a container), a single CUDA process can only enumerate a single MIG device.
I might be misinterpreting this, but I read it as: regardless of how many MIG devices are created or made available to a container (e.g. with --gres=gpu:3), a single CUDA process (dorado) can only enumerate a single MIG device.
CUDA is limited to use a single CI and will pick the first one available if several of them are visible
So the solution would be to split the data into batches (e.g. 3) and process them with dorado in parallel, requesting --gres=gpu:1 for each process?
Yes, that sounds like a sensible solution. It could be set up with N directories of symbolic links. I believe that dorado now writes sorted BAM outputs, so they should be trivial to merge if you need a single final output.
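The "N directories of symbolic links" idea could look something like the sketch below. It is only an illustration under stated assumptions: the function name split_into_symlink_dirs, the batch_0, batch_1, ... directory names, and the *.pod5 default pattern are all made up here and should be adapted to your data layout.

```python
import os
from pathlib import Path


def split_into_symlink_dirs(input_dir, output_root, n_batches, pattern="*.pod5"):
    """Distribute files matching `pattern` round-robin into n_batches directories of symlinks."""
    files = sorted(Path(input_dir).glob(pattern))
    batch_dirs = []
    for i in range(n_batches):
        # Hypothetical naming scheme: output_root/batch_0, batch_1, ...
        batch_dir = Path(output_root) / f"batch_{i}"
        batch_dir.mkdir(parents=True, exist_ok=True)
        batch_dirs.append(batch_dir)
    for index, src in enumerate(files):
        link = batch_dirs[index % n_batches] / src.name
        # Skip links left over from a previous run (including dangling ones).
        if link.is_symlink() or link.exists():
            continue
        # Absolute target so the link stays valid wherever the batch directory lives.
        os.symlink(src.resolve(), link)
    return batch_dirs


if __name__ == "__main__":
    # Example paths are placeholders, not real locations.
    for batch_dir in split_into_symlink_dirs("reads", "reads_split", 3):
        print(batch_dir)
```

Each batch directory can then be given to its own dorado process in a separate Slurm job requesting --gres=gpu:1, so every process sees exactly one MIG device. The original files are never copied or moved, so the split costs almost no disk space.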
@blanleung, looking back at the nvidia-smi output, you have been allocated 1 physical GPU split into 3 MIG devices.
Your Slurm configuration --gres=gpu:3 is not giving you 3 GPUs.
I'm going to close this ticket as it's not a dorado issue.
Hi all,
My HPC uses multi-instance GPU (MIG). Despite booking several MIGs with Slurm (--gres=gpu:3), it seems Dorado uses only 1. I have tried running dorado with --device cuda:0,1,2 but I get the following error message:
[error] Invalid CUDA device index "1" from device string ,1,2, there are 1 visible CUDA devices.
Similar issue observed in #634
Any way for Dorado to make use of several MIGs? Or is that a feature not implemented in Dorado?
Many thanks!