Dorado offline on slurm, VRAM usage

gerald2545 commented 7 months ago

As we plan to have a large amount of sequencing runs requiring Modified basecalling on our promethion (2xGV100 GPU ) in the next months, we are testing a cluster where 4 x A100 80GB are available (2 x A100 80GB, 1 x A100 with 2 x 40 GB partitions , 1 x A100 with 7 x 10GB partitions) to be able to run the basecalling step offline.

We did different tests : one on a complete run

#!/bin/bash
#SBATCH -p gpuq
#SBATCH -J dorado-pod5_5mCG-GS
#SBATCH -o %x%j.out
#SBATCH -e %x%j.err
#SBATCH -t 04-00:00:00
#SBATCH --mem=256G
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:A100:1

/work/dorado/dorado-0.5.3-linux-x64/bin/dorado basecaller -r --verbose \
/work/dorado/models/dna_r10.4.1_e8.2_400bps_sup@v4.1.0/ \
/work/ONT_test-GPU_rebasecalling/20220920_ed1cc922_PC24B191/pod5/ \
--modified-bases 5mCG_5hmCG \
--reference /work/ONT_test-GPU_rebasecalling/References/reference_genomic.fa > result.bam

We regularly checked CPU and VRAM usage during the the basecalling which lasted 41 hours. We noticed that

~50% of the available VRAM was used during the basecalling (is this normal on a A100 GPU, with dedicated VRAM, where no other process can use the VRAM - I saw in another issue/post that you limited to 50% to avoid OOM).
the CPU RAM is only used at the very end of the dorado process, using ~68GB whereas for the first hours it useed ~0,01% of the 256GB RAM booked

after this first job, we did several tests on a subset of data (5 pod5, 20000 reads) to try to minimize job duration. Results are compiled in the file DoradoBasecallingComparisonForGithub.xlsx

where we store sbatch options for the job, time for jobs to finish, samples/s, location of dataset. We tried different VRAM amount, number of CPU, location of data (local or on network drive), export LD-PRELOAD or not (see https://github.com/nanoporetech/dorado/issues/567).

Some results are diffcult to interpret :

especially test numbers 9 and 10 where test 10 (manually setting the batch size) has a inferior samples/s rate but also an inferior job duration , compared to test 9 (maybe due to the small amount reads in the dataset?)
exporting LD-PRELOAD before calling dorado decreased the number of sleeping dorado CPU threads, but did not increase the whole job speed, nor decrease) : cf results of tests 4 and 5

Globally, it seems that in our case

more CPU booked, the better it is,
more VRAM we have, the better it is
the CPU RAM does not need to be adjusted according to CPU/GPU booked, but to be adjusted according to input data
no impact of the data location (GPFS drive or local NVME drive)

Some questions :

We are quite disappointed not to be able to use the entire available VRAM ressource.....I tried to manually set batch size to use ~100% of the VRAM on a subset (5 pod5, 20000 reads)...but did not notice a great improvement...how can it be explained, what could be the bottleneck here?
how to adjust CPU booking?

Is there somewhere (didn't find it here nor in the ONT community) a documentation explaining how dorado works for modification basecalling, influence of batch and chunk size, how to set sbatch ressources.....

It seems using more GPU for dorado can speed up the basecalling, but we do not succeceeded in using more than 1 GPU on slurm for the moment...we are investigating.

Thank you for your help

Run environment:

Dorado version: 0.5.3
Dorado command: /work/dorado/dorado-0.5.3-linux-x64/bin/dorado basecaller -r --verbose \ /work/dorado/models/dna_r10.4.1_e8.2_400bps_sup@v4.1.0/ \ /work/ONT_test-GPU_rebasecalling/20220920_ed1cc922_PC24B191/pod5/ \ --modified-bases 5mCG_5hmCG \ --reference /work/ONT_test-GPU_rebasecalling/References/reference_genomic.fa > result.bam
Operating system: redhat 8 I guess (unale -a : 4.18.0-477.10.1.el8_8.x86_64)
Hardware (CPUs, Memory, GPUs): see above for sbatch options
Source data type (e.g., pod5 or fast5 - please note we always recommend converting to pod5 for optimal basecalling performance): fast5 converted to pod5
Source data location (on device or networked drive - NFS, etc.) : network drive GPFS (~ same results on local NVME drive or GPFS drive obtained on a subset of 5 pod5 files - 20000 reads )
Details about data (flow cell, kit, read lengths, number of reads, total dataset size in MB/GB/TB): Pore type : R10.4.1, Kit type : SQK-LSK114, Total bases : 138.72Gb, Translocation speed : 400bps, Data sampling frequency : 4kHz, N50 : 19.8kb

HalfPhoton commented 7 months ago

Hi @gerald2545, Thanks for the detailed report.

When you say the data is on GPFS or Local NVME - is that local to the slurm working node is some /scratch directory for example or is it local to the submission node or are they one and the same?

gerald2545 commented 7 months ago

we have a gpfs /work accessible from all the nodes of the cluster. For NVME (sorry if t's not the real name ;()), I mean : local scratch folder only accessible on the GPU node. I didn't not mention : All the A100 GPU are on the same node, with 1024GB RAM for 128 slurm threads

to illustrate a dorado job on our GPU node, here is the htop view (many processes whereas only 8 CPU booked) : htopGPU01

is this a normal "profile" : high load average, many dorado threads (most of them are sleeping), 766 CPU% for the master dorado process thanks again

gerald2545 commented 7 months ago

up for my questions

We are quite disappointed not to be able to use the entire available VRAM ressource.....I tried to manually set batch size to use ~100% of the VRAM on a subset (5 pod5, 20000 reads)...but did not notice a great improvement...how can it be explained, what could be the bottleneck here?
how to adjust CPU/RAM booking?

in the meantime, we managed to use 2 GPU A100 80GB for basecalling a run (problem in the slurm conf which prevented us to do so before)....but still not able to use 2 MGI (A100 40Gb partitions : 2 MIG booked in sbatch , but only one used by dorado)

thank you for your help

Gérald

blanleung commented 4 months ago

Hi there,

Were you able to use 2 MIG in the end? I have the same issue, can't use 2 MIG (or more) with dorado

HalfPhoton commented 4 months ago

Hi @gerald2545, Apologies for not replying earlier - this completely fell off my radar.

I'm pleased to hear that you found how to add multi-GPU support in slurm. I'm sure @blanleung would be interested to hear what the issue was.

As for the other questions:

how to adjust CPU/RAM booking?

Currently there are no controls in dorado for CPU / RAM usage Getting 100% VRAM usage

The auto-batch size calculation activated with --batchsize 0 (on by default) should do a good job based on the system resources available. This is what we recommend for most users who aren't facing stability issues and in those cases we'd likely suggest lowering the batch size to consume fewer resources. This would impact overall throughput but the application would then be more stable on their hardware.

Apologies again for missing this thread.

galenptacek commented 4 months ago

Would also be interested to see how you got dorado working with multiple GPUs on slurm. Trying to get modified basecalling to run on my university HPC and encountering the exact same problem as @blanleung - I can allocate multiple MIGs, possibly on separate cards, but can only run dorado on one MIG either way.

HalfPhoton commented 4 months ago

@galenptacek

Please see this comment https://github.com/nanoporetech/dorado/issues/812#issuecomment-2114701066 discussing how nvidia doesn't support the use case of multiple MIGs.

Regardless of how many MIG devices are created (or made available to a container) (e.g with --gres=gpu:3), a single CUDA process (dorado) can only enumerate a single MIG device.

Closing this thread as it's not a dorado issue but a slurm configuration / nvidia driver issue.

nanoporetech / dorado

Dorado offline on slurm, VRAM usage #634

Run environment: