nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

Dorado offline on slurm, VRAM usage #634

Closed gerald2545 closed 4 months ago

gerald2545 commented 7 months ago

As we plan to have a large number of sequencing runs requiring modified basecalling on our PromethION (2 x GV100 GPUs) in the coming months, we are testing a cluster with 4 x A100 80GB GPUs (2 x A100 80GB, 1 x A100 split into 2 x 40GB MIG partitions, 1 x A100 split into 7 x 10GB MIG partitions) so that we can run the basecalling step offline.

We ran different tests. First, one on a complete run:

#!/bin/bash
#SBATCH -p gpuq
#SBATCH -J dorado-pod5_5mCG-GS
#SBATCH -o %x%j.out
#SBATCH -e %x%j.err
#SBATCH -t 04-00:00:00
#SBATCH --mem=256G
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:A100:1

/work/dorado/dorado-0.5.3-linux-x64/bin/dorado basecaller -r --verbose \
/work/dorado/models/dna_r10.4.1_e8.2_400bps_sup@v4.1.0/ \
/work/ONT_test-GPU_rebasecalling/20220920_ed1cc922_PC24B191/pod5/ \
--modified-bases 5mCG_5hmCG \
--reference /work/ONT_test-GPU_rebasecalling/References/reference_genomic.fa > result.bam

We regularly checked CPU and VRAM usage during the basecalling, which lasted 41 hours. We noticed that:

After this first job, we ran several tests on a subset of data (5 pod5 files, 20,000 reads) to try to minimize job duration. The results are compiled in the file DoradoBasecallingComparisonForGithub.xlsx,

where we record the sbatch options for each job, the time to finish, samples/s, and the location of the dataset. We varied the amount of VRAM, the number of CPUs, the location of the data (local or on a network drive), and whether LD_PRELOAD was exported or not (see https://github.com/nanoporetech/dorado/issues/567).
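
For illustration, here is a minimal sketch of one of these subset-test submissions (the subset path, time limit and resource values are placeholders that were varied between tests, and the LD_PRELOAD line stands in for the library discussed in #567):

#!/bin/bash
#SBATCH -p gpuq
#SBATCH -J dorado-subset-test
#SBATCH -o %x%j.out
#SBATCH -e %x%j.err
#SBATCH -t 08:00:00
#SBATCH --mem=64G                # varied between tests
#SBATCH --cpus-per-task=8        # varied between tests
#SBATCH --gres=gpu:A100:1        # varied between tests (full A100 or a MIG slice)

# optionally preload the library discussed in issue #567 (placeholder path)
# export LD_PRELOAD=/path/to/preloaded/library.so

/work/dorado/dorado-0.5.3-linux-x64/bin/dorado basecaller -r --verbose \
    /work/dorado/models/dna_r10.4.1_e8.2_400bps_sup@v4.1.0/ \
    /path/to/subset/pod5/ \
    --modified-bases 5mCG_5hmCG \
    --reference /work/ONT_test-GPU_rebasecalling/References/reference_genomic.fa > subset.bam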

Some results are difficult to interpret:

Globally, it seems that in our case:

Some questions:

Is there documentation somewhere (I didn't find any here or in the ONT community) explaining how dorado works for modification basecalling, the influence of batch and chunk size, and how to set sbatch resources?

It seems that using more GPUs can speed up dorado's basecalling, but we have not succeeded in using more than 1 GPU on slurm for the moment... we are investigating.

Thank you for your help

Run environment:

HalfPhoton commented 7 months ago

Hi @gerald2545, Thanks for the detailed report.

When you say the data is on GPFS or local NVMe, is that local to the slurm worker node (some /scratch directory, for example), or is it local to the submission node, or are they one and the same?

gerald2545 commented 7 months ago

We have a GPFS /work accessible from all the nodes of the cluster. By NVMe (sorry if that's not the real name ;) ), I mean a local scratch folder only accessible on the GPU node. One thing I didn't mention: all the A100 GPUs are on the same node, with 1024GB RAM for 128 slurm threads.

To illustrate a dorado job on our GPU node, here is the htop view (many processes, whereas only 8 CPUs are booked): htopGPU01

Is this a normal "profile": high load average, many dorado threads (most of them sleeping), 766% CPU for the master dorado process?

Thanks again
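
For reference, here is how we checked the thread count (a rough sketch, assuming a single dorado process on the node):

# number of threads of the dorado process
ps -o nlwp= -p "$(pgrep -x dorado)"
grep Threads /proc/"$(pgrep -x dorado)"/status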

gerald2545 commented 7 months ago

Bumping my questions above.

In the meantime, we managed to use 2 x A100 80GB GPUs for basecalling a run (a problem in the slurm configuration had prevented us from doing so before)... but we are still not able to use 2 MIG devices (A100 40GB partitions: 2 MIG devices booked in sbatch, but only one used by dorado).
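
For anyone with the same problem, here is a minimal sketch of a two-GPU submission of the kind that now works for us (resource values are illustrative; --device cuda:all makes dorado use every GPU visible to the job):

#!/bin/bash
#SBATCH -p gpuq
#SBATCH -J dorado-2xA100
#SBATCH -o %x%j.out
#SBATCH -e %x%j.err
#SBATCH -t 02-00:00:00
#SBATCH --mem=256G
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:A100:2        # two full (non-MIG) A100 80GB

/work/dorado/dorado-0.5.3-linux-x64/bin/dorado basecaller -r --verbose \
    --device cuda:all \
    /work/dorado/models/dna_r10.4.1_e8.2_400bps_sup@v4.1.0/ \
    /work/ONT_test-GPU_rebasecalling/20220920_ed1cc922_PC24B191/pod5/ \
    --modified-bases 5mCG_5hmCG \
    --reference /work/ONT_test-GPU_rebasecalling/References/reference_genomic.fa > result_2gpu.bam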

thank you for your help

Gérald

blanleung commented 4 months ago

Hi there,

Were you able to use 2 MIG devices in the end? I have the same issue; I can't use 2 MIG devices (or more) with dorado.

HalfPhoton commented 4 months ago

Hi @gerald2545, Apologies for not replying earlier - this completely fell off my radar.

I'm pleased to hear that you worked out how to use multiple GPUs under slurm. I'm sure @blanleung would be interested to hear what the issue was.

As for the other questions:

how to adjust CPU/RAM booking?

  • Currently there are no controls in dorado for CPU / RAM usage.

Getting 100% VRAM usage

  • The auto batch size calculation, activated with --batchsize 0 (on by default), should do a good job based on the system resources available. This is what we recommend for most users who aren't facing stability issues; for those who are, we'd likely suggest lowering the batch size to consume fewer resources. This would reduce overall throughput, but the application would then be more stable on their hardware (a minimal sketch is shown below).
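
As an illustration only (model path taken from your scripts above; the value 64 is an example, not a recommendation):

# default behaviour: automatic batch size selection (equivalent to --batchsize 0)
dorado basecaller /work/dorado/models/dna_r10.4.1_e8.2_400bps_sup@v4.1.0/ /path/to/pod5/ \
    --modified-bases 5mCG_5hmCG > calls.bam

# if the auto-selected batch size is unstable on your hardware, force a smaller fixed value
dorado basecaller /work/dorado/models/dna_r10.4.1_e8.2_400bps_sup@v4.1.0/ /path/to/pod5/ \
    --modified-bases 5mCG_5hmCG --batchsize 64 > calls.bam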

Apologies again for missing this thread.

galenptacek commented 4 months ago

Would also be interested to see how you got dorado working with multiple GPUs on slurm. Trying to get modified basecalling to run on my university HPC and encountering the exact same problem as @blanleung - I can allocate multiple MIGs, possibly on separate cards, but can only run dorado on one MIG either way.

HalfPhoton commented 4 months ago

@galenptacek

Please see this comment https://github.com/nanoporetech/dorado/issues/812#issuecomment-2114701066 discussing how nvidia doesn't support the use case of multiple MIGs.

Regardless of how many MIG devices are created (or made available to a container, e.g. with --gres=gpu:3), a single CUDA process (dorado) can only enumerate a single MIG device.
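
If you need to keep several MIG slices busy, one workaround (standard CUDA/MIG behaviour rather than a dorado feature) is to launch one dorado process per MIG device, each restricted to its own slice via CUDA_VISIBLE_DEVICES. A rough sketch, with placeholder MIG UUIDs taken from nvidia-smi -L and a manual split of the pod5 files:

# list the MIG devices and their UUIDs
nvidia-smi -L

# one dorado process per MIG slice, each seeing only its own device (UUIDs are placeholders)
CUDA_VISIBLE_DEVICES=MIG-xxxxxxxx-1111-1111-1111-111111111111 \
    dorado basecaller /path/to/model /path/to/pod5_part1/ > part1.bam &
CUDA_VISIBLE_DEVICES=MIG-xxxxxxxx-2222-2222-2222-222222222222 \
    dorado basecaller /path/to/model /path/to/pod5_part2/ > part2.bam &
wait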

Closing this thread as it's not a dorado issue but a slurm configuration / nvidia driver issue.