gerald2545 closed this issue 4 months ago.
Hi @gerald2545, Thanks for the detailed report.
When you say the data is on GPFS or local NVMe: is that local to the Slurm worker node (some /scratch directory, for example), is it local to the submission node, or are they one and the same?
We have a GPFS /work filesystem accessible from all the nodes of the cluster. By NVMe (sorry if that's not the real name) I mean a local scratch folder only accessible on the GPU node. One thing I didn't mention: all the A100 GPUs are on the same node, with 1024 GB RAM for 128 Slurm threads.
To illustrate, here is the htop view of a dorado job on our GPU node (many processes, whereas only 8 CPUs were booked):
Is this a normal "profile": high load average, many dorado threads (most of them sleeping), 766% CPU for the master dorado process? Thanks again.
Bumping my questions above.
In the meantime, we managed to use 2 A100 80GB GPUs for basecalling a run (a problem in the Slurm configuration had prevented us from doing so before)... but we are still not able to use 2 MIGs (A100 40GB partitions: 2 MIGs booked in sbatch, but only one used by dorado).
thank you for your help
Gérald
Hi there,
Were you able to use 2 MIGs in the end? I have the same issue: I can't use 2 (or more) MIGs with dorado.
Hi @gerald2545, Apologies for not replying earlier - this completely fell off my radar.
I'm pleased to hear that you found how to enable multi-GPU use in Slurm. I'm sure @blanleung would be interested to hear what the issue was.
As for the other questions:
"how to adjust CPU/RAM booking?"
- Currently there are no controls in dorado for CPU / RAM usage.
"Getting 100% VRAM usage"
- The auto batch size calculation activated with --batchsize 0 (on by default) should do a good job based on the system resources available. This is what we recommend for most users who aren't facing stability issues; for users who are, we'd likely suggest lowering the batch size to consume fewer resources. This would reduce overall throughput, but the application would then be more stable on their hardware.
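As a command-line sketch of the two modes described above (the model name, input directory, and the batch size value of 192 are placeholders, not recommendations):

```shell
# Default behaviour: dorado picks a batch size from the available VRAM
# (equivalent to passing --batchsize 0).
dorado basecaller hac pod5_dir/ > calls.bam

# If the auto-selected batch size causes instability or out-of-memory
# errors, force a smaller fixed batch size; this trades throughput
# for stability on constrained hardware.
dorado basecaller --batchsize 192 hac pod5_dir/ > calls.bam
```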
Apologies again for missing this thread.
Would also be interested to see how you got dorado working with multiple GPUs on Slurm. I'm trying to get modified basecalling to run on my university HPC and am encountering the exact same problem as @blanleung: I can allocate multiple MIGs, possibly on separate cards, but can only run dorado on one MIG either way.
@galenptacek
Please see this comment https://github.com/nanoporetech/dorado/issues/812#issuecomment-2114701066 discussing how NVIDIA doesn't support the use case of multiple MIGs.
Regardless of how many MIG devices are created or made available to a container (e.g. with --gres=gpu:3), a single CUDA process (dorado) can only enumerate a single MIG device.
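Given that constraint, one workaround sometimes used on MIG-partitioned nodes (a sketch, not an official dorado feature) is to run one dorado process per MIG device, pinning each process to a single MIG UUID via CUDA_VISIBLE_DEVICES and splitting the pod5 files between the processes. The UUIDs below, the model name, and the input directories are placeholders; real UUIDs can be listed with `nvidia-smi -L`.

```shell
# List the MIG device UUIDs available on the node (requires the NVIDIA driver).
nvidia-smi -L

# Hypothetical example: split the pod5 files into two directories beforehand
# (pod5_part1/, pod5_part2/), then run one dorado process per MIG instance.
CUDA_VISIBLE_DEVICES=MIG-11111111-2222-3333-4444-555555555555 \
  dorado basecaller hac,5mCG_5hmCG pod5_part1/ > part1.bam &
CUDA_VISIBLE_DEVICES=MIG-66666666-7777-8888-9999-000000000000 \
  dorado basecaller hac,5mCG_5hmCG pod5_part2/ > part2.bam &
wait
```

The resulting BAM files can be merged afterwards (e.g. with samtools merge); each process still sees only one MIG, which is consistent with the enumeration limit described above.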
Closing this thread as it's not a dorado issue but a slurm configuration / nvidia driver issue.
As we plan to have a large number of sequencing runs requiring modified basecalling on our PromethION (2 x GV100 GPUs) in the coming months, we are testing a cluster with 4 x A100 80GB GPUs (2 x A100 80GB, 1 x A100 with 2 x 40GB MIG partitions, 1 x A100 with 7 x 10GB MIG partitions) to be able to run the basecalling step offline.
We ran different tests: one on a complete run.
We regularly checked CPU and VRAM usage during the basecalling, which lasted 41 hours. We noticed that
After this first job, we ran several tests on a subset of data (5 pod5 files, 20000 reads) to try to minimise job duration. Results are compiled in the file DoradoBasecallingComparisonForGithub.xlsx,
where we record the sbatch options for each job, the time for jobs to finish, samples/s, and the location of the dataset. We tried different amounts of VRAM, numbers of CPUs, data locations (local or on the network drive), and exporting LD_PRELOAD or not (see https://github.com/nanoporetech/dorado/issues/567).
Some results are difficult to interpret:
Globally, it seems that in our case
Some questions:
Is there documentation anywhere (I didn't find it here or in the ONT community) explaining how dorado works for modification basecalling, the influence of batch and chunk size, and how to set sbatch resources?
It seems that using more GPUs can speed up basecalling, but we have not succeeded in using more than 1 GPU with Slurm for the moment... we are investigating.
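For anyone in the same situation, here is a minimal sbatch sketch for requesting multiple full (non-MIG) GPUs for dorado; the partition name, gres string, and resource amounts are site-specific assumptions, not values from this thread:

```shell
#!/bin/bash
#SBATCH --job-name=dorado_basecall
#SBATCH --partition=gpu            # site-specific partition name
#SBATCH --gres=gpu:a100:2          # two full A100s; two MIG slices would not both be used
#SBATCH --cpus-per-task=16
#SBATCH --mem=128G

# -x cuda:all tells dorado to use every GPU that Slurm exposes to the job.
dorado basecaller -x cuda:all hac pod5_dir/ > calls.bam
```

Note this only helps with full GPUs: as discussed above, a single dorado process can enumerate at most one MIG device no matter how many are granted.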
Thank you for your help
Run environment: