nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

Can we be even faster ? :) #580

Closed RxLoutre closed 8 months ago

RxLoutre commented 8 months ago

Hi dorado team!

I am using the latest version of dorado on an NVIDIA A100 card, which is what dorado is supposed to be optimised for.

I would like to ask for some advice (if possible) on how to improve basecalling speed.

Ideally, I would like to always use the SUP models. However, basecalling times still seem rather long.

For a multiplexed test run that generated about 125 million reads (around 46 Gbases), Guppy took 1 day 9 hours and Dorado took 1 day 1 hour to do the same.

I have saved the benchmark stats of each run, which I am sharing here:

| run | s | h:m:s | max_rss | max_vms | max_uss | max_pss | io_in | io_out | mean_load | cpu_time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Guppy | 119791.1566 | 1 day, 9:16:31 | 19093.61 | 95780.96 | 19087.37 | 19088.45 | 710623.28 | 209381.71 | 201.46 | 241332.74 |
| Dorado | 90650.3641 | 1 day, 1:10:50 | 24330.56 | 90354.70 | 24326.39 | 24327.46 | 715974.84 | 46866.53 | 141.22 | 128018.11 |

I'm really no expert on this matter, so correct me if I am wrong, but I see that Guppy had a better (higher) "mean_load" compared to Dorado.

Guppy was run with a samplesheet and asked to demux at the same time using it. With Dorado, I ran the same task in two steps: first pod5 -> BAM, then dorado demux on that BAM with the --emit-fastq option (sketched below). The demux step is not accounted for in the numbers above; I mention it just for information.

Dorado command :

```
dorado basecaller /path/to/model/dna_r10.4.1_e8.2_400bps_sup@v4.3.0 /path/to/data/pod5 > /path/to/output/calls.bam
```
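
For completeness, the demux step (not counted in the timings above) looked roughly like this; the kit name here is just a placeholder for illustration, not our actual kit:

```
# step 2: demultiplex the basecalled BAM, emitting FASTQ per barcode
# (kit name and paths are placeholders)
dorado demux \
    --kit-name SQK-NBD114-24 \
    --emit-fastq \
    --output-dir /path/to/output/demux \
    /path/to/output/calls.bam
```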

I am not using any of the tuning options such as --batchsize or --chunksize, mostly because I don't know how to use them properly to optimize, and also because I assumed our big GPU would be enough.
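
For reference, my understanding is that those knobs would be passed like this; the values below are purely illustrative, not tuned recommendations:

```
# explicit batch/chunk sizes (illustrative values only; dorado
# auto-selects a batch size when -b/--batchsize is omitted, and
# I believe -c 10000 matches the default chunk size)
dorado basecaller \
    -b 1792 \
    -c 10000 \
    /path/to/model/dna_r10.4.1_e8.2_400bps_sup@v4.3.0 \
    /path/to/data/pod5 \
    > /path/to/output/calls.bam
```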

However, I am a bit worried about when we start producing more data (we expect at least double the output mentioned above on our good P2 Solo runs), and I am not sure our applications can afford 2 days of basecalling. Because our big GPU is not connected to the computer controlling MinKNOW, we have to basecall after the experiment is over; live basecalling is unfortunately not possible in our case.

Any recommendations for optimizing basecalling speed with SUP models? Or should I turn my head toward HAC models for runs producing more data?

Best,

Roxane

vellamike commented 8 months ago

Could you report what speed Dorado reported (in samples/s) at the end of your run? 46 Gbases in 25 hours sounds a little bit slow.
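
For context, samples/s can be converted into an approximate base rate. A back-of-envelope sketch, assuming the 5 kHz sampling rate used by v4.3.0 models and ~400 bases/s translocation (i.e. ~12.5 samples per base); the samples/s figure below is hypothetical:

```
# rough samples/s -> Gbases/hour conversion (assumptions as above)
awk 'BEGIN {
    samples_per_sec  = 1.0e7         # hypothetical dorado-reported figure
    samples_per_base = 5000 / 400    # 5 kHz sampling / ~400 bases per second
    printf "%.1f Gbases/hour\n", samples_per_sec / samples_per_base * 3600 / 1e9
}'
# -> 2.9 Gbases/hour
```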

RxLoutre commented 8 months ago

Yes, of course!

Here is the output:

```
[2024-01-12 15:54:24.300] [info] > Creating basecall pipeline
[2024-01-12 16:36:22.614] [info]  - set batch size for cuda:0 to 1792
[2024-01-13 17:04:59.312] [info] > Simplex reads basecalled: 121201153
[2024-01-13 17:04:59.312] [info] > Simplex reads filtered: 4604
[2024-01-13 17:04:59.312] [info] > Basecalled @ Samples/s: 9.735525e+06
[2024-01-13 17:05:13.011] [info] > Finished
```

RxLoutre commented 8 months ago

Here is a thought: this is a run with many, many short reads. Could that be what slows Dorado down? Does the number of reads per pod5 file impact Dorado's speed?

tijyojwad commented 8 months ago

Hi @RxLoutre - thanks for sharing that! Indeed, that could be an issue, since dorado isn't automatically optimized for short reads yet (although we're working on that). Can you share your read-length distribution? We can help you find a better chunk size to speed up your basecalling in the meantime.

RxLoutre commented 8 months ago

Reads range from 20 bp to 10 kbp. The median length is 292 bp and the N50 is 389 bp, so mostly quite short. We should not have too many runs with this setup, but it would still be useful to know how to adjust the chunk size to increase basecalling speed! Best, Roxane

tijyojwad commented 8 months ago

Hi @RxLoutre - you can try running with a chunk size of 5000 (i.e. -c 5000); this should give you some speedup. However, for longer reads, reducing the chunk size can impact read accuracy a bit. Just something to be aware of.
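
Applied to your earlier command, that would look like the line below. The intuition, roughly speaking, is that the signal is processed in fixed-size chunks, so with mostly ~300 bp reads a smaller chunk wastes less compute on padding:

```
# SUP basecalling with a reduced chunk size for short reads
dorado basecaller -c 5000 /path/to/model/dna_r10.4.1_e8.2_400bps_sup@v4.3.0 /path/to/data/pod5 > /path/to/output/calls.bam
```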

RxLoutre commented 8 months ago

Thank you for your feedback! :)

I will give it a try and stay aware of the impact of chunk size depending on read length.

Best,

dhlee3342 commented 5 months ago

Hi @RxLoutre - do you have any updates on your results? I used dorado for modified basecalling of my adaptive sampling data, but it was a bit slow.

Thank you,

tijyojwad commented 5 months ago

Hi @dhlee3342 - can you post more details of your dataset and run in another issue? We can help debug it.

dhlee3342 commented 5 months ago

Hi @tijyojwad, here's my log. It was my first ONT run, so I don't know what throughput to expect. I used four A100s. It was an adaptive sampling run; the pod5 files vary in size.

```
$ dorado basecaller hac,5mCG_5hmCG pod5_pass/barcode01/ > bam_mod/barcode01.bam
```

```
[2024-04-20 21:47:59.057] [info] > Creating basecall pipeline
[2024-04-20 21:58:49.506] [info]  - set batch size for cuda:0 to 8064
[2024-04-20 21:58:49.790] [info]  - set batch size for cuda:1 to 8064
[2024-04-20 21:58:50.061] [info]  - set batch size for cuda:2 to 8064
[2024-04-20 21:58:50.283] [info]  - set batch size for cuda:3 to 8064
[2024-04-21 09:37:24.741] [info] > Simplex reads basecalled: 22697324
[2024-04-21 09:37:24.797] [info] > Simplex reads filtered: 411
[2024-04-21 09:37:24.797] [info] > Basecalled @ Samples/s: 6.490823e+06
[2024-04-21 09:37:30.431] [info] > Finished
Sun Apr 21 09:37:36 EDT 2024
```