nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

6mA@v2 running much slower than 6mA@v1 #660

Closed samuelmontgomery closed 6 months ago

samuelmontgomery commented 6 months ago

I am rebasecalling some data with 5.3.0, as I originally basecalled with 5.0.0 and hit the known issue with incorrect methylation tags that was fixed in 5.1.0. However, 5.3.0 has introduced the v2 model of 6mA, and I am seeing greatly increased basecalling times with v2 compared to v1. Previously I basecalled my whole dataset in SUP with 6mA in ~15 hours, but after running for 3 hours the rebasecall was only 2% complete, with a predicted runtime of 5+ days.

I have basecalled a smaller subset of reads (~9000 reads), and the samples/s rate is ~60% lower with v2 than with v1 (1.15e5 vs 2.87e5), taking roughly 2.5x as long to basecall the same data.
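For reference, the slowdown figures can be checked directly from the two Samples/s values reported in the logs below (a quick awk sketch; the values are taken verbatim from the run summaries):

```shell
# compare the Samples/s figures from the v1 and v2 runs below
v1=2.865630e5   # 6mA@v1 run
v2=1.154692e5   # 6mA@v2 run
awk -v a="$v1" -v b="$v2" \
    'BEGIN { printf "v2 is %.0f%% slower; v1 is %.1fx faster\n", (1 - b/a) * 100, a/b }'
# prints: v2 is 60% slower; v1 is 2.5x faster
```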

I am running on a dedicated PC with a 13700K, 64GB RAM, RTX4090 with data on NVME m2 storage

This is 6mA@v2

.\dorado.exe basecaller sup,6mA --no-trim --kit-name SQK-RBK114-24 --recursive $input > C:\Nanopore\21FEB24_NPR003\basecalling_test\23feb24_sup_0.5.3.bam

[2024-03-01 11:49:21.677] [info]  - downloading dna_r10.4.1_e8.2_400bps_sup@v4.3.0 with httplib
[2024-03-01 11:49:24.413] [info]  - downloading dna_r10.4.1_e8.2_400bps_sup@v4.3.0_6mA@v2 with httplib
[2024-03-01 11:49:24.831] [info] > Creating basecall pipeline
[2024-03-01 11:49:29.372] [info]  - set batch size for cuda:0 to 832
[2024-03-01 11:49:29.383] [info] Barcode for SQK-RBK114-24
[==============================] 100% [20m:33s<00m:00s]
[2024-03-01 12:10:04.463] [info] > Simplex reads basecalled: 8940
[2024-03-01 12:10:04.463] [info] > Simplex reads filtered: 9
[2024-03-01 12:10:04.463] [info] > Basecalled @ Samples/s: 1.154692e+05
[2024-03-01 12:10:04.463] [info] > 8940 reads demuxed @ classifications/s: 7.238480e+00
[2024-03-01 12:10:04.486] [info] > Finished

This is 6mA@v1

.\dorado.exe basecaller sup,6mA@v1 --no-trim --kit-name SQK-RBK114-24 --recursive $input > C:\Nanopore\21FEB24_NPR003\basecalling_test\23feb24_sup_m6av1.bam
[2024-03-01 15:52:47.372] [info]  - downloading dna_r10.4.1_e8.2_400bps_sup@v4.3.0 with httplib
[2024-03-01 15:52:50.126] [info]  - downloading dna_r10.4.1_e8.2_400bps_sup@v4.3.0_6mA@v1 with httplib
[2024-03-01 15:52:54.555] [info] > Creating basecall pipeline
[2024-03-01 15:52:59.821] [info]  - set batch size for cuda:0 to 832
[2024-03-01 15:52:59.831] [info] Barcode for SQK-RBK114-24
[==============================] 100% [08m:16s<00m:00s]
[2024-03-01 16:01:17.508] [info] > Simplex reads basecalled: 8940
[2024-03-01 16:01:17.508] [info] > Simplex reads filtered: 9
[2024-03-01 16:01:17.508] [info] > Basecalled @ Samples/s: 2.865630e+05
[2024-03-01 16:01:17.508] [info] > 8940 reads demuxed @ classifications/s: 1.796393e+01
[2024-03-01 16:01:17.526] [info] > Finished

Has anyone else observed this performance drop? It pushes my total basecalling time from overnight to a whole week per run, which obviously isn't ideal.

HalfPhoton commented 6 months ago

Hi @samuelmontgomery,

Are you able to re-benchmark with a slightly reduced batch size `--batchsize 768` to see if this improves performance?

Kind regards,
Rich
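For reference, the suggested re-run is the same command as above with the batch size overridden (the output filename here is illustrative, not from the original thread):

```shell
# same v2 benchmark, but with an explicit reduced batch size
.\dorado.exe basecaller sup,6mA --batchsize 768 --no-trim --kit-name SQK-RBK114-24 --recursive $input > C:\Nanopore\21FEB24_NPR003\basecalling_test\23feb24_sup_b768.bam
```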

samuelmontgomery commented 6 months ago

@HalfPhoton well that was incredibly fast - it finished the whole test subset in 29 seconds! Would it be possible to include something in the documentation about optimisation for non-A100 GPUs?

I am interested in running both the 6mA and 5mC_5hmC models at once, but this results in a significantly longer runtime (presumably due to hitting the VRAM limit with the larger models). Would reducing the batch size again help here as well?

Thanks!
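For reference, dorado accepts multiple mod models as a comma-separated list; a combined run might look like the sketch below (the batch size is just a starting point to experiment with, and the output path is illustrative):

```shell
# both mod models in a single pass; --batchsize 768 is an untuned
# starting value, not a recommendation
.\dorado.exe basecaller sup,6mA,5mC_5hmC --batchsize 768 --no-trim --kit-name SQK-RBK114-24 --recursive $input > C:\Nanopore\21FEB24_NPR003\basecalling_test\23feb24_sup_bothmods.bam
```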

HalfPhoton commented 6 months ago

@samuelmontgomery, we're looking into generating more comprehensive documentation with much more detail than the README. This will include suggestions for performance optimisations on various GPUs.

As for improving performance while basecalling both mods, I would suggest experimenting with your test set. We're continuously working on the batch size algorithm to deliver the best overall performance and stability, but this is a tricky task given the variety of systems and models.
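One way to run that experiment is a small PowerShell sweep over candidate batch sizes on the test subset, timing each run (a sketch only; the batch size values and output paths are illustrative, and the fastest stable setting will vary by GPU and model combination):

```shell
# PowerShell sketch: time the same test subset at several batch sizes
foreach ($b in 832, 768, 704, 640) {
    $t = Measure-Command {
        .\dorado.exe basecaller sup,6mA,5mC_5hmC --batchsize $b --no-trim `
            --kit-name SQK-RBK114-24 --recursive $input > "test_b$b.bam"
    }
    Write-Host "batchsize $b took $($t.TotalSeconds) s"
}
```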