nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

Drastically reduced basecalling performance with Dorado using RTX4090 #724

Closed samuelmontgomery closed 4 months ago

samuelmontgomery commented 6 months ago

Hi, I had optimised the batch size parameter (as noted in #660) using a test subset of my data, which gave drastic improvements in 0.5.3 (Samples/s: 1.034746e+05 vs Samples/s: 3.626776e+06)

I have updated to dorado v0.6.0 and, running the same dataset, I am seeing a drastic reduction in basecalling speed

Run on 0.5.3:

\dorado.exe basecaller sup,6mA,5mC_5hmC --no-trim --kit-name SQK-RBK114-24 --recursive pod5 > basecalled.bam

[2024-03-01 16:32:56.989] [info] - downloading dna_r10.4.1_e8.2_400bps_sup@v4.3.0 with httplib
[2024-03-01 16:32:59.761] [info] - downloading dna_r10.4.1_e8.2_400bps_sup@v4.3.0_6mA@v2 with httplib
[2024-03-01 16:33:00.250] [info] - downloading dna_r10.4.1_e8.2_400bps_sup@v4.3.0_5mC_5hmC@v1 with httplib
[2024-03-01 16:33:01.129] [info] > Creating basecall pipeline
[2024-03-01 16:33:05.692] [info] - set batch size for cuda:0 to 832
[2024-03-01 16:33:05.701] [info] Barcode for SQK-RBK114-24
[==============================] 100% [22m:56s<00m:00s]
[2024-03-01 16:56:03.948] [info] > Simplex reads basecalled: 8940
[2024-03-01 16:56:03.949] [info] > Simplex reads filtered: 9
[2024-03-01 16:56:03.949] [info] > Basecalled @ Samples/s: 1.034746e+05
[2024-03-01 16:56:03.949] [info] > 8940 reads demuxed @ classifications/s: 6.486566e+00
[2024-03-01 16:56:03.997] [info] > Finished

Run on 0.6.0:

[2024-04-04 13:17:52.838] [info] Running: "basecaller" "sup,6mA,5mC_5hmC" "--no-trim" "--kit-name" "SQK-RBK114-24" "--recursive" "C:\Nanopore\21FEB24_NPR003\no_sample\20240221_1554_MN42052_FAW43460_c4302e0a\pod5_pass"
[2024-04-04 13:17:53.015] [info] - downloading dna_r10.4.1_e8.2_400bps_sup@v4.3.0 with httplib
[2024-04-04 13:17:55.902] [info] - downloading dna_r10.4.1_e8.2_400bps_sup@v4.3.0_6mA@v2 with httplib
[2024-04-04 13:17:56.335] [info] - downloading dna_r10.4.1_e8.2_400bps_sup@v4.3.0_5mC_5hmC@v1 with httplib
[2024-04-04 13:17:56.767] [info] > Creating basecall pipeline
[2024-04-04 13:18:07.929] [info] cuda:0 using chunk size 9996, batch size 896
[2024-04-04 13:18:08.732] [info] cuda:0 using chunk size 4998, batch size 1024
[=========> ] 30% [22m:44s<53m:04s] Basecalling

This is not improved by manually setting the batch size to my previously optimised value, where the same dataset took 1m39s to complete. Running with --verbose, 0.6.0 seems to run two different models, or to split GPU usage into two?

[2024-04-04 14:03:04.091] [info] - downloading dna_r10.4.1_e8.2_400bps_sup@v4.3.0 with httplib
[2024-04-04 14:03:06.422] [info] - downloading dna_r10.4.1_e8.2_400bps_sup@v4.3.0_6mA@v2 with httplib
[2024-04-04 14:03:06.854] [info] - downloading dna_r10.4.1_e8.2_400bps_sup@v4.3.0_5mC_5hmC@v1 with httplib
[2024-04-04 14:03:07.315] [info] > Creating basecall pipeline
[2024-04-04 14:03:10.185] [debug] cuda:0 memory available: 22.74GB
[2024-04-04 14:03:10.185] [debug] cuda:0 memory limit 21.74GB
[2024-04-04 14:03:10.185] [debug] cuda:0 maximum safe estimated batch size at chunk size 9996 is 896
[2024-04-04 14:03:10.186] [debug] cuda:0 maximum safe estimated batch size at chunk size 4998 is 1792
[2024-04-04 14:03:10.186] [debug] Auto batchsize cuda:0: testing up to 1792 in steps of 64
[2024-04-04 14:03:10.241] [debug] Auto batchsize cuda:0: 64, time per chunk 0.369168 ms
[2024-04-04 14:03:10.328] [debug] Auto batchsize cuda:0: 128, time per chunk 0.289856 ms
[2024-04-04 14:03:10.447] [debug] Auto batchsize cuda:0: 192, time per chunk 0.267727 ms
[2024-04-04 14:03:10.558] [debug] Auto batchsize cuda:0: 256, time per chunk 0.167559 ms
[2024-04-04 14:03:10.707] [debug] Auto batchsize cuda:0: 320, time per chunk 0.187293 ms
[2024-04-04 14:03:10.863] [debug] Auto batchsize cuda:0: 384, time per chunk 0.159005 ms
[2024-04-04 14:03:11.031] [debug] Auto batchsize cuda:0: 448, time per chunk 0.154523 ms
[2024-04-04 14:03:11.213] [debug] Auto batchsize cuda:0: 512, time per chunk 0.141154 ms
[2024-04-04 14:03:11.428] [debug] Auto batchsize cuda:0: 576, time per chunk 0.148990 ms
[2024-04-04 14:03:11.652] [debug] Auto batchsize cuda:0: 640, time per chunk 0.139659 ms
[2024-04-04 14:03:11.901] [debug] Auto batchsize cuda:0: 704, time per chunk 0.142097 ms
[2024-04-04 14:03:12.164] [debug] Auto batchsize cuda:0: 768, time per chunk 0.137096 ms
[2024-04-04 14:03:12.435] [debug] Auto batchsize cuda:0: 832, time per chunk 0.126542 ms
[2024-04-04 14:03:12.716] [debug] Auto batchsize cuda:0: 896, time per chunk 0.121982 ms
[2024-04-04 14:03:13.004] [debug] Auto batchsize cuda:0: 960, time per chunk 0.116803 ms
[2024-04-04 14:03:13.311] [debug] Auto batchsize cuda:0: 1024, time per chunk 0.114844 ms
[2024-04-04 14:03:13.702] [debug] Auto batchsize cuda:0: 1088, time per chunk 0.144568 ms
[2024-04-04 14:03:14.105] [debug] Auto batchsize cuda:0: 1152, time per chunk 0.137991 ms
[2024-04-04 14:03:14.524] [debug] Auto batchsize cuda:0: 1216, time per chunk 0.135407 ms
[2024-04-04 14:03:14.950] [debug] Auto batchsize cuda:0: 1280, time per chunk 0.130262 ms
[2024-04-04 14:03:15.390] [debug] Auto batchsize cuda:0: 1344, time per chunk 0.126361 ms
[2024-04-04 14:03:15.839] [debug] Auto batchsize cuda:0: 1408, time per chunk 0.123622 ms
[2024-04-04 14:03:16.300] [debug] Auto batchsize cuda:0: 1472, time per chunk 0.120850 ms
[2024-04-04 14:03:16.772] [debug] Auto batchsize cuda:0: 1536, time per chunk 0.118133 ms
[2024-04-04 14:03:17.320] [debug] Auto batchsize cuda:0: 1600, time per chunk 0.135572 ms
[2024-04-04 14:03:17.881] [debug] Auto batchsize cuda:0: 1664, time per chunk 0.131735 ms
[2024-04-04 14:03:18.428] [debug] Auto batchsize cuda:0: 1728, time per chunk 0.121474 ms
[2024-04-04 14:03:18.985] [debug] Auto batchsize cuda:0: 1792, time per chunk 0.118571 ms
[2024-04-04 14:03:19.012] [debug] Largest batch size for cuda:0: 1024, time per chunk 0.114844 ms
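The sweep in this log shows the selection logic: candidate batch sizes are benchmarked in steps of 64 up to the memory-safe maximum, and the candidate with the lowest measured time per chunk wins. A minimal sketch of that final selection step (the function name is mine, not dorado's; the timings are copied from the log):

```python
def pick_batch_size(timings):
    """Return the batch size with the lowest measured time per chunk.

    timings: dict mapping batch size -> time per chunk in ms.
    """
    return min(timings, key=timings.get)

# A few of the measurements from the debug log above (ms per chunk).
timings = {704: 0.142097, 832: 0.126542, 896: 0.121982,
           960: 0.116803, 1024: 0.114844, 1088: 0.144568,
           1792: 0.118571}
print(pick_batch_size(timings))  # -> 1024, matching "Largest batch size for cuda:0: 1024"
```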

[2024-04-04 14:03:19.013] [info] cuda:0 using chunk size 9996, batch size 896
[2024-04-04 14:03:19.013] [debug] cuda:0 Model memory 15.29GB
[2024-04-04 14:03:19.013] [debug] cuda:0 Decode memory 6.32GB
[2024-04-04 14:03:19.863] [info] cuda:0 using chunk size 4998, batch size 1024
[2024-04-04 14:03:19.864] [debug] cuda:0 Model memory 8.73GB
[2024-04-04 14:03:19.864] [debug] cuda:0 Decode memory 3.61GB
[2024-04-04 14:03:51.831] [debug] - adjusted chunk size to match model stride: 10000 -> 9996
[2024-04-04 14:03:51.841] [debug] Creating barcoding info for kit: SQK-RBK114-24
[2024-04-04 14:03:51.841] [debug] Barcode for SQK-RBK114-24
[2024-04-04 14:03:51.841] [debug] - adjusted overlap to match model stride: 500 -> 498
[2024-04-04 14:03:51.842] [debug] BasecallerNode chunk size 9996
[2024-04-04 14:03:51.842] [debug] BasecallerNode chunk size 4998
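The two "adjusted ... to match model stride" entries in this log are simple round-down arithmetic: from 10000 -> 9996 and 500 -> 498, the stride of these models is evidently 6. A quick check (the helper name is mine):

```python
def round_down_to_stride(value, stride=6):
    # Chunk size and overlap must be a multiple of the model stride,
    # so they get rounded down to the nearest multiple.
    return (value // stride) * stride

print(round_down_to_stride(10000))  # -> 9996 (adjusted chunk size)
print(round_down_to_stride(500))    # -> 498  (adjusted overlap)
```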

I've noticed a difference in how it sets the batch size automatically (and it seems to be changing the chunk size? Won't that result in different basecalls? I thought it was tied to hardware), but it is only ~30% complete in the same run time

I know the changelog mentions optimisations for running on Apple M chips, but these seem to come at a serious loss of performance for RTX GPUs (i.e. the PCs used to run the P2 Solo)

tijyojwad commented 6 months ago

Hi @samuelmontgomery - the two chunk sizes are there so dorado can handle shorter reads in a more performant manner, since running shorter reads with the big chunks led to under-utilization of the GPU.
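In other words, each read goes to whichever of the two pipelines fits it better, so a short read no longer leaves most of a large chunk empty. A toy illustration of that idea (the dispatch rule and names here are my guess at the behaviour, not dorado's actual code):

```python
def choose_chunk_size(read_len_samples, small=4998, large=9996):
    # Reads that fit inside a small chunk use the small-chunk pipeline;
    # everything else is chunked at the large size. Purely illustrative.
    return small if read_len_samples <= small else large

print(choose_chunk_size(3000))    # short read -> 4998
print(choose_chunk_size(25000))   # long read  -> 9996
```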

It would help to understand the read distribution in your dataset to see what's going on. Could you share that?

samuelmontgomery commented 6 months ago

Hi @tijyojwad - the subset of reads I am running has an N50 of 2.75 kb, but it has a lot of shorter reads (as it's just the first ~10,000 reads from the run)

[image: read length distribution]

I find that with the default batch size selected for the GPU (even with 0.5.3), the GPU reports 100% load but doesn't seem to actually utilise its capacity, staying at ambient temperature (~30C). On 0.5.3, when selecting a smaller batch size, it also runs at 100% load but actually seems to be working, with the temperature at ~65C during basecalling

ymcki commented 5 months ago

I am not seeing a drastic reduction in speed compared to 0.5.3 and 0.4.3 using my 4x L40S, which has a similar architecture to the 4090.

But then I am only running the v4.3.0 5mCG_5hmCG@v1 model and no 6mA.

tijyojwad commented 5 months ago

Hi @samuelmontgomery - sorry for the late reply! I was wondering if you could do this experiment without mod basecalling? Trying to work out whether it's a canonical basecalling speed issue or a mod calling one.

samuelmontgomery commented 5 months ago

thanks @tijyojwad

I have repeated the set of reads without modifications (just sup@v4.3.0) on both 0.6.0 and 0.5.3, with --batchsize set to 704, with auto batch size, and with the 5mC_5hmC and 6mA modifications run individually and together

0.6.0 ranges from between ~2x slower to ~87x slower than 0.5.3

0.5.3 (batch size 704)

C:\dorado_0.5.3\bin\dorado.exe basecaller sup --no-trim --kit-name SQK-RBK114-24 --recursive --batchsize 704 .\pod5_pass > .\batch_0.5.3_sup.bam

[2024-04-19 11:10:08.141] [info] > Simplex reads basecalled: 8940
[2024-04-19 11:10:08.141] [info] > Simplex reads filtered: 9
[2024-04-19 11:10:08.141] [info] > Basecalled @ Samples/s: 7.909272e+06
[2024-04-19 11:10:08.141] [info] > 8940 reads demuxed @ classifications/s: 4.958128e+02
[2024-04-19 11:10:08.180] [info] > Finished

0.6.0 (batch size 704)

C:\dorado_0.6.0\bin\dorado.exe basecaller sup --no-trim --kit-name SQK-RBK114-24 --recursive --batchsize 704 .\pod5_pass > .\batch_0.6_sup.bam

[2024-04-19 11:14:08.482] [info] > Simplex reads basecalled: 8926
[2024-04-19 11:14:08.482] [info] > Simplex reads filtered: 23
[2024-04-19 11:14:08.482] [info] > Basecalled @ Samples/s: 3.970008e+06
[2024-04-19 11:14:08.482] [info] > 9100 reads demuxed @ classifications/s: 3.073078e+02
[2024-04-19 11:14:08.498] [info] > Finished

As you can see, 0.6.0 is ~2x slower than 0.5.3 with batch size 704

0.5.3 (auto batch)

C:\dorado_0.5.3\bin\dorado.exe basecaller sup --no-trim --kit-name SQK-RBK114-24 --recursive .\pod5_pass > .\batch_0.5.3_sup.bam

[2024-04-19 11:18:17.036] [info] > Simplex reads basecalled: 8940
[2024-04-19 11:18:17.036] [info] > Simplex reads filtered: 9
[2024-04-19 11:18:17.036] [info] > Basecalled @ Samples/s: 8.650497e+06
[2024-04-19 11:18:17.036] [info] > 8940 reads demuxed @ classifications/s: 5.422783e+02
[2024-04-19 11:18:17.068] [info] > Finished

0.6.0 (auto batch)

C:\dorado\bin\dorado.exe basecaller sup --no-trim --kit-name SQK-RBK114-24 --recursive .\pod5_pass > .\batch_0.6_sup.bam

[2024-04-19 11:22:58.855] [info] > Simplex reads basecalled: 8926
[2024-04-19 11:22:58.855] [info] > Simplex reads filtered: 23
[2024-04-19 11:22:58.856] [info] > Basecalled @ Samples/s: 9.344314e+05
[2024-04-19 11:22:58.856] [info] > 9100 reads demuxed @ classifications/s: 7.233187e+01
[2024-04-19 11:22:58.873] [info] > Finished

As you can see, this time 0.6.0 was ~10x slower than 0.5.3

0.5.3 (5mC_5hmC, batch size 704)

C:\dorado_0.5.3\bin\dorado.exe basecaller sup,5mC_5hmC --no-trim --kit-name SQK-RBK114-24 --recursive --batchsize 704 .\pod5_pass > .\batch_0.5.3_5mC.bam

[==============================] 100% [00m:26s<00m:00s]
[2024-04-19 11:32:00.505] [info] > Simplex reads basecalled: 8940
[2024-04-19 11:32:00.505] [info] > Simplex reads filtered: 9
[2024-04-19 11:32:00.505] [info] > Basecalled @ Samples/s: 5.170664e+06
[2024-04-19 11:32:00.505] [info] > 8940 reads demuxed @ classifications/s: 3.241362e+02
[2024-04-19 11:32:00.541] [info] > Finished

0.6.0 (5mC_5hmC, batch size 704)

C:\dorado\bin\dorado.exe basecaller sup,5mC_5hmC --no-trim --kit-name SQK-RBK114-24 --recursive --batchsize 704 .\pod5_pass > .\batch_0.6_5mC.bam

[2024-04-19 11:49:57.575] [info] > Simplex reads basecalled: 8926
[2024-04-19 11:49:57.575] [info] > Simplex reads filtered: 23
[2024-04-19 11:49:57.576] [info] > Basecalled @ Samples/s: 1.218453e+05
[2024-04-19 11:49:57.576] [info] > 9100 reads demuxed @ classifications/s: 9.431723e+00
[2024-04-19 11:49:57.598] [info] > Finished

Again, 0.6.0 is ~42x slower than 0.5.3 with the 5mC_5hmC modification

0.5.3 (6mA, batch size 704)

C:\dorado_0.5.3\bin\dorado.exe basecaller sup,6mA --no-trim --kit-name SQK-RBK114-24 --recursive --batchsize 704 .\pod5_pass > .\batch_0.5.3_6mA.bam

[2024-04-19 12:55:58.322] [info] > Simplex reads basecalled: 8940
[2024-04-19 12:55:58.322] [info] > Simplex reads filtered: 9
[2024-04-19 12:55:58.323] [info] > Basecalled @ Samples/s: 4.300856e+06
[2024-04-19 12:55:58.323] [info] > 8940 reads demuxed @ classifications/s: 2.696101e+02
[2024-04-19 12:55:58.371] [info] > Finished

0.6.0 (6mA, batch size 704)

C:\dorado\bin\dorado.exe basecaller sup,6mA --no-trim --kit-name SQK-RBK114-24 --recursive --batchsize 704 .\pod5_pass > ./batch_0.6_6mA.bam

[2024-04-19 13:25:14.644] [info] > Simplex reads basecalled: 8926
[2024-04-19 13:25:14.644] [info] > Simplex reads filtered: 23
[2024-04-19 13:25:14.644] [info] > Basecalled @ Samples/s: 7.922124e+04
[2024-04-19 13:25:14.644] [info] > 9100 reads demuxed @ classifications/s: 6.132307e+00
[2024-04-19 13:25:14.665] [info] > Finished

With the 6mA modification, 0.6.0 is ~54x slower than 0.5.3

0.5.3 (5mC_5hmC, 6mA, batch size 704)

C:\dorado_0.5.3\bin\dorado.exe basecaller sup,6mA,5mC_5hmC --no-trim --kit-name SQK-RBK114-24 --recursive --batchsize 704 .\pod5_pass > .\batch_0.5.3_6mA_5mC.bam

[2024-04-19 13:28:17.593] [info] > Simplex reads basecalled: 8940
[2024-04-19 13:28:17.593] [info] > Simplex reads filtered: 9
[2024-04-19 13:28:17.593] [info] > Basecalled @ Samples/s: 3.546153e+06
[2024-04-19 13:28:17.593] [info] > 8940 reads demuxed @ classifications/s: 2.222996e+02
[2024-04-19 13:28:17.632] [info] > Finished

0.6.0 (5mC_5hmC, 6mA, batch size 704)

C:\dorado\bin\dorado.exe basecaller sup,6mA,5mC_5hmC --no-trim --kit-name SQK-RBK114-24 --recursive --batchsize 704 .\pod5_pass > .\batch_0.6_6mA_5mC.bam

[2024-04-19 14:19:20.090] [info] > Simplex reads basecalled: 8926
[2024-04-19 14:19:20.090] [info] > Simplex reads filtered: 23
[2024-04-19 14:19:20.090] [info] > Basecalled @ Samples/s: 4.071408e+04
[2024-04-19 14:19:20.090] [info] > 9100 reads demuxed @ classifications/s: 3.151570e+00
[2024-04-19 14:19:20.115] [info] > Finished

Again, with both the 5mC_5hmC & 6mA modifications, 0.6.0 is ~87x slower than 0.5.3
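The slowdown factors quoted in these comparisons follow directly from the Samples/s figures in the logs; a quick sanity check:

```python
# (0.5.3 Samples/s, 0.6.0 Samples/s) copied from the logs above.
runs = {
    "sup, batch 704":              (7.909272e6, 3.970008e6),
    "sup, auto batch":             (8.650497e6, 9.344314e5),
    "sup,5mC_5hmC, batch 704":     (5.170664e6, 1.218453e5),
    "sup,6mA, batch 704":          (4.300856e6, 7.922124e4),
    "sup,6mA,5mC_5hmC, batch 704": (3.546153e6, 4.071408e4),
}
for name, (v053, v060) in runs.items():
    # Ratio of throughputs = how many times slower 0.6.0 was.
    print(f"{name}: 0.6.0 is ~{v053 / v060:.0f}x slower")
```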

tijyojwad commented 5 months ago

Hi @samuelmontgomery - thank you for posting the response! I can see that your dataset is pretty small - only runs for about 26s. In our experience that small a dataset can show quite a bit of variability from run to run. Do you have a longer one to test with, say around 50k reads?

I'm also curious to see what batch size the auto batch size was determining and whether that was different between 0.5.3 and 0.6.0 (I don't see it in the logs).

samuelmontgomery commented 5 months ago

thanks @tijyojwad - will run a dataset of ~50k reads with some of these tests

auto batch size for 0.5.3 is 832, auto batch size for 0.6.0 is:

samuelmontgomery commented 5 months ago

0.5.3 sup only - batch size 704

C:\dorado_0.5.3\bin\dorado.exe basecaller sup --no-trim --kit-name SQK-RBK114-24 --recursive --batchsize 704 .\pod5_pass > .\batch_0.5.3_sup.bam

[2024-04-22 12:52:20.101] [info] > Simplex reads basecalled: 39366
[2024-04-22 12:52:20.101] [info] > Basecalled @ Samples/s: 9.669210e+06
[2024-04-22 12:52:20.101] [info] > 39366 reads demuxed @ classifications/s: 1.069586e+02
[2024-04-22 12:52:20.134] [info] > Finished

0.6.0 sup only - batch size 704

C:\dorado\bin\dorado.exe basecaller sup --no-trim --kit-name SQK-RBK114-24 --recursive --batchsize 704 .\pod5_pass > .\batch_0.6.0_sup.bam

[2024-04-22 12:59:14.949] [info] > Simplex reads basecalled: 39362
[2024-04-22 12:59:14.950] [info] > Simplex reads filtered: 4
[2024-04-22 12:59:14.950] [info] > Basecalled @ Samples/s: 9.066649e+06
[2024-04-22 12:59:14.950] [info] > 39722 reads demuxed @ classifications/s: 1.253447e+02
[2024-04-22 12:59:14.972] [info] > Finished

0.5.3 sup only - batch size auto (batch size 896)

C:\dorado_0.5.3\bin\dorado.exe basecaller sup --no-trim --kit-name SQK-RBK114-24 --recursive .\pod5_pass > .\batch_0.5.3_sup_auto.bam

[2024-04-22 13:06:01.835] [info] > Simplex reads basecalled: 39366
[2024-04-22 13:06:01.835] [info] > Basecalled @ Samples/s: 1.056054e+07
[2024-04-22 13:06:01.835] [info] > 39366 reads demuxed @ classifications/s: 1.168183e+02
[2024-04-22 13:06:01.882] [info] > Finished

0.6.0 sup only - batch size auto (896, 1792)

C:\dorado\bin\dorado.exe basecaller sup --no-trim --kit-name SQK-RBK114-24 --recursive .\pod5_pass > .\batch_0.6.0_sup_auto.bam

[2024-04-22 13:11:48.247] [info] > Simplex reads basecalled: 39362
[2024-04-22 13:11:48.247] [info] > Simplex reads filtered: 4
[2024-04-22 13:11:48.247] [info] > Basecalled @ Samples/s: 1.039962e+07
[2024-04-22 13:11:48.248] [info] > 39725 reads demuxed @ classifications/s: 1.437837e+02
[2024-04-22 13:11:48.273] [info] > Finished

0.5.3 sup, 5mC_5hmC - batch size auto (896)

C:\dorado_0.5.3\bin\dorado.exe basecaller sup,5mC_5hmC --no-trim --kit-name SQK-RBK114-24 --recursive .\pod5_pass\ > .\batch_0.5.3_5mC_auto.bam

[2024-04-22 13:37:27.759] [info] > Simplex reads basecalled: 39366
[2024-04-22 13:37:27.759] [info] > Basecalled @ Samples/s: 3.430220e+06
[2024-04-22 13:37:27.759] [info] > 39366 reads demuxed @ classifications/s: 3.794430e+01
[2024-04-22 13:37:27.820] [info] > Finished

0.5.3 sup, 5mC_5hmC - batch size 704

C:\dorado_0.5.3\bin\dorado.exe basecaller sup,5mC_5hmC --no-trim --kit-name SQK-RBK114-24 --recursive --batchsize 704 .\pod5_pass\ > .\batch_0.5.3_5mC.bam

[====================] 100% [10m:37s<00m:00s] Basecalling
[2024-04-22 14:17:48.206] [info] > Simplex reads basecalled: 39366
[2024-04-22 14:17:48.206] [info] > Basecalled @ Samples/s: 5.569421e+06
[2024-04-22 14:17:48.206] [info] > 39366 reads demuxed @ classifications/s: 6.160766e+01
[2024-04-22 14:17:48.279] [info] > Finished

0.6.0 sup, 5mC_5hmC - batch size auto (896, 1024)

C:\dorado\bin\dorado.exe basecaller sup,5mC_5hmC --no-trim --kit-name SQK-RBK114-24 --recursive .\pod5_pass\ > .\batch_0.6.0_5mC_auto.bam

[2024-04-22 14:26:37.269] [info] > Creating basecall pipeline
[2024-04-22 14:26:49.450] [info] cuda:0 using chunk size 9996, batch size 896
[2024-04-22 14:26:50.342] [info] cuda:0 using chunk size 4998, batch size 1024
[> ] 2% [12m:23s<10h:06m:47s] Basecalling

0.6.0 sup,5mC_5hmC is only 2% complete in the time it took 0.5.3 sup,5mC_5hmC to finish basecalling, but the two versions were equivalent for canonical basecalling alone

samuelmontgomery commented 5 months ago

Looking at 6mA, as this is where I previously found performance was reduced from 6mA@v1 to 6mA@v2 as well (in #660)

6mA 0.5.3 batch size auto

C:\dorado_0.5.3\bin\dorado.exe basecaller sup,6mA --no-trim --kit-name SQK-RBK114-24 --recursive .\pod5_pass\ > .\batch_0.5.3_6mA_auto.bam

[2024-04-22 15:51:10.710] [info] - downloading dna_r10.4.1_e8.2_400bps_sup@v4.3.0 with httplib
[2024-04-22 15:51:13.240] [info] - downloading dna_r10.4.1_e8.2_400bps_sup@v4.3.0_6mA@v2 with httplib
[2024-04-22 15:51:14.246] [info] > Creating basecall pipeline
[2024-04-22 15:51:20.298] [info] - set batch size for cuda:0 to 832
[2024-04-22 15:51:20.305] [info] Barcode for SQK-RBK114-24
[=> ] 9% [10m:00s<01h:41m:10s]

Only 9% completed after 10 minutes using 0.5.3 with auto batch size

6mA 0.5.3 batch size 704

C:\dorado_0.5.3\bin\dorado.exe basecaller sup,6mA --no-trim --kit-name SQK-RBK114-24 --recursive --batchsize 704 .\pod5_pass\ > .\batch_0.5.3_6mA.bam

[2024-04-22 16:02:11.437] [info] - downloading dna_r10.4.1_e8.2_400bps_sup@v4.3.0 with httplib
[2024-04-22 16:02:14.547] [info] - downloading dna_r10.4.1_e8.2_400bps_sup@v4.3.0_6mA@v2 with httplib
[2024-04-22 16:02:15.089] [info] > Creating basecall pipeline
[2024-04-22 16:02:18.238] [info] Barcode for SQK-RBK114-24
[==============================] 100% [10m:55s<00m:00s]
[2024-04-22 16:13:15.040] [info] > Simplex reads basecalled: 39366
[2024-04-22 16:13:15.040] [info] > Basecalled @ Samples/s: 5.418454e+06
[2024-04-22 16:13:15.040] [info] > 39366 reads demuxed @ classifications/s: 5.993770e+01
[2024-04-22 16:13:15.104] [info] > Finished

Basecalling sup,6mA with batch size 704 reduces the basecalling time ~10x compared to auto batch size (832); samples/s are equivalent to the 5mC_5hmC model with batch size set to 704
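The ~10x figure can be projected from the progress bars: 9% done after 10 minutes extrapolates to roughly 111 minutes total, versus 10m55s for the fixed batch size. A rough check (linear extrapolation is my assumption):

```python
# 0.5.3 sup,6mA auto batch: 9% done after 10m00s -> projected total runtime.
projected_auto_min = 10.0 / 0.09   # ~111 minutes
# 0.5.3 sup,6mA --batchsize 704 finished in 10m55s.
fixed_min = 10 + 55 / 60
print(f"~{projected_auto_min / fixed_min:.0f}x faster with --batchsize 704")  # -> ~10x
```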

6mA 0.6.0 batch size auto

C:\dorado\bin\dorado.exe basecaller sup,6mA --no-trim --kit-name SQK-RBK114-24 --recursive .\pod5_pass\ > .\batch_0.6.0_6mA_auto.bam

[2024-04-22 16:14:47.238] [info] > Creating basecall pipeline
[2024-04-22 16:14:59.101] [info] cuda:0 using chunk size 9996, batch size 896
[2024-04-22 16:14:59.903] [info] cuda:0 using chunk size 4998, batch size 1024
[=> ] 5% [17m:16s<05h:28m:12s] Basecalling

6mA 0.6.0 batch size 704

[2024-04-22 16:33:22.810] [info] - downloading dna_r10.4.1_e8.2_400bps_sup@v4.3.0 with httplib
[2024-04-22 16:33:25.196] [info] - downloading dna_r10.4.1_e8.2_400bps_sup@v4.3.0_6mA@v2 with httplib
[2024-04-22 16:33:25.651] [info] > Creating basecall pipeline
[2024-04-22 16:33:27.937] [info] cuda:0 using chunk size 9996, batch size 704
[2024-04-22 16:33:28.668] [info] cuda:0 using chunk size 4998, batch size 704
[========> ] 27% [01h:15m:52s<03h:25m:08s] Basecalling

samuelmontgomery commented 4 months ago

Hi,

Just following up now that the new models (sup@v5.0.0) have been released with dorado v0.7.0

Using v0.7.0 with the sup@v5.0.0 model, dorado is basecalling at ~33% of the samples/s compared to both v0.5.3 and v0.6.0 with sup@v4.3.0 models

Additionally, the chunk size has changed with the new models compared to sup@v4.3.0 - will that result in different basecalled data compared to the older models? I understand chunk size is important for reproducibility of basecalling

sup@v5.0.0

[2024-05-22 13:10:10.008] [info] cuda:0 using chunk size 12288, batch size 128
[2024-05-22 13:10:10.278] [info] cuda:0 using chunk size 6144, batch size 128

sup@v4.3.0

[2024-05-22 16:18:06.989] [info] cuda:0 using chunk size 9996, batch size 896
[2024-05-22 16:18:07.787] [info] cuda:0 using chunk size 4998, batch size 1024

Some benchmarking below

0.5.3 sup@v4.3.0 auto batchsize

C:\dorado_0.5.3\bin\dorado.exe basecaller sup --no-trim --kit-name SQK-NBD114-96 --recursive .\pod5_pass\barcode01 > .\basecalling_test\barcode01_batch5.3.bam

[2024-05-22 15:01:33.838] [info] > Simplex reads basecalled: 45738
[2024-05-22 15:01:33.838] [info] > Simplex reads filtered: 6
[2024-05-22 15:01:33.838] [info] > Basecalled @ Samples/s: 1.074588e+07
[2024-05-22 15:01:33.838] [info] > 45883 reads demuxed @ classifications/s: 1.317360e+02
[2024-05-22 15:01:33.882] [info] > Finished

0.7.0 sup@v5.0.0 auto batchsize

C:\dorado_0.7.0\bin\dorado.exe basecaller sup --no-trim --kit-name SQK-NBD114-96 --recursive .\pod5_pass\barcode01 > .\basecalling_test\barcode01.bam

[2024-05-22 13:10:00.651] [info] > Creating basecall pipeline
[2024-05-22 13:10:10.008] [info] cuda:0 using chunk size 12288, batch size 128
[2024-05-22 13:10:10.278] [info] cuda:0 using chunk size 6144, batch size 128
[==============================] 100% [16m:46s<00m:00s] Sorting output files
[2024-05-22 10:07:39.284] [info] > Simplex reads basecalled: 45738
[2024-05-22 10:07:39.284] [info] > Basecalled @ Samples/s: 3.715079e+06
[2024-05-22 10:07:39.284] [info] > 45882 reads demuxed @ classifications/s: 4.554293e+01
[2024-05-22 10:07:39.352] [info] > Finished

0.7.0 sup@v4.3.0 auto batchsize

C:\dorado_0.7.0\bin\dorado.exe basecaller sup@v4.3.0 --no-trim --kit-name SQK-NBD114-96 --recursive .\pod5_pass\barcode01 > .\basecalling_test\barcode01_batch4.3.bam

[2024-05-22 15:45:55.210] [info] > Creating basecall pipeline
[2024-05-22 16:18:06.989] [info] cuda:0 using chunk size 9996, batch size 896
[2024-05-22 16:18:07.787] [info] cuda:0 using chunk size 4998, batch size 1024
[==============================] 100% [05m:59s<00m:00s] Sorting output files
[2024-05-22 16:13:30.982] [info] > Simplex reads basecalled: 45738
[2024-05-22 16:13:30.982] [info] > Simplex reads filtered: 6
[2024-05-22 16:13:30.983] [info] > Basecalled @ Samples/s: 1.036893e+07
[2024-05-22 16:13:30.983] [info] > 45883 reads demuxed @ classifications/s: 1.271149e+02
[2024-05-22 16:13:31.004] [info] > Finished

As you can see, the sup@v5.0.0 model runs at ~1/3 the speed of the sup@v4.3.0 model regardless of dorado version. I will test with modifications as well, as maybe the new transformer models make up time there

samuelmontgomery commented 4 months ago

There is only a ~25% drop in speed when running the 4mC_5mC & 6mA modified models compared to canonical alone - meaning that while there is a fairly large drop in canonical basecalling speed going from sup@v4.3.0 to sup@v5.0.0, modified basecalling runs at a similar speed

C:\dorado_0.7.0\bin\dorado.exe basecaller sup,4mC_5mC --no-trim --kit-name SQK-NBD114-96 --recursive .\pod5_pass\barcode01 > .\barcode01_4mC.bam

[2024-05-23 13:19:10.755] [info] cuda:0 using chunk size 12288, batch size 128
[2024-05-23 13:19:11.050] [info] cuda:0 using chunk size 6144, batch size 128
[2024-05-22 16:52:22.147] [info] > Simplex reads basecalled: 45738
[2024-05-22 16:52:22.147] [info] > Basecalled @ Samples/s: 2.778261e+06
[2024-05-22 16:52:22.147] [info] > 45882 reads demuxed @ classifications/s: 3.405854e+01
[2024-05-22 16:52:22.208] [info] > Finished

C:\dorado_0.7.0\bin\dorado.exe basecaller sup,4mC_5mC,6mA --no-trim --kit-name SQK-NBD114-96 --recursive .\pod5_pass\barcode01 > .\barcode01_4mC_6mA.bam

[2024-05-23 13:19:10.755] [info] cuda:0 using chunk size 12288, batch size 128
[2024-05-23 13:19:11.050] [info] cuda:0 using chunk size 6144, batch size 128
[==============================] 100% [27m:33s<00m:00s] Sorting output files
[2024-05-23 11:58:58.339] [info] > Simplex reads basecalled: 45738
[2024-05-23 11:58:58.339] [info] > Basecalled @ Samples/s: 2.262337e+06
[2024-05-23 11:58:58.339] [info] > 45882 reads demuxed @ classifications/s: 2.773385e+01
[2024-05-23 11:58:58.415] [info] > Finished

C:\dorado_0.7.0\bin\dorado.exe basecaller sup,6mA --no-trim --kit-name SQK-NBD114-96 --recursive .\pod5_pass\barcode01 > .\barcode01_6mA.bam

[2024-05-23 13:19:10.755] [info] cuda:0 using chunk size 12288, batch size 128
[2024-05-23 13:19:11.050] [info] cuda:0 using chunk size 6144, batch size 128
[==============================] 100% [23m:35s<00m:00s] Sorting output files
[2024-05-23 13:42:48.060] [info] > Simplex reads basecalled: 45738
[2024-05-23 13:42:48.060] [info] > Basecalled @ Samples/s: 2.643497e+06
[2024-05-23 13:42:48.060] [info] > 45882 reads demuxed @ classifications/s: 3.240648e+01
[2024-05-23 13:42:48.136] [info] > Finished
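For reference, the drop relative to canonical sup@v5.0.0 (Samples/s 3.715079e+06 in the earlier benchmark) works out as follows from the figures above:

```python
canonical = 3.715079e6  # sup@v5.0.0, no mods (earlier benchmark)
mods = {
    "sup,4mC_5mC":     2.778261e6,
    "sup,4mC_5mC,6mA": 2.262337e6,
    "sup,6mA":         2.643497e6,
}
for name, rate in mods.items():
    # Percentage drop in Samples/s relative to canonical-only basecalling.
    print(f"{name}: {100 * (1 - rate / canonical):.0f}% slower than canonical")
```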

HalfPhoton commented 4 months ago

Hi @samuelmontgomery, the sup@v5.0.0 models use a new model architecture which we are still optimising. You can expect to see improvements in sup v5 basecalling speed in future dorado releases.

samuelmontgomery commented 4 months ago

Thanks @HalfPhoton - I will close this, as it was more related to modified basecalling and I think 0.7.0 has solved the massive slowdown I was seeing with 0.6.0 & 0.6.1; it is now comparable to the speed of 0.5.3, with obvious benefits!