SouvilleL opened 2 months ago
Hi @SouvilleL, Apologies for the delayed reply.
We've seen other issues when basecalling recovered data from failed / interrupted runs and we're looking into this.
However, could you update the initial message to reformat and clarify the values in this table:
Model | Reads basecalled | Reads filtered
--- | --- | ---
Fast | 7,826,867 | 4,141
Hac | 480,717 | 7,346,151
Sup | 716,401 | 7,110,470
Also, the second log message marked Tail has the same information as Head.
Kind regards, Rich
Hi @HalfPhoton, my apologies for the delay. I have edited the formatting and the tail log message.
I have the same issue with the version 5.0.0 models. I have a sample that I previously basecalled with dorado 0.4.x using the sup model, yielding 3.5 million reads. Now 94% of the reads are lost when I re-basecall with the sup model. Could you add a flag that skips the filtering entirely?
dorado --version: 0.8.3
I tested this on a small set of 10,000 reads with dorado v0.8.2 and three different sup model versions (4.2.0, 4.3.0, and 5.0.0).
Reads filtered: 0, 456, and 9,379, respectively.
The commands used to run dorado are below; only the sup model version changed.
❯ dorado basecaller --device "cuda:0" /data/software/dorado-0.8.2-linux-x64/models/dna_r10.4.1_e8.2_400bps_sup@v4.2.0/ pod5s/ > calls.bam
[2024-11-19 13:57:52.633] [info] Running: "basecaller" "--device" "cuda:0" "/data/software/dorado-0.8.2-linux-x64/models/dna_r10.4.1_e8.2_400bps_sup@v4.2.0/" "pod5s/"
[2024-11-19 13:57:52.728] [info] Normalised: chunksize 10000 -> 9996
[2024-11-19 13:57:52.728] [info] Normalised: overlap 500 -> 498
[2024-11-19 13:57:52.728] [info] > Creating basecall pipeline
[2024-11-19 13:57:54.984] [info] Calculating optimized batch size for GPU "NVIDIA GeForce RTX 4090" and model /data/software/dorado-0.8.2-linux-x64/models/dna_r10.4.1_e8.2_400bps_sup@v4.2.0/. Full benchmarking will run for this device, which may take some time.
[2024-11-19 13:58:06.385] [info] cuda:0 using chunk size 9996, batch size 1088
[2024-11-19 13:58:07.418] [info] cuda:0 using chunk size 4998, batch size 2048
[2024-11-19 13:58:29.926] [info] > Simplex reads basecalled: 10000
[2024-11-19 13:58:29.926] [info] > Basecalled @ Samples/s: 7.957384e+06
[2024-11-19 13:58:29.929] [info] > Finished
❯ dorado basecaller --device "cuda:0" /data/software/dorado-0.8.2-linux-x64/models/dna_r10.4.1_e8.2_400bps_sup@v4.3.0/ pod5s/ > calls.bam
[2024-11-19 13:54:41.138] [info] Running: "basecaller" "--device" "cuda:0" "/data/software/dorado-0.8.2-linux-x64/models/dna_r10.4.1_e8.2_400bps_sup@v4.3.0/" "pod5s/"
[2024-11-19 13:54:41.224] [info] Normalised: chunksize 10000 -> 9996
[2024-11-19 13:54:41.224] [info] Normalised: overlap 500 -> 498
[2024-11-19 13:54:41.224] [info] > Creating basecall pipeline
[2024-11-19 13:54:43.279] [info] Calculating optimized batch size for GPU "NVIDIA GeForce RTX 4090" and model /data/software/dorado-0.8.2-linux-x64/models/dna_r10.4.1_e8.2_400bps_sup@v4.3.0/. Full benchmarking will run for this device, which may take some time.
[2024-11-19 13:54:53.069] [info] cuda:0 using chunk size 9996, batch size 960
[2024-11-19 13:54:54.028] [info] cuda:0 using chunk size 4998, batch size 1920
[2024-11-19 13:55:17.834] [info] > Simplex reads basecalled: 9545
[2024-11-19 13:55:17.834] [info] > Simplex reads filtered: 456
[2024-11-19 13:55:17.834] [info] > Basecalled @ Samples/s: 7.401725e+06
[2024-11-19 13:55:17.838] [info] > Finished
❯ dorado basecaller --device "cuda:0" /data/software/dorado-0.8.2-linux-x64/models/dna_r10.4.1_e8.2_400bps_sup@v5.0.0/ pod5s/ > calls.bam
[2024-11-19 13:53:18.393] [info] Running: "basecaller" "--device" "cuda:0" "/data/software/dorado-0.8.2-linux-x64/models/dna_r10.4.1_e8.2_400bps_sup@v5.0.0/" "pod5s/"
[2024-11-19 13:53:18.481] [info] > Creating basecall pipeline
[2024-11-19 13:53:23.605] [info] Calculating optimized batch size for GPU "NVIDIA GeForce RTX 4090" and model /data/software/dorado-0.8.2-linux-x64/models/dna_r10.4.1_e8.2_400bps_sup@v5.0.0/. Full benchmarking will run for this device, which may take some time.
[2024-11-19 13:53:29.421] [info] cuda:0 using chunk size 12288, batch size 224
[2024-11-19 13:53:30.078] [info] cuda:0 using chunk size 6144, batch size 224
[2024-11-19 13:54:23.312] [info] > Simplex reads basecalled: 621
[2024-11-19 13:54:23.312] [info] > Simplex reads filtered: 9379
[2024-11-19 13:54:23.312] [info] > Basecalled @ Samples/s: 3.193841e+06
[2024-11-19 13:54:23.314] [info] > Finished
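For reference, a minimal sketch of how this three-model comparison can be scripted (assuming samtools is installed, the model directories are already downloaded, and the 10k-read subset sits in `pod5s/`; variable names and paths are placeholders):

```bash
#!/usr/bin/env bash
# Sketch: basecall the same pod5 subset with several sup model versions and
# count how many reads survive filtering. Counts are approximate because
# dorado may split reads into subreads.
set -euo pipefail

POD5_DIR=pod5s/                                              # 10k-read subset
MODEL_ROOT=/data/software/dorado-0.8.2-linux-x64/models      # as in the logs above
TOTAL_READS=10000                                            # reads in the subset

for VERSION in v4.2.0 v4.3.0 v5.0.0; do
    MODEL="${MODEL_ROOT}/dna_r10.4.1_e8.2_400bps_sup@${VERSION}"
    OUT="calls_${VERSION}.bam"
    dorado basecaller --device cuda:0 "${MODEL}" "${POD5_DIR}" > "${OUT}"
    CALLED=$(samtools view -c "${OUT}")                      # records written to the BAM
    echo "${VERSION}: basecalled=${CALLED} filtered=$((TOTAL_READS - CALLED))"
done
```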
@Kirk3gaard could you share these reads with us please? We would like to reproduce this issue internally asap.
@Kirk3gaard is this on RTX 4090? @SebastianDall what GPU did you see this on?
10k read test data from @SebastianDall : https://www.dropbox.com/scl/fi/9f51sxtd5mmxxxpjcg8gm/10k.pod5?rlkey=b6sbf4gzipg3b830yusxra4mf&dl=0
Yes, we have seen the issue on RTX 4090 and A10 GPUs.
I am also experiencing this issue on a test set of 24,000 recovered reads with dorado v0.7.2, running on a Tesla V100S PCIe.

Model | Reads basecalled | Reads filtered
--- | --- | ---
sup@v4.2.0 | 23,976 | 24
sup@v4.3.0 | 902 | 23,098
sup@v5.0.0 | 0 | 24,000
>dorado basecaller -r -x "cuda:all" --no-trim --verbose dorado_models/dna_r10.4.1_e8.2_400bps_sup@v4.2.0/ pod5/ > calls.bam
[2024-11-19 07:54:22.275] [info] Running: "basecaller" "-r" "-x" "cuda:all" "--no-trim" "--verbose" "dorado_models/dna_r10.4.1_e8.2_400bps_sup@v4.2.0/" "pod5/"
[2024-11-19 07:54:22.319] [info] Normalised: chunksize 10000 -> 9996
[2024-11-19 07:54:22.319] [info] Normalised: overlap 500 -> 498
[2024-11-19 07:54:22.319] [info] > Creating basecall pipeline
[2024-11-19 07:56:56.619] [info] > Simplex reads basecalled: 23976
[2024-11-19 07:56:56.619] [info] > Simplex reads filtered: 24
[2024-11-19 07:56:56.619] [info] > Basecalled @ Samples/s: 6.513212e+06
[2024-11-19 07:56:56.635] [info] > Finished
>dorado basecaller -r -x "cuda:all" --no-trim --verbose dorado_models/dna_r10.4.1_e8.2_400bps_sup@v4.3.0/ pod5/ > calls.bam
[2024-11-19 07:38:53.450] [info] Running: "basecaller" "-r" "-x" "cuda:all" "--no-trim" "--verbose" "dorado_models/dna_r10.4.1_e8.2_400bps_sup@v4.3.0/" "pod5/"
[2024-11-19 07:38:53.501] [info] Normalised: chunksize 10000 -> 9996
[2024-11-19 07:38:53.501] [info] Normalised: overlap 500 -> 498
[2024-11-19 07:38:53.501] [info] > Creating basecall pipeline
[2024-11-19 07:41:33.760] [info] > Simplex reads basecalled: 902
[2024-11-19 07:41:33.760] [info] > Simplex reads filtered: 23098
[2024-11-19 07:41:33.760] [info] > Basecalled @ Samples/s: 5.525874e+06
[2024-11-19 07:41:33.788] [info] > Finished
>dorado basecaller -r -x "cuda:all" --no-trim --verbose dorado_models/dna_r10.4.1_e8.2_400bps_sup@v5.0.0/ pod5/ > calls.bam
[2024-11-19 07:58:17.100] [info] Running: "basecaller" "-r" "-x" "cuda:all" "--no-trim" "--verbose" "dorado_models/dna_r10.4.1_e8.2_400bps_sup@v5.0.0/" "pod5/"
[2024-11-19 07:58:17.143] [info] > Creating basecall pipeline
[2024-11-19 08:00:44.156] [info] > Simplex reads filtered: 24000
[2024-11-19 08:00:44.156] [info] > Basecalled @ Samples/s: 4.063903e+06
[2024-11-19 08:00:44.159] [info] > Finished
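If it helps with debugging, per-read stats for the reads that do pass can be dumped with `dorado summary`. Below is a sketch; the per-version BAM names are placeholders, and the `sequence_length_template` column name is assumed from the usual summary output, so check the TSV header for your dorado version:

```bash
# Sketch: compare length stats of the reads that passed filtering per model version.
# calls_v4.2.0.bam etc. are placeholder names for the outputs of the commands above.
for VERSION in v4.2.0 v4.3.0 v5.0.0; do
    BAM="calls_${VERSION}.bam"
    dorado summary "${BAM}" > "summary_${VERSION}.tsv" || true   # may warn if the BAM has no reads
    echo "== ${VERSION} =="
    awk -F'\t' '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "sequence_length_template") c = i; next }
        c       { n++; s += $c }
        END     { if (n) printf "reads=%d  mean_length=%.0f\n", n, s / n }
    ' "summary_${VERSION}.tsv"
done
```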
Issue Report
Please describe the issue:
Hello. As suggested by ONT support, I am submitting an issue encountered while using dorado basecaller.
When basecalling a particular dataset, Dorado filters a large number of reads, but only when using the hac or sup models; the fast model is unaffected. As I understand it, this behavior is not normal and Dorado is not supposed to filter this many reads.
The data come from a run that crashed during sequencing due to a lack of computer resources. Sequencing was done on a MinION Mk1B, on a Windows 10 machine.
In general, ~90% of reads are filtered (with the hac and sup models). Changing the Dorado version does not affect the number of filtered reads, and the number does not vary between repeated basecalling runs.
Table of basecalling results for each model:

Model | Reads basecalled | Reads filtered
--- | --- | ---
Fast | 7,826,867 | 4,141
Hac | 480,717 | 7,346,151
Sup | 716,401 | 7,110,470
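Since these reads were recovered from an interrupted run, one thing worth checking is whether many of them have very short raw signals. A sketch using the pod5 tools is below; `pod5 view` and its `num_samples` column are assumed from current pod5 releases (check the TSV header), and `./dataset` and the 1000-sample threshold are placeholders:

```bash
# Sketch: tabulate raw signal lengths in the recovered pod5 data.
# Assumes the pod5 package (pip install pod5) and that `pod5 view` emits a TSV
# whose header includes a num_samples column; adjust the column name if needed.
pod5 view ./dataset/*.pod5 > reads.tsv
head -1 reads.tsv                                      # confirm which columns are present
awk -F'\t' '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "num_samples") c = i; next }
    { total++; if (c && $c < 1000) short++ }           # 1000 samples is an arbitrary cut-off
    END { printf "reads=%d  with_fewer_than_1000_samples=%d\n", total, short }
' reads.tsv
```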
Steps to reproduce the issue:
Run the command: dorado basecaller dna_r10.4.1_e8.2_400bps_hac@v4.3.0 ./dataset
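For completeness, a slightly expanded sketch of that reproduction step, assuming the model still needs to be downloaded and that output is redirected to a BAM file (the output name is a placeholder):

```bash
# Reproduction sketch; ./dataset and calls.bam are placeholders.
dorado download --model dna_r10.4.1_e8.2_400bps_hac@v4.3.0        # skip if the model is already local
dorado basecaller --verbose dna_r10.4.1_e8.2_400bps_hac@v4.3.0 ./dataset > calls.bam
```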
Run environment:
Dorado version: 0.6.2 / 0.7.3
Dorado command: dorado basecaller dna_r10.4.1_e8.2_400bps_hac@v4.3.0 ./dataset
Operating system: Ubuntu 22.04.4 LTS
Hardware (CPUs, Memory, GPUs): Intel Xeon w9-3495X, 2x RTX A6000, 512 GB DDR5
Source data type (e.g., pod5 or fast5 - please note we always recommend converting to pod5 for optimal basecalling performance): pod5
Source data location (on device or networked drive - NFS, etc.): On device
Details about data (flow cell, kit, read lengths, number of reads, total dataset size in MB/GB/TB):
Dataset to reproduce, if applicable (small subset of data to share as a pod5 to reproduce the issue): A data subset of 7GB is available to be shared.
Logs
Head and tail of the logs from the dataset's basecalling. The complete log is available to share but is too big to upload.
Head
[2024-08-21 11:55:33.089] [info] Running: "basecaller" "dna_r10.4.1_e8.2_400bps_hac@v4.3.0" "-vv" "2024A92_ET_2024A93/20240612_1029_MN35911_FAZ25801_b2e80db0/recovered_reads_sample1.1/"
[2024-08-21 11:55:33.098] [trace] Model option: 'dna_r10.4.1_e8.2_400bps_hac@v4.3.0' unknown - assuming path
[2024-08-21 11:55:33.098] [info] > Creating basecall pipeline
[2024-08-21 11:55:33.833] [debug] cuda:0 memory available: 49.66GB
[2024-08-21 11:55:33.833] [debug] cuda:0 memory limit 48.66GB
[2024-08-21 11:55:33.833] [debug] cuda:0 maximum safe estimated batch size at chunk size 9996 is 7296
[2024-08-21 11:55:33.833] [debug] cuda:0 maximum safe estimated batch size at chunk size 4998 is 14656
[2024-08-21 11:55:33.833] [debug] Auto batchsize cuda:0: testing up to 10240 in steps of 64
[2024-08-21 11:55:33.913] [debug] cuda:1 memory available: 50.36GB
[2024-08-21 11:55:33.913] [debug] cuda:1 memory limit 49.36GB
Tail
[2024-08-21 12:00:14.883] [trace] DSN: PORE_ADAPTER strategy 0 splits in read 14970106-032e-40b3-8d78-310333f688bb
[2024-08-21 12:00:14.883] [trace] Read 14970106-032e-40b3-8d78-310333f688bb split into 1 subreads
[2024-08-21 12:00:14.883] [trace] READ duration: 215 microseconds (ID: 14970106-032e-40b3-8d78-310333f688bb)
[2024-08-21 12:00:14.883] [trace] Processing read 1528b19f-8540-4950-a485-0ba43577e9f5; length 1
[2024-08-21 12:00:14.883] [trace] Detected 0 potential pore regions in read d5e9a936-7f02-44d2-9778-2854f2baa9f2
[2024-08-21 12:00:14.883] [trace] Running PORE_ADAPTER
[2024-08-21 12:00:14.883] [trace] DSN: PORE_ADAPTER strategy 0 splits in read d5e9a936-7f02-44d2-9778-2854f2baa9f2
[2024-08-21 12:00:14.883] [trace] Read d5e9a936-7f02-44d2-9778-2854f2baa9f2 split into 1 subreads
[2024-08-21 12:00:14.883] [trace] READ duration: 273 microseconds (ID: d5e9a936-7f02-44d2-9778-2854f2baa9f2)
[2024-08-21 12:00:14.883] [trace] Analyzing signal in read 1528b19f-8540-4950-a485-0ba43577e9f5
[2024-08-21 12:00:14.883] [trace] Detected 0 potential pore regions in read 1528b19f-8540-4950-a485-0ba43577e9f5
[2024-08-21 12:00:14.883] [trace] Running PORE_ADAPTER
[2024-08-21 12:00:14.883] [trace] DSN: PORE_ADAPTER strategy 0 splits in read 1528b19f-8540-4950-a485-0ba43577e9f5
[2024-08-21 12:00:14.883] [trace] Read 1528b19f-8540-4950-a485-0ba43577e9f5 split into 1 subreads
[2024-08-21 12:00:14.883] [trace] READ duration: 203 microseconds (ID: 1528b19f-8540-4950-a485-0ba43577e9f5)
[2024-08-21 12:00:15.046] [info] > Simplex reads basecalled: 31
[2024-08-21 12:00:15.046] [info] > Simplex reads filtered: 595969
[2024-08-21 12:00:15.046] [info] > Basecalled @ Samples/s: 4.812624e+07
[2024-08-21 12:00:15.046] [debug] > Including Padding @ Samples/s: 6.214e+07 (77.45%)
[2024-08-21 12:00:15.061] [info] > Finished