terminate called without an active exception

jazsakr commented 10 months ago

I recently updated dorado to the latest version (0.5.0) and I encountered an error I haven't seen in earlier versions. I ran the following command in debug mode:

dorado basecaller ~/packages/dorado/dorado_models/dna_r10.4.1_e8.2_400bps_sup@v4.3.0 ./ --recursive --modified-bases 5mCG_5hmCG --reference ~/references/mm39.fa --secondary no --verbose > unsorted.bam

And I got the following error (last few lines):

2023-12-20 23:43:50.834] [debug] Checking adapter/primer cDNA_VNP
[2023-12-20 23:43:50.834] [debug] Checking adapter/primer cDNA_SSP
[2023-12-20 23:43:50.834] [debug] Checking adapter/primer PCS110_forward
[2023-12-20 23:43:50.834] [debug] Checking adapter/primer PCS110_reverse
terminate called without an active exceptionm:39s<01d:03h:02m:35s]
Aborted (core dumped)

tijyojwad commented 10 months ago

Hi @jazsakr - thanks for reporting. Could you try v0.5.1? We pushed a few bug fixes that might address your problem. Otherwise you can use --no-trim option for now to avoid this issue and we'll look further into it

jazsakr commented 10 months ago

Unfortunately, updating to v0.5.1 did not work. I also tried the --no-trim flag using both versions, still produced the same error. Then I tried to basecall a different sample, with both versions and with and without the flag, same error.

tijyojwad commented 10 months ago

Hmm that's interesting that --no-trim didn't fix it. From the logs it looks like it happens during adapter trimming, but perhaps not? Would you be able to narrow this down to a subset of reads/small pod5 that you can share?

jazsakr commented 10 months ago

We have another machine in our lab (I will call server2) and I wanted to do another test without mapping to rule that out.

On both server1 and server2, I ran dorado basecaller ~/packages/dorado/dorado_models/dna_r10.4.1_e8.2_400bps_sup@v4.3.0 ./ --recursive --modified-bases 5mCG_5hmCG --verbose > unmapped.bam on the same sample I have been trying to basecall (I used v0.5.0 to keep consistency between my other samples from the same experiment that did not abort during basecalling).

Server1 aborted after basecalling 22% of the reads and server2 aborted after basecalling 96% of the reads. I looked at the last read in the bam file for both and each had a different read ID. When server1 aborted sooner, I went ahead and tried the same command (in this comment, not the original post) using v0.5.1 and it also aborted after basecalling 22% at the same read.

I think that certain reads are causing dorado to abort and in this particular sample of mine and I might have more than one read that is causing this. I think this is strange because I have had dorado flag "corrupted" reads but continue basecalling.

The reads I think are causing the abort are 462c557c-7cf6-4f21-9094-70165adf3f42 and 158a7155-4819-410b-9bb7-93f6ccc5cb4b.

The pod5 files with these reads are attach along with the last ten reads from the aborted bam files from the basecalling on server1 and server2 (both dorado versions). dorado_abort_reads.tar.gz

Thank you!

tijyojwad commented 10 months ago

amazing, thank you for sharing the data! I'll have a look at this ASAP and get back to you.

tijyojwad commented 10 months ago

Hi @jazsakr - I ran your exact command with your dataset on 0.5.0 and 0.5.1 and neither ran into any issues.

I have had dorado flag "corrupted" reads but continue basecalling

what do you mean by this? was this a warning from dorado? can you post what was reported?

jazsakr commented 10 months ago

Hi! I meant that in the past, sometimes I would get an error like this: [error] Failed to get read 802 signal: Invalid: Input data failed to decompress using zstd: (18446744073709551596 Data corruption detected). This is what makes me curious as to why it would abort if my data was corrupted.

I think, then, I am not identifying the reads throwing the error correctly. It is a large run with 69.2 Gbases. I keep trying to rerun it after omitting the pod5 file with the last read it aborted on, but it is not working. It will always abort in a few hours.

I tried to basecall a sample (I will call sample3) we sequenced after the samples that are causing the abort (sample1 referred to in my original post on 12/21 and sample2 referred to in my comment on 12/30), and the basecall completed successfully. I think there is something strange about sample1 and sample2.

Any suggestions how to identify the potential problematic reads that I could try to send?

tijyojwad commented 10 months ago

[error] Failed to get read 802 signal: Invalid: Input data failed to decompress using zstd: (18446744073709551596 Data corruption detected)

This usually means some sort of corruption in the pod5 file. This can cause unexpected failures, so I'm not surprised your runs fail. This could also explain why dorado fails around the same point, which is when it tries to process the corrupted file.

Unfortunately we don't log the file paths we're processing - I'll get that added to the next release as that would help narrow issues like this. If you can build Dorado from source, you can add a print statement here https://github.com/nanoporetech/dorado/blob/master/dorado/data_loader/DataLoader.cpp#L351 to see which file is being read before the failure occurs.

tijyojwad commented 10 months ago

Alternatively you can loop over each pod5 file in your directory and run dorado on it. The failing runs will highlight the corrupted files.

jazsakr commented 10 months ago

I see. But this message [error] Failed to get read 802 signal: Invalid: Input data failed to decompress using zstd: (18446744073709551596 Data corruption detected) was months ago from a completely different experiment, sample, even kit. The only error my sample1 and sample2 keeps throwing is terminate called without an active exception and nothing else.

I think I will try to loop!

jazsakr commented 10 months ago

Hi @tijyojwad. Some updates: I did the "loop basecalling" on two of the three samples that kept aborting. Here is some of the output from my script that creates a log file (Note: I filtered out any non-error messages unless it has Finished).

Sample2

Basecalling PAS38658_8b84a00a_15c3d01c_318.pod5
[2024-01-14 03:58:27.271] [error] Failed to get read 7 signal: Invalid: Input data failed to decompress using zstd: (18446744073709551596 Data corruption detected)
[2024-01-14 03:59:34.358] [info] > Finished
---
Basecalling PAS38658_8b84a00a_15c3d01c_550.pod5
terminate called without an active exception
---
Basecalling PAS38658_8b84a00a_15c3d01c_68.pod5
[2024-01-14 18:00:05.829] [error] Failed to get read 869 signal: Invalid: Input data failed to decompress using zstd: (18446744073709551596 Data corruption detected)
[2024-01-14 18:17:07.683] [info] > Finished

Sample4

Basecalling PAS24591_aefca11c_0b3b85eb_425.pod5
[2024-01-14 01:09:35.947] [error] Failed to get read 509 signal: Invalid: Input data failed to decompress using zstd: (18446744073709551596 Data corruption detected)
[2024-01-14 01:10:39.307] [info] > Finished
---
Basecalling PAS24591_aefca11c_0b3b85eb_535.pod5
[2024-01-14 05:42:18.573] [error] Failed to get read 770 signal: Invalid: Remaining data at end of signal buffer
[2024-01-14 05:47:56.043] [info] > Finished
---
Basecalling PAS24591_aefca11c_0b3b85eb_69.pod5
terminate called without an active exception
---
Basecalling PAS24591_aefca11c_0b3b85eb_893.pod5
[2024-01-14 15:04:59.659] [error] Failed to get read 549 signal: Invalid: Remaining data at end of signal buffer
[2024-01-14 15:06:26.522] [info] > Finished

Then I thought to compare the read IDs from some pod5 files to their corresponding bam file as a sanity check and I also wanted to see if I can isolate the reads that caused errors. I looked at PAS38658_8b84a00a_15c3d01c_68.pod5 and found that not only do some read IDs exist unique to the pod5, but there are also read IDs in the bam file that are not in the pod5 file. Then I basecalled a file that finished successfully (PAS38658_8b84a00a_15c3d01c_0.pod5) and observed the same thing. I also observed the same thing when I tried a different dorado version. Why are there read IDs in the bam file that don't exist in its corresponding pod5 file? Are they de-concatenated reads from adapter trimming with a new read IDs? Why are there reads in the pod5 file that don't exist in its corresponding bam file when I did not set a quality threshold? Here are the commands I ran:

dorado basecaller ~/packages/dorado/dorado_models/dna_r10.4.1_e8.2_400bps_sup@v4.3.0 ./pod5/PAS38658_8b84a00a_15c3d01c_0.pod5 --modified-bases 5mCG_5hmCG --verbose > PAS38658_8b84a00a_15c3d01c_0.bam

pod5 view --ids pod5/PAS38658_8b84a00a_15c3d01c_0.pod5 | grep -v read_id | sort > PAS38658_8b84a00a_15c3d01c_0_pod5.txt

samtools view PAS38658_8b84a00a_15c3d01c_0.bam | awk '{print $1}' | sort > PAS38658_8b84a00a_15c3d01c_0_bam.txt

diff PAS38658_8b84a00a_15c3d01c_0_pod5.txt PAS38658_8b84a00a_15c3d01c_0_bam.txt

Thank you so much for your help!

For anyone who sees this post and needs to do the same, I posted the code to "loop" basecall and concatenate the bam files here on GitHub.

tijyojwad commented 10 months ago

Why are there read IDs in the bam file that don't exist in its corresponding pod5 file?

This is mainly due to read splitting. Any split reads will have the parent read id in the pi:Z tag of the BAM record, so you can associate it with the original read id from the pod5.

Why are there reads in the pod5 file that don't exist in its corresponding bam file when I did not set a quality threshold?

If a read is split, then subreads are assigned new read ids. So you won't find the original pod5 read id (and same as before, you can use the pi:Z tag to locate the parent.

There's also default filtering in dorado filters out any reads whose basecalled length is <= 5 bp. But this should only affect very few reads.

jazsakr commented 9 months ago

Thank you for the explanation!

I basecalled one of the files that caused the abort in verbose mode (I took some of the Auto batchsize lines out). Looks like it didn't even basecall one read. Is there anything else I can try to figure out what is going on?

$ ~/packages/dorado/dorado-0.5.1-linux-x64/bin/dorado basecaller ~/packages/dorado/dorado_models/dna_r10.4.1_e8.2_400bps_sup@v4.3.0 ./pod5/PAS24591_aefca11c_0b3b85eb_69.pod5 --modified-bases 5mCG_5hmCG --device "cuda:0" --verbose > PAS24591_aefca11c_0b3b85eb_69.bam
[2024-01-19 21:21:42.509] [debug] - matching modification model found: dna_r10.4.1_e8.2_400bps_sup@v4.3.0_5mCG_5hmCG@v1
[2024-01-19 21:21:42.509] [info] > Creating basecall pipeline
[2024-01-19 21:21:43.229] [debug] cuda:0 memory available: 50.12GB
[2024-01-19 21:21:43.229] [debug] Auto batchsize cuda:0: memory limit 49.12GB
[2024-01-19 21:21:43.229] [debug] Auto batchsize cuda:0: testing up to 1984 in steps of 64
[2024-01-19 21:21:43.407] [debug] Auto batchsize cuda:0: 64, time per chunk 1.328845 ms
... 
[2024-01-19 21:22:03.167] [debug] Auto batchsize cuda:0: 1984, time per chunk 0.312267 ms
[2024-01-19 21:22:03.167] [debug] Device cuda:0 Model memory 10.92GB
[2024-01-19 21:22:03.167] [debug] Device cuda:0 Decode memory 4.51GB
[2024-01-19 21:22:03.208] [info]  - set batch size for cuda:0 to 640
[2024-01-19 21:22:03.208] [debug] - adjusted chunk size to match model stride: 10000 -> 9996
[2024-01-19 21:22:03.209] [debug] - adjusted overlap to match model stride: 500 -> 498
terminate called without an active exception
Aborted (core dumped)

tijyojwad commented 9 months ago

I don't see anything odd with the dorado setup, that looks to be fine.

Can you add --max-reads 1 to your file to see if the file is just totally borked, or if this is a read specific issue?

tijyojwad commented 5 months ago

Closing due to inactivity.

nanoporetech / dorado

terminate called without an active exception #548