nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/
Other
452 stars 53 forks source link

Duplex reads longer than the simplex parent(s)? #708

Open minefield47 opened 3 months ago

minefield47 commented 3 months ago

Hi,

I was playing around with quantifying the differences between my duplex reads and the simplex parents and I stumbled upon an issue where the duplex read is longer than a simplex parent (in some cases the duplex read is longer than both simplex parents!). From my understanding of duplexing, the maximum length of a duplex read should be equal to the length of the shortest simplex parent. A duplex read should definitely not be longer than either of its simplex parents. This has been found across multiple libraries now and I wanted to know how this could happen? image

The current workflow is: dorado duplex sup dna_r10.4.1_e8.2_400bps_sup@v4.3.0-> dorado trim -> dorado summary

Thank you! Minefield47

ymcki commented 3 months ago

Maybe the parents overlap only at a fraction of themselves?

--------------------------- parent1
                -------------------------------- parent2

------------------------------------------------ duplex
tijyojwad commented 3 months ago

we expect duplex reads to be shorter than both the parents. duplex is only run on the overlapping portion of the 2 reads.

Can you check on the read lengths before running dorado trim?

minefield47 commented 3 months ago

@ymcki That would be the ideal! However, at the opposite end of the spectrum I have reads where the simplex parents are significantly longer than the duplex parent. I have also been told previously by tijyojwad and others that the duplex read should only be of the overlapping region.

image
minefield47 commented 3 months ago

Speaking of, Hi @tijyojwad!!

we expect duplex reads to be shorter than both the parents. duplex is only run on the overlapping portion of the 2 reads.

Ok, that is what I would have expected as well...

Can you check on the read lengths before running dorado trim?

Give me a bit to run dorado summary on those files. It seems like you are thinking dorado trim is finding primer/adapter sequences in some of my simplex parents (that then get removed) but are not getting detected/removed in my duplex sequence?

Will update soon! minefield

tijyojwad commented 3 months ago

yep exactly - since the length differences are short enough I suspect that might be the issue

minefield47 commented 3 months ago

Hi @tijyojwad A couple of things...

1) Untrimmed Vs. Trimmed: While there are significantly less (suggesting what we discussed)...there are still reads of which the template is shorter than the duplex read. However, none of the untrimmed complements are shorter than the duplex...so a small win? image

2) The new problem: Due to how my workflow is setup on my HPC, I opted to simply rerun my workflow on the raw pod5 library (pod5 subset and then run dorado duplex -> dorado summary -> dorado trim -> dorado summary for each channel to generate the untrimmed/trimmed summary without having to write a whole new script to run dorado summary on my untrimmed files. The problem is now the same library produced two different results. The parameters for both inputs should be identical (I only use the A100 GPUs now), with the only change being a run of dorado summary between dorado duplex and dorado trim...thoughts?

image
minefield47 commented 3 months ago

Quick Update: For the previous utilization I was trying to run dorado in parallel across each channel for a couple of reasons. I just rewrote my workflow and ran dorado this morning when no one was using the GPUs to see if just running dorado duplex on the pod5 directory (vs. each individual channel)... I searched the issues/ReadME and I don't see anything about Dorado being unable to perform simultaneously on the same GPU...I was thinking about it some today and I got curious if the architecture could be the cause of this discrepancy? Do I need to worry about other users utilizing the same GPU as me (the cluster is load sharing)when basecalling.

Also, I did a quick comparison to the A100 vs the H100s installed recently. I am going to run it a couple more times to get some more sound stats but it looks like the simplex basecalling is identical between the A100/H100PCIe...but there is a slight difference in duplex basecalling...if you would prefer I can open a new ticket regarding this...

image

Left is the H100, Right is the A100.

tijyojwad commented 3 months ago

there are still reads of which the template is shorter than the duplex rea

This is interesting. There could be a scenario where the template had a deletion and the complement didn't, and duplex was able to recover the deleted sequence. This could plausibly lead to a duplex read being longer than one of the parents. Would you able to align the duplex read to the template and see if that's the case?

dorado duplex isn't deterministic, as the pairing is susceptible to some ordering differences within the pipeline. It's a known issue, and we've made some changes to reduce the source of non determinism but some still exist. We haven't prioritized fully fixing it yet.

In general we strongly suggest have sole access to the gpu when running dorado. it reduces the chances of OOM due to interference from other processes, makes performance more reliable. The outputs should be deterministic for simplex basecalling for A100/H100. But duplex outputs can differ because of slightly pairing differences from run to run.

minefield47 commented 3 months ago

Hi @tijyojwad,

I see. I had the script designed to redo the channel if an OOM error occurred. However, I had a test run finished where no OOM errors were found in the std.err, and aggregating each channel's bam/tsv into one large bam/tsv still produced the random basecalling results. Since there is no error message, I have no way to know what is actually going on there and don't have the time to investigate, I already got approval from the HPC IT dept to utilize exclusive GPU access anyways.

While I am now consistently getting the same number of simplex reads (296,112 from Dorado's std.out regardless of A100 v. H100), the TSV file has either 299,271 (h100) or 299,262 (a100) simplex reads. I repeated this 3 times on each GPU to the same results each time (Dorado std.out: 296,112 simplex / 15,777 or 15,790, H100 TSV: 299,271 simplex/15,777 duplex, A100 TSV: 299,262 simplex/15,790 duplex). Even adding the Dorado's duplex/simplex std.out of the H100 (296,112 simplex + 15,777 duplex = 311,889) does not total rows of simply reading in the TSV to either R or excel (315,048 for h100). If it matters: I currently subset the duplex from the simplex simply by the filename as duplex reads do not have a filename while simplex reads do. I get the right number of duplex reads, the right total for the number of simplex/duplex reads I input (299,271 + 15,777 = 315,048) which is the same number of rows in the TSV.

Edit: I just checked and this is true for the untrimmed tsv as well.