nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

No dx:i:-1 tag for simplex reads with duplex offspring in 0.4.0 #401

Closed dustin-cram closed 11 months ago

dustin-cram commented 11 months ago

Hi,

With 0.4.0 I no longer see any SAM records with the tag dx:i:-1. Is there any way to identify these simplex reads with duplex offspring? I would generally prefer to discard these and use only the duplex read.

vellamike commented 11 months ago

Hi @dustin-cram - Thanks for reporting this, we are looking into this issue.

vellamike commented 11 months ago

Hi @dustin-cram - we have identified the source of this issue and resolved it internally. We will release a fix very soon.

dustin-cram commented 11 months ago

Thanks for the quick fix @vellamike. I look forward to the release.

tijyojwad commented 11 months ago

The fix is now available on the GitHub master branch. We'll create a new build and release in a couple of days (in case other major issues show up that need to be fixed as well).

In the meantime, you can try out the fix in this release candidate build - https://cdn.oxfordnanoportal.com/software/analysis/dorado-0.4.1-rc1-linux-x64.tar.gz

dustin-cram commented 11 months ago

> In the meantime, you can try out the fix in this release candidate build - https://cdn.oxfordnanoportal.com/software/analysis/dorado-0.4.1-rc1-linux-x64.tar.gz

That works for me.

I'll leave the issue open until 0.4.1 is released to make the issue more visible to others.

diego-rt commented 11 months ago

Hi, I've actually been wondering about this for a while.

Can you elaborate on what exactly is a simplex read with duplex offspring? And how are they listed in the current 0.4.0 release then? Are they tagged with dx:i:1?

tijyojwad commented 11 months ago

Hi @diego-rt - here are details on the tags - https://github.com/nanoporetech/dorado#duplex

> what exactly is a simplex read with duplex offspring

This is a simplex read whose duplex pair was detected in the dataset, so dorado was able to call a duplex read for that pair.

diego-rt commented 11 months ago

Thanks for the info @tijyojwad

So if I understand correctly, this means that for each duplex read with tag dx:i:1, there are two simplex reads with tag dx:i:-1 ?

If so, I would suggest that either the dx:i:-1 reads shouldn't be emitted by default, or it should be more clearly documented that not filtering out dx:i:-1 reads will result in essentially 3 reads being emitted for the same DNA molecule.

tijyojwad commented 11 months ago

Hi @diego-rt -

> for each duplex read with tag dx:i:1, there are two simplex reads with tag dx:i:-1

Yes, in the perfect scenario. You can also derive the parent simplex reads from the duplex read id (read_1;read_2).

Yes, we will certainly document the output characteristics more explicitly. From a basecaller perspective it's better to output all the data and mark their source clearly. Then it's up to the downstream tools to determine how to filter/use it.
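
As an illustration of the read_1;read_2 naming mentioned above, here is a minimal Python sketch that splits a duplex read id into its two parent simplex ids. The function name `parent_read_ids` is hypothetical, and the sketch assumes the two parent ids are joined by exactly one semicolon, as described in this thread.

```python
from typing import Tuple


def parent_read_ids(duplex_read_id: str) -> Tuple[str, str]:
    """Split a duplex read id of the form 'read_1;read_2' into its two parent ids.

    Assumption: the duplex id is the two simplex ids joined by a single ';'.
    """
    first, sep, second = duplex_read_id.partition(";")
    # Reject ids with no separator, an empty half, or more than one ';'.
    if not sep or not first or not second or ";" in second:
        raise ValueError(f"not a duplex read id: {duplex_read_id!r}")
    return first, second
```

For example, `parent_read_ids("read_1;read_2")` gives the pair `("read_1", "read_2")`, while a plain simplex id raises `ValueError`.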

brunaeus commented 11 months ago

If one selects simplex reads without duplex offspring using dx:i:0 and duplex reads using dx:i:1, isn't that the same as filtering out reads with dx:i:-1?

Also, I was wondering whether the identified issue in the current 0.4.0 release completely misses out on simplex reads with duplex offspring, or are they somehow incorporated into one of the other two flags?

tijyojwad commented 11 months ago

> If one selects simplex reads without duplex offspring using dx:i:0 and duplex reads using dx:i:1, isn't that the same as filtering out reads with dx:i:-1?

yes correct
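
A minimal sketch of that filter, assuming the basecalls are available as uncompressed SAM text (for a BAM file, something like `samtools view -h` can produce it). The helper names `dx_value` and `drop_simplex_parents` are hypothetical. One caveat: this version keeps records that carry no dx tag at all, which differs slightly from selecting only dx:i:0 and dx:i:1.

```python
from typing import Iterable, Iterator, Optional


def dx_value(sam_line: str) -> Optional[int]:
    """Return the integer value of the dx tag on a SAM record, or None if absent."""
    fields = sam_line.rstrip("\n").split("\t")
    for field in fields[11:]:  # optional tags start at the 12th column
        if field.startswith("dx:i:"):
            return int(field[len("dx:i:"):])
    return None


def drop_simplex_parents(sam_lines: Iterable[str]) -> Iterator[str]:
    """Yield header lines and every record whose dx tag is not -1."""
    for line in sam_lines:
        if line.startswith("@") or dx_value(line) != -1:
            yield line
```

This only makes sense on output where the tags are correct, i.e. not on 0.4.0 output affected by this issue.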

> 0.4.0 release completely misses out on simplex reads with duplex offspring, or are they somehow incorporated into one of the other two flags

It folds all simplex reads into dx:i:0 regardless of whether they have duplex offspring or not. No read data is thrown away; only the tags are incorrect.
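
For output that was already produced with 0.4.0, one possible workaround is to ignore the dx tag and identify parents by name instead, since duplex read ids have the form read_1;read_2. The sketch below is only an illustration under stated assumptions: the function name `drop_parents_by_name` is hypothetical, a semicolon is assumed to appear only in duplex read names, and the parent ids embedded in a duplex name are assumed to match the simplex record names exactly. It holds all records in memory, so it suits small files or tests rather than a full run.

```python
from typing import Iterable, List


def drop_parents_by_name(sam_lines: Iterable[str]) -> List[str]:
    """Drop simplex records whose name appears as a parent in some duplex read id."""
    records = list(sam_lines)

    # First pass: collect parent ids from every duplex name ('read_1;read_2').
    parents = set()
    for line in records:
        if line.startswith("@"):
            continue  # skip SAM header lines
        qname = line.split("\t", 1)[0]
        if ";" in qname:
            parents.update(qname.split(";"))

    # Second pass: keep headers, duplex reads, and simplex reads with no offspring.
    kept = []
    for line in records:
        qname = line.split("\t", 1)[0]
        if line.startswith("@") or qname not in parents:
            kept.append(line)
    return kept
```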

brunaeus commented 11 months ago

@tijyojwad thanks for the clarification.

We are processing direct cDNA transcriptomic data, and with 0.4.0, using both dx:i:0 and dx:i:1 will lead to overrepresentation of simplex reads with duplex offspring. I have been waiting for this release for its ability to split concatenated reads, because we found a large number of concatenated reads in our dataset after running previous dorado releases. I will try dorado-0.4.1-rc1.

Thanks for the help.

tijyojwad commented 11 months ago

Dorado v0.4.1 was just released with the bug fix.