Would I use Dorado correct for all reads or simplex reads

hungweichen0327 commented 4 months ago

Dear community,

I would like to know which kinds of reads should be used as input for error correction. Should I use both duplex and simplex reads (simplex including dx:0 and dx:-1), or only the simplex reads?

Any comments or suggestions would be appreciated!

vellamike commented 4 months ago

We suggest using dorado correct only on simplex reads. It might work with duplex reads, but we haven't tested this extensively yet.

bruzecruise commented 4 months ago

@vellamike Adding a question here: have any of these scenarios been tested for genome assembly?

Would a pipeline that runs dorado duplex first, followed by dorado correct on the simplex non-donor reads, make sense? I would guess this is fine as long as I have enough non-donor simplex coverage?

Or is dorado correct best done with all the read data possible and thus dorado duplex is not recommended here?

Or can you just pipe the duplex reads and simplex non-donor reads into dorado correct without worries?

Thanks!

tijyojwad commented 4 months ago

Hi @bruzecruise - you have the right intuition that the behavior of dorado correct depends on simplex coverage. Filtering out reads (such as dropping the donor simplex ones) reduces coverage and therefore the effectiveness of correction. We haven't run studies to quantify this effect, though, so for now we suggest running dorado correct on the simplex basecalls.

P.S. - dorado correct doesn't work with piping, since it generates an index of the read file. You'll need to store the output of dorado basecaller as a FASTQ file (with the --emit-fastq option) and pass that file to dorado correct.
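The two-step workflow described above can be sketched as follows. This is a minimal illustration, not an official recipe: the helper names (`basecall_cmd`, `correct_cmd`, `run_to_file`), the model shorthand `hac`, the `pod5s/` directory, and the output file names are all placeholders chosen for the example. The key point is that the basecaller output is written to a file on disk before dorado correct is started, instead of being piped.

```python
import subprocess
from pathlib import Path
from typing import List, Union

PathLike = Union[str, Path]


def basecall_cmd(model: str, pod5_dir: PathLike) -> List[str]:
    """Build a dorado basecaller command that emits FASTQ on stdout."""
    return ["dorado", "basecaller", model, str(pod5_dir), "--emit-fastq"]


def correct_cmd(fastq: PathLike) -> List[str]:
    """Build a dorado correct command that reads a FASTQ file from disk."""
    return ["dorado", "correct", str(fastq)]


def run_to_file(cmd: List[str], out_path: PathLike) -> None:
    """Run a command and write its stdout to a file (no piping between steps)."""
    with open(out_path, "wb") as out:
        subprocess.run(cmd, stdout=out, check=True)


if __name__ == "__main__":
    # Placeholder inputs and outputs; adjust to your own data.
    reads = Path("reads.fastq")
    run_to_file(basecall_cmd("hac", "pods5/".replace("pods5", "pod5s")), reads)
    # dorado correct indexes the read file, so it needs the file itself.
    run_to_file(correct_cmd(reads), "corrected_reads.fasta")
```

Writing the intermediate FASTQ costs disk space, but it lets dorado correct build its index on a complete, seekable file.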

hungweichen0327 commented 4 months ago

Dear @vellamike and @tijyojwad,

I would like to confirm/ask:

Q1. Which simplex reads do you suggest for read correction: both dx:0 and dx:-1, or dx:0 only? I assume the answer is both dx:0 and dx:-1?

Q2. My other question is about the input reads for genome assembly. I know many people use only simplex non-donor reads (dx:0) and duplex reads (dx:1) to assemble the genome. However, as discussed in https://github.com/nanoporetech/dorado/issues/363#issuecomment-1712259340 and https://github.com/nanoporetech/dorado/issues/327#issuecomment-1695818336: (1) duplex calling can actually change the bases, not just the quality scores (the sequence and length of dx:1 and dx:-1 reads are not the same); and (2) some longer reads are removed when donor reads (dx:-1) are filtered out.

In my previous experience, using all kinds of reads (dx:1, dx:0, dx:-1) for assembly gave me a more contiguous assembly. But is this more contiguous assembly correct? Or could it be a wrong (artificial) assembly caused by inflated read coverage in the regions where the simplex donor reads (dx:-1) and the duplex reads (dx:1) overlap?

Thank you!

hungweichen0327 commented 4 months ago

Can someone give me some suggestions? Thank you!

tijyojwad commented 4 months ago

Hi @hungweichen0327 we're checking this with our experts and will get back with a recommendation shortly.

biorover commented 4 months ago

Hello @hungweichen0327, sorry for the late reply. In short, this is a use case we haven't validated yet, but here's my best guess at a sound protocol given how assemblers work.

The main issue is that assemblers need corrected reads one way or another. For duplex reads, we suspect that fewer errors would be introduced by simpler correction algorithms like those built into Hifiasm and Verkko. However, there are also benefits to having higher coverage at correction time, so correcting the duplex and simplex reads separately (the duplex reads with the integrated Hifiasm/Verkko correction and the simplex reads with HERRO) may yield suboptimal results unless the coverage is very high. As such, the most robust route is probably to correct all simplex and duplex reads with HERRO and then proceed with assembly without further correction (or with only one round of correction for Hifiasm, as you cannot disable correction entirely in that program).

If you have very high coverages of both, then you could separately correct the simplex with HERRO and the duplex with the integrated read corrector.

Finally, when I say "simplex" here, I mean the reads with no follow-on (dx:i:0). The simplex-called paired reads (dx:i:-1) are fully redundant with the duplex reads (dx:i:1) and will increase run time while providing no additional information. These should be discarded before assembly.
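The filtering step above (discarding dx:i:-1 reads) can be sketched in plain Python, working on SAM-formatted text such as the output of `samtools view -h`. This is an illustration under stated assumptions, not part of dorado: the function names `dx_tag` and `drop_donor_reads` are hypothetical, and reads that carry no dx tag at all are kept here, which is an assumption of this sketch.

```python
from typing import Iterable, Iterator, Optional


def dx_tag(sam_line: str) -> Optional[int]:
    """Return the integer value of the dx:i tag on a SAM record, or None if absent."""
    fields = sam_line.rstrip("\n").split("\t")
    # Optional tags start after the 11 mandatory SAM columns.
    for field in fields[11:]:
        if field.startswith("dx:i:"):
            return int(field[len("dx:i:"):])
    return None


def drop_donor_reads(sam_lines: Iterable[str]) -> Iterator[str]:
    """Yield header lines and every record except simplex donor reads (dx:i:-1).

    Assumption of this sketch: records without a dx tag are kept.
    """
    for line in sam_lines:
        if line.startswith("@"):
            yield line  # header lines pass through unchanged
        elif dx_tag(line) != -1:
            yield line  # keeps dx:i:1 (duplex), dx:i:0 (simplex), and untagged reads
```

A possible use is to stream `samtools view -h calls.bam` through `drop_donor_reads` and convert the result back to BAM or FASTQ; the file name is a placeholder.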

hungweichen0327 commented 4 months ago

Hello @biorover, thank you for the clear reply. To summarize your suggestion: if I have high coverage of both simplex (dx:0) and duplex reads, I would correct them separately, that is, the duplex reads with the integrated Hifiasm/Verkko correction and the simplex (dx:0) reads with HERRO. If I do not have enough coverage of duplex or simplex reads, I would correct both together with HERRO. Once again, thank you for the help.

asan-emirsaleh commented 1 week ago

Hello dear community! I have a question regarding @tijyojwad's comment:

"You'll need to store the output of dorado basecaller as a fastq (with --emit-fastq option) and pass that to dorado correct."

Should the data I plan to pass to dorado correct contain comments, or can I use 'cleaned' FASTQ files obtained from .bam files with bioawk or samtools fastq? Does the metadata matter?

Best regards, Asan

HalfPhoton commented 1 week ago

@asan-emirsaleh,

Dorado correct input data does not need comments and can be 'clean'.

Kind regards, Rich

nanoporetech/dorado, issue #841