nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

`dorado correct` discarding reads in repeats #851

Open diego-rt opened 4 months ago

diego-rt commented 4 months ago

Hi,

I wanted to run `dorado correct` on a set of reads spanning a complex satellite region. This worked fine, but after mapping I realised that it discarded all the reads in the most repetitive region. I assume this has to do with the way you do the all-versus-all alignments, which might involve discarding the most abundant minimisers. This is a big problem because these are obviously the most useful reads.

Thanks a ton!

[Image: Screenshot 2024-05-28 at 22 44 03]

tijyojwad commented 4 months ago

Hi @diego-rt - thanks for highlighting this. I think you're right, we will look into it.

colindaven commented 3 months ago

I think this is a problem for us despite the nice gains through dorado-correct.

What I see is hugely improved contig N50s (from 37 Mb to 68 Mb) for a plant genome of about 700 Mb when using dorado corrected reads.

That's great, but the total assembly size is typically around 670 Mb using flye or hifiasm with raw 10.4.1 ONT reads. With dorado corrected reads, we only see a total assembly size of about 640-644 Mb (so about 30 Mb or 5% less), indicating that repeat-rich reads and regions are probably missing.

bpanda-dev commented 3 months ago

To support @diego-rt's point: this issue is due to the all-vs-all (AvA) overlap step from minimap2, which discards reads coming from long repetitive regions. To check this, we mapped the reads present in the AvA minimap2 output to the reference genome and analysed the regions with low coverage. We find that the output does not contain the reads from repeat regions, especially the centromeric satellite regions.

Note: we used HERRO instead of dorado correct, since it has separate scripts for running the three steps (preprocessing, AvA, HERRO inference) of the HERRO correction pipeline.
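A first step of the check described above could be sketched roughly as follows: list the reads that never appear in the all-vs-all overlap output, so they can then be mapped to the reference and inspected. This is only a minimal sketch, not the script used in this comment. It assumes the AvA output is in PAF format (query name in column 1, target name in column 6); the function names and the toy records are hypothetical.

```python
def reads_in_paf(paf_lines):
    """Collect every read name that appears as query or target in PAF records."""
    seen = set()
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines (PAF has 12 mandatory columns)
        seen.add(fields[0])  # column 1: query sequence name
        seen.add(fields[5])  # column 6: target sequence name
    return seen


def reads_without_overlaps(all_read_names, paf_lines):
    """Return the reads that have no overlap record at all, sorted by name."""
    return sorted(set(all_read_names) - reads_in_paf(paf_lines))


# Toy data (hypothetical): one overlap between read1 and read2, none for sat_read.
toy_paf = [
    "read1\t1000\t0\t900\t+\tread2\t1200\t100\t1000\t850\t900\t60\n",
]
toy_reads = ["read1", "read2", "sat_read"]

missing = reads_without_overlaps(toy_reads, toy_paf)
print(missing)
```

On real data, `paf_lines` could be an open file handle over the overlap file and `all_read_names` the read IDs from the input FASTQ. Reads reported here are candidates for having been dropped at the overlap stage.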

Below, we compare the read mapping coverage across the HG002 chromosome 19 MATERNAL reference for the raw read set and the HERRO-corrected read set, respectively.

[Figure: chr19_raw sorted_bam2plot_chr19_MATERNAL (raw reads)]

[Figure: chr19_herro sorted_bam2plot_chr19_MATERNAL (HERRO-corrected reads)]

Regards, Bikram Panda

bdrosen commented 3 months ago

Hi @tijyojwad, is this fixed in the v0.7.2 release or was that a separate alignment issue mentioned in the notes? "https://github.com/nanoporetech/dorado/commit/3b51c1b3c694453d7da04ea91030d7e98b4e9681 - Fix sub-par alignments in dorado correct" Thanks!

ekg commented 2 months ago

We will explore applying wfmash to this. It should behave differently in repeats.

HalfPhoton commented 2 weeks ago

This issue should have been resolved in dorado 0.7.3, and there have also been improvements to the tool's general stability in the newly released dorado 0.8.0.

Closing this issue as resolved but please re-open or create a new issue if it has not been properly addressed.

Kind regards, Rich

diego-rt commented 1 week ago

Hi @HalfPhoton @tijyojwad

I don't think it has been fixed yet. I'm using dorado 0.7.3 and I can confirm it's still a problem.

If you are referring to the fix "Remove limit on number of overlaps considered during all-vs-all alignment in dorado correct" introduced in dorado 0.7.3, I don't think that this addresses the underlying issue.

The problem is that minimap2 discards high-frequency minimisers (see option `-f` in the minimap2 manual). If you want to be able to correct reads in satellites, you are going to have to use an approach like Winnowmap, a minimap2 fork that does not discard high-frequency minimisers. What @ekg proposed also sounds like a great idea.
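To illustrate why an occurrence filter on minimisers can leave satellite reads without any seeds, here is a toy sketch. It is not minimap2's implementation: it orders k-mers lexicographically rather than by hash, counts occurrences per read, and uses a hypothetical `max_occ` cutoff in place of the `-f` threshold. All sequences and names are made up for the example.

```python
from collections import Counter


def minimizers(seq, k=5, w=4):
    """Return the set of (w, k)-minimizers of seq (lexicographically smallest k-mer per window)."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        picked.add(min(kmers[start:start + w]))
    return picked


def seeds_after_frequency_filter(reads, max_occ, k=5, w=4):
    """Drop minimizers present in more than max_occ reads; return the surviving seeds per read."""
    per_read = {name: minimizers(seq, k, w) for name, seq in reads.items()}
    occ = Counter(m for mins in per_read.values() for m in mins)
    return {
        name: {m for m in mins if occ[m] <= max_occ}
        for name, mins in per_read.items()
    }


# Toy data (hypothetical): three reads from a 5 bp tandem repeat in different phases,
# and two overlapping reads from a non-repetitive sequence.
genome = "GATCCGATTGCAAGGCTTAGCCATGGTCAATGC"
reads = {
    "sat1": "ACGTT" * 8,
    "sat2": "CGTTA" * 8,
    "sat3": "GTTAC" * 8,
    "uniq1": genome[0:24],
    "uniq2": genome[8:32],
}

surviving = seeds_after_frequency_filter(reads, max_occ=2)
for name, seeds in surviving.items():
    print(name, len(seeds))
```

In this toy case the satellite reads share the same few minimisers, so every one of them exceeds the cutoff and the reads end up with no seeds, while the two unique reads keep the seeds they have in common and can still be overlapped. An approach that down-weights frequent k-mers instead of discarding them would keep some seeds in the repeat.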