Caution about herro/dorado correct tossing out repetitive reads prior to Shasta assembly

jelber2 commented 1 week ago

Since you guys have run Herro or perhaps dorado correct, I thought this GitHub issue might be of interest: https://github.com/nanoporetech/dorado/issues/851. It reports that during the minimap2 all-versus-all overlapping of raw Nanopore reads, reads with repetitive minimizers are likely "tossed out" and not included in the downstream correction steps.
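One way to check whether this affects a given data set is to compare read IDs before and after correction. The following is only a minimal sketch: it assumes raw reads in unwrapped 4-line FASTQ and corrected reads in FASTA, and the function names and the `base_id` hook are hypothetical. If the correction tool renames or splits reads, pass a `base_id` function that maps an output name back to the raw read ID.

```python
def fasta_ids(lines):
    """Collect read IDs (first token after '>') from FASTA lines."""
    return {line[1:].split()[0] for line in lines if line.startswith(">")}


def fastq_ids(lines):
    """Collect read IDs from FASTQ lines, assuming unwrapped 4-line records."""
    return {
        line[1:].split()[0]
        for i, line in enumerate(lines)
        if i % 4 == 0 and line.strip()
    }


def dropped_reads(raw_ids, corrected_ids, base_id=lambda name: name):
    """Return raw read IDs with no counterpart among the corrected reads.

    base_id is a hypothetical hook: identity by default; replace it if the
    corrected read names differ from the raw ones.
    """
    return set(raw_ids) - {base_id(name) for name in corrected_ids}


# Tiny in-memory example; with real files, pass open file handles instead.
raw = fastq_ids(["@r1 extra\n", "ACGT\n", "+\n", "IIII\n",
                 "@r2\n", "ACGT\n", "+\n", "IIII\n"])
corrected = fasta_ids([">r1\n", "ACGT\n"])
print(sorted(dropped_reads(raw, corrected)))  # ['r2']
```

Reads reported here are only candidates: a read can be absent from the output for reasons other than minimizer filtering (for example length or quality cutoffs).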

Feel free to close or leave as a discussion

kokyriakidis commented 1 week ago

Hi @jelber2!

We are aware of this issue. However, that's not the only problem. We have seen that HERRO overcorrects not just in centromeres or other hard regions. It occasionally will even "correct" reads and move them across haplotypes. This could have disastrous consequences on phasing and detangling.

jelber2 commented 6 days ago

@kokyriakidis Yes, I have seen evidence of HERRO causing reads to switch phases/haplotypes. Qualitatively, it does not seem extreme, but it does happen. Could you expand on what the disastrous consequences could be for phasing and detangling? For example, might one see false SNVs, indels, or structural variants show up in the detangled bubbles, such that a phased block in one haplotype contains these false variants?

See page 45 from https://github.com/jelber2/hapmers/blob/main/hapmers-presen.pdf regarding evidence of phase switching from Herro-corrected reads. That is an older presentation, and I have learned more things from those data sets and summarized them in a manuscript if you are interested and maybe would not mind being cited as a personal communication.

kokyriakidis commented 6 days ago

@jelber2 There are several problems:

Correcting reads that belong to different haplotypes of the same region and moving them to the other haplotype. This leads to loss of the true signal (e.g. certain SNPs) that would enable us to detangle/phase that region in the graph better. Basically, this could lead to collapsed haplotypes in the graph.
Correcting together reads that do not all come from the same region of the genome. For example, many reads from hard regions (repetitive etc.) might map to the wrong location with minimap2, and HERRO will then overcorrect them all together, leading to coverage drops in other locations and losing the true signal (e.g. certain SNPs) that would enable us to detangle/phase that region in the graph better.
Due to the masking/filtering of high-frequency minimizers on the minimap2 side, we might have coverage drops caused by missing alignments (as mentioned in the GitHub issue).
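The last point can be illustrated with a toy model. This is a sketch only, not minimap2's actual implementation: it uses lexicographic (unhashed) minimizers, counts occurrences per read rather than per position, and the function names and the `max_occ` threshold are made up for illustration. It shows how discarding minimizers that occur too often can leave reads made entirely of repeats with no seeds, and therefore no overlaps.

```python
from collections import Counter


def minimizers(seq, k=5, w=4):
    """Set of (w, k)-minimizers: the smallest k-mer in each window of w k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return {min(kmers[i:i + w]) for i in range(len(kmers) - w + 1)}


def surviving_seeds(reads, max_occ, k=5, w=4):
    """Drop minimizers seen in more than max_occ reads; return seeds left per read."""
    per_read = {name: minimizers(seq, k, w) for name, seq in reads.items()}
    occ = Counter(m for mins in per_read.values() for m in mins)
    return {
        name: {m for m in mins if occ[m] <= max_occ}
        for name, mins in per_read.items()
    }


reads = {
    "u1": "GATTACAGGCTTAGCCGTA",   # unique sequence
    "u2": "AGGCTTAGCCGTATTGCAC",   # overlaps u1
    "r1": "ACACACACACACACACAC",    # pure dinucleotide repeat
    "r2": "CACACACACACACACACA",
    "r3": "ACACACACACACACACAC",
}
seeds = surviving_seeds(reads, max_occ=2)
print(sorted(name for name, s in seeds.items() if not s))  # ['r1', 'r2', 'r3']
```

In this toy data the two unique reads keep their shared seeds, while all three repeat reads lose their only minimizer and would never be overlapped, which is the kind of coverage drop described above.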

In conclusion, and based on our extensive analysis, HERRO definitely overcorrects in hard regions.

It is totally fine to be cited as a personal communication :)

jelber2 commented 4 days ago

Thank you very much!

paoloshasta / shasta

Caution about herro/dorado correct tossing out repetitive reads prior to Shasta assembly #38