marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Increase of initial identity threshold #1691

Open snurk opened 4 years ago

snurk commented 4 years ago

Currently we consider overlaps with an error rate of up to 1%, which might be overkill given the identity distribution of the reads post-compression. Decreasing this error-rate threshold (i.e. raising the initial identity threshold) might be important for bubble/confusion detection, because we cannot easily impose stricter thresholds within individual procedures (for overlaps that are expected not to originate from the very same location/haplotype). Indeed, if a procedure uses a 0.5% threshold and therefore ignores a 7k overlap at 0.7%, we might miss the fact that it actually contained a 5k 'suboverlap' at 0.3%.
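To make the arithmetic in that example concrete, here is a minimal sketch. It assumes errors are simply counted as length times error rate, and the helper names (`expected_errors`, `flank_error_rate_pct`) are hypothetical and not part of canu. It shows that the 2k flank outside the 5k suboverlap would have to carry nearly all of the errors, which is why the whole 7k overlap fails a 0.5% cutoff even though most of it passes.

```python
def expected_errors(length_bp: int, error_rate_pct: float) -> float:
    """Number of mismatches implied by an overlap length and an error rate in percent."""
    return length_bp * error_rate_pct / 100.0


def flank_error_rate_pct(full_len: int, full_rate_pct: float,
                         sub_len: int, sub_rate_pct: float) -> float:
    """Error rate (percent) of the part of the overlap outside the suboverlap."""
    flank_len = full_len - sub_len
    flank_errors = (expected_errors(full_len, full_rate_pct)
                    - expected_errors(sub_len, sub_rate_pct))
    return 100.0 * flank_errors / flank_len


# Numbers taken from the comment above; the threshold is the per-procedure cutoff.
full_len, full_rate = 7000, 0.7   # whole overlap: 7k at 0.7%
sub_len, sub_rate = 5000, 0.3     # suboverlap: 5k at 0.3%
threshold = 0.5                   # stricter per-procedure threshold, in percent

flank_rate = flank_error_rate_pct(full_len, full_rate, sub_len, sub_rate)

full_passes = full_rate <= threshold   # whole overlap is rejected
sub_passes = sub_rate <= threshold     # suboverlap alone would be accepted

print(f"full overlap passes: {full_passes}")
print(f"suboverlap passes:   {sub_passes}")
print(f"flank error rate:    {flank_rate:.2f}%")
```

Under these assumptions the whole overlap carries about 49 errors and the suboverlap about 15, so the remaining 2k holds roughly 34 errors. A per-procedure filter that only sees the aggregate 0.7% would discard the good 5k stretch along with the noisy flank.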

snurk commented 4 years ago

In particular, imposing a stricter initial threshold can help filter out potential placements in bubble contig analysis.

snurk commented 4 years ago

This task includes reconsidering (and likely removing) all the 'stricter' thresholds that we might have experimentally introduced in individual procedures.

snurk commented 4 years ago

Related: a possible increase of minOvlLength to 1K.

snurk commented 4 years ago

Some overlaps do indeed seem to be low quality (due to microsatellite repeats). Options seem to be: