waveygang / wfmash

base-accurate DNA sequence alignments using WFA and mashmap2
MIT License

Don't include endpoint in L2 range #220

Closed bkille closed 6 months ago

bkille commented 6 months ago

Segment mapping start positions are computed as the midpoint of the candidate mapping region. For small reference contigs (shorter than 2x the segment size), the minmer/seed window boundaries are not uniformly distributed; they bunch up near the boundaries of the index. As a result, the start position of a mapping that falls at the end of such a contig can be offset from its true position.

An example of this issue is shown in #218, where the contigs are all slightly shorter than 1 kbp. If you set the segment length to 900 bp and turn merging off (--no-merge) on the main branch, the first split [0, 900) maps correctly, but the [100, 1000) split maps to ~500. Shifting the candidate window one minmer back mitigates the problem.
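To make the midpoint effect concrete, here is a toy sketch. It is not wfmash's code: `midpoint_start`, `toy_boundaries`, and the boundary positions are all hypothetical, chosen only to mirror the [100, 1000) split described above. It assumes the start estimate is the midpoint between the first and last window boundary of the candidate range.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical window-boundary positions on a ~1 kbp contig. They are
// invented for illustration and bunch up near both ends of the contig.
inline std::vector<int64_t> toy_boundaries() {
    return {0, 30, 100, 900, 970, 1000};
}

// Midpoint of the candidate start range spanned by
// boundaries[first] .. boundaries[last] (both inclusive).
// Assumed helper for this sketch, not a wfmash function.
inline int64_t midpoint_start(const std::vector<int64_t>& boundaries,
                              std::size_t first, std::size_t last) {
    return boundaries[first] + (boundaries[last] - boundaries[first]) / 2;
}
```

With these made-up boundaries, including the far endpoint (`first = 2`, `last = 3`) gives a start of 500, while stepping the end of the range back by one boundary (`last = 2`) gives 100. The real index layout will differ; the point is only that a sparse interior makes the midpoint sensitive to whether the endpoint is included.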

Also, when a segment maps towards the end of a reference contig, we should truncate the mapping coordinates to the contig before applying the length mismatch filter.
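A minimal sketch of the truncation step, assuming half-open reference intervals. `RefInterval`, `truncate_to_contig`, and `ref_span` are hypothetical names for illustration and do not correspond to wfmash's actual types; the length mismatch filter itself is not shown, since its criteria are not described here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical half-open interval [start, end) on a reference contig.
struct RefInterval {
    int64_t start;
    int64_t end;
};

// Clamp the interval to [0, contig_len) so that it never extends past
// either end of the contig. Assumed helper, not a wfmash function.
inline RefInterval truncate_to_contig(RefInterval iv, int64_t contig_len) {
    iv.start = std::max<int64_t>(0, std::min<int64_t>(iv.start, contig_len));
    iv.end = std::max<int64_t>(iv.start, std::min<int64_t>(iv.end, contig_len));
    return iv;
}

// Reference length that a later length-based filter would compare.
inline int64_t ref_span(const RefInterval& iv) {
    return iv.end - iv.start;
}
```

For example, an interval [500, 1400) on a 1000 bp contig becomes [500, 1000) with a span of 500, so any length comparison that runs afterwards sees the truncated length and not the overhanging one.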