samtools / htslib

C library for high-throughput sequencing data formats

Improve mpileup overlap removal #1751

Closed jkbonfield closed 4 months ago

jkbonfield commented 4 months ago

The previous version didn't do overlap removal for deletions, meaning we'd get the same read twice where one copy had a deletion.

This is still true for more complex compound CIGAR strings, such as 2M1I5D1I5M in the second record. Possibly this can be improved further[*], but it's a pretty exotic case that is unlikely to cause problems, and the behaviour in this PR is the same there as it used to be (i.e. it falls back to the old method of just continuing to the next position).

Fixes samtools/samtools#1992

[*] 5D1I5M fails because we're on the "M" CIGAR operation and the previous operation is not "D". We could relax that check, but I wasn't confident this would work in all possible scenarios, e.g. backwards B skips or N ref-skips.
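
To make the footnote concrete, here is a minimal sketch of an "M directly preceded by D" test. It is not the code from this PR: `last_m_follows_d` is a hypothetical helper, it works on a textual CIGAR string rather than htslib's packed `uint32_t` CIGAR array, and it only inspects the final operation. It is meant only to show why 2M1I5D1I5M does not take the new path while a plain 2M5D5M would.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper (an assumption for illustration, not an htslib API).
 * Walks a textual CIGAR string such as "2M1I5D1I5M" and reports whether
 * its final operation is 'M' and the operation immediately before it is 'D'.
 *
 * Returns 1 if so, 0 if not, and -1 if the string is malformed
 * (an operation without a length, or a length without an operation).
 */
int last_m_follows_d(const char *cigar)
{
    assert(cigar != NULL);

    char prev = '\0';   /* operation before the current one */
    char cur  = '\0';   /* most recently seen operation */
    const char *p = cigar;

    while (*p) {
        /* Every operation must be preceded by a decimal length. */
        if (!isdigit((unsigned char)*p))
            return -1;
        while (isdigit((unsigned char)*p))
            p++;

        /* A length must be followed by an operation character. */
        if (*p == '\0')
            return -1;

        prev = cur;
        cur = *p++;
    }

    return cur == 'M' && prev == 'D';
}
```

Under these assumptions, `last_m_follows_d("2M5D5M")` returns 1, whereas `last_m_follows_d("2M1I5D1I5M")` returns 0 because the operation before the final M is an insertion. A relaxed check would need to look further back than one operation, which is where B and N operations could complicate things.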