samtools / htslib

C library for high-throughput sequencing data formats
Other
789 stars 447 forks source link

Merges neighbouring I and D ops into one op within pileup. #1552

Closed jkbonfield closed 1 year ago

jkbonfield commented 1 year ago

This means 4M1D1D1D3M is reported as 4M3D3M instead as far as mpileup is concerned (although samtools view will report as-is) and importantly p->indel=-3 for the first 1D and the 2nd 1D has p->indel=0 (with p->is_del=1, the same as it would for the 2nd base in a 3D cigar op).

Previously samtools mpileup would produce incorrect looking output for the 1D1D scenario. Fixing this in sam.c means not only is samtools mpileup now looking better, but any tool using the mpileup API will be getting consistent results.

Note that samtools mpileup already resolved the ...1I1I1I... case, but it did this within the samools bam_plcmd.c code itself. Hence while the pileup API works, it left p->indel=1 instead of p->indel=3 for this situation. So we also resolve that in a similar fashion.

Note 2P1I1I is reported as p->indel=2 (a 2bp indel) even though bam_plp_insertion would return e.g. +4**AC, as we're reporting the number of bases inserted in this sequence rather than the padded alignment size.

Fixes samtools/samtools#139, or at least the remaining part of the puzzle. Most had previously been fixed already back in 2014.