P CIGAR operator causes mpileup to stutter

jkbonfield commented 10 years ago

When given a BAM file using the P operator, mpileup can sometimes skip bases and duplicate others. Additionally there is no way to actually use the P operator properly, to know which bases align with which in an insertion region, but that is a different fault and a fundamental flaw in the output format.

An example:

@SQ SN:c1 LN:10 s0a 0 c1 1 0 10M * 0 0 AACCGCGGTT s0b 0 c1 1 0 10M * 0 0 AACCGCGGTT s0c 0 c1 1 0 10M * 0 0 AACCGCGGTT s1 0 c1 1 0 5M6I5M * 0 0 AACCGGTTAACCGGTT s2 0 c1 1 0 5M1P4I1P5M * 0 0 AACCGTTAACGGTT s3 0 c1 1 0 5M3I3P5M * 0 0 AACCGGTTCGGTT s4 0 c1 1 0 5M3P3I5M * 0 0 AACCGAACCGGTT s5 0 c1 1 0 4M1D2P2I2P1D4M * 0 0 AACCTAGGTT s6 0 c1 1 0 2M3D6I3D2M * 0 0 AAGTTAACTT *

jkb@seq3a[work/samtools...] ./samtools mpileup c1#pad1.bam [mpileup] 1 samples in 1 input files

Set max per-file depth to 8000 c1 1 N 9 ^!A^!A^!A^!A^!A^!A^!A^!A^!A ~~~~~~~~~ c1 2 N 9 AAAAAAAAA-3NNN ~~~~~~~~~ c1 3 N 9 CCCCCCCC\* ~~~~~~~~~ c1 4 N 9 CCCCCCCC-1N\* ~~~~~~~~~ c1 5 N 9 GGGG+6GTTAACG+4TTAAG+3GTTG+3AAC_+2AG_+6TTAACT ~~~~~~~~~ c1 6 N 9 CCCCCCC*\* ~~~~~~~~~ c1 7 N 9 GGGGGGGG\* ~~~~~~~~~ c1 8 N 9 GGGGGGGG\* ~~~~~~~~~ c1 9 N 9 TTTTTTTTT ~~~~~~~~~ c1 10 N 9 T$T$T$T$T$T$T$T$T$ ~~~~~~~~~ Note that position 5 should end _+2TA_+6GTTAAC Uuencoded SAM file to demonstrate the problem. (It's also c1#pad1.sam) ``` begin 644 - M0%-1"5-..F,Q"4Q..C$P"G,P80DP"6,Q"3$),`DQ,$T)*@DP"3`)04%#0T=# M1T=45`DJ"G,P8@DP"6,Q"3$),`DQ,$T)*@DP"3`)04%#0T=#1T=45`DJ"G,P M8PDP"6,Q"3$),`DQ,$T)*@DP"3`)04%#0T=#1T=45`DJ"G,Q"3`)8S$),0DP M"35--DDU30DJ"3`),`E!04-#1T=45$%!0T-'1U14"2H*

jkbonfield commented 10 years ago

Ugh I give up. uuencode broke too. Much hate for web 2.0! I've no idea how to attach raw data here so just email me if you need it, but this happens to already be a file in the test suite so no need for this time.

jmarshall commented 10 years ago

Putting a line containing ``` before and after puts a paragraph into <pre> mode. But with this file already in the repository, that's even better...

jmarshall commented 8 years ago

We may wish to see if more recent P-related PRs have an effect on this…

jkbonfield commented 8 years ago

It still fails. Mpileup and tview demonstrate the problem. Tview shows this:

1               11        21 
NNNNN******NNNNNNNNNNNNN
AACCG      CGGTT
AACCG******CGGTT
AACCG******CGGTT
AACCG******CGGTT
AACCGGTTAACCGGTT
AACCGTTAA**CGGTT
AACCGGTT***CGGTT
AACCGAAC***CGGTT
AACC*AG*****GGTT
AA***TTAACT***TT

That penultimate sequence was s5 with cigar 4M1D2P2I2P1D4M and seq AACCTAGGTT. The insertion TA has become AG instead. This also demonstrates that the P operator still isn't supported by samtools either (so the example in the original SAM paper is still not displayed correctly).

The last sequence is also displayed incorrectly. The insertion is GTTAAC not TTAACT. Basically it's off by one if the previous base was a deletion as, I think, it's incremented the sequence offset too early.

samtools / htslib

P CIGAR operator causes mpileup to stutter #57