When given a BAM file using the P operator, mpileup can sometimes skip bases and duplicate others. Additionally there is no way to actually use the P operator properly, to know which bases align with which in an insertion region, but that is a different fault and a fundamental flaw in the output format.
Set max per-file depth to 8000
c1 1 N 9 ^!A^!A^!A^!A^!A^!A^!A^!A^!A ~~~~~~~~~
c1 2 N 9 AAAAAAAAA-3NNN ~~~~~~~~~
c1 3 N 9 CCCCCCCC\* ~~~~~~~~~
c1 4 N 9 CCCCCCCC-1N\* ~~~~~~~~~
c1 5 N 9 GGGG+6GTTAACG+4TTAAG+3GTTG+3AAC_+2AG_+6TTAACT ~~~~~~~~~
c1 6 N 9 CCCCCCC*\* ~~~~~~~~~
c1 7 N 9 GGGGGGGG\* ~~~~~~~~~
c1 8 N 9 GGGGGGGG\* ~~~~~~~~~
c1 9 N 9 TTTTTTTTT ~~~~~~~~~
c1 10 N 9 T$T$T$T$T$T$T$T$T$ ~~~~~~~~~
Note that position 5 should end _+2TA_+6GTTAAC
Uuencoded SAM file to demonstrate the problem. (It's also c1#pad1.sam)
```
begin 644 -
M0%-1"5-..F,Q"4Q..C$P"G,P80DP"6,Q"3$),`DQ,$T)*@DP"3`)04%#0T=#
M1T=45`DJ"G,P8@DP"6,Q"3$),`DQ,$T)*@DP"3`)04%#0T=#1T=45`DJ"G,P
M8PDP"6,Q"3$),`DQ,$T)*@DP"3`)04%#0T=#1T=45`DJ"G,Q"3`)8S$),0DP
M"35--DDU30DJ"3`),`E!04-#1T=45$%!0T-'1U14"2H*
Ugh I give up. uuencode broke too. Much hate for web 2.0! I've no idea how to attach raw data here so just email me if you need it, but this happens to already be a file in the test suite so no need for this time.
That penultimate sequence was s5 with cigar 4M1D2P2I2P1D4M and seq AACCTAGGTT. The insertion TA has become AG instead. This also demonstrates that the P operator still isn't supported by samtools either (so the example in the original SAM paper is still not displayed correctly).
The last sequence is also displayed incorrectly. The insertion is GTTAAC not TTAACT. Basically it's off by one if the previous base was a deletion as, I think, it's incremented the sequence offset too early.
When given a BAM file using the P operator, mpileup can sometimes skip bases and duplicate others. Additionally there is no way to actually use the P operator properly, to know which bases align with which in an insertion region, but that is a different fault and a fundamental flaw in the output format.
An example:
@SQ SN:c1 LN:10 s0a 0 c1 1 0 10M * 0 0 AACCGCGGTT s0b 0 c1 1 0 10M * 0 0 AACCGCGGTT s0c 0 c1 1 0 10M * 0 0 AACCGCGGTT s1 0 c1 1 0 5M6I5M * 0 0 AACCGGTTAACCGGTT s2 0 c1 1 0 5M1P4I1P5M * 0 0 AACCGTTAACGGTT s3 0 c1 1 0 5M3I3P5M * 0 0 AACCGGTTCGGTT s4 0 c1 1 0 5M3P3I5M * 0 0 AACCGAACCGGTT s5 0 c1 1 0 4M1D2P2I2P1D4M * 0 0 AACCTAGGTT s6 0 c1 1 0 2M3D6I3D2M * 0 0 AAGTTAACTT *
jkb@seq3a[work/samtools...] ./samtools mpileup c1#pad1.bam [mpileup] 1 samples in 1 input files
Ugh I give up. uuencode broke too. Much hate for web 2.0! I've no idea how to attach raw data here so just email me if you need it, but this happens to already be a file in the test suite so no need for this time.
Putting a line containing ``` before and after puts a paragraph into
<pre>
mode. But with this file already in the repository, that's even better...We may wish to see if more recent
P
-related PRs have an effect on this…It still fails. Mpileup and tview demonstrate the problem. Tview shows this:
That penultimate sequence was s5 with cigar 4M1D2P2I2P1D4M and seq AACCTAGGTT. The insertion TA has become AG instead. This also demonstrates that the P operator still isn't supported by samtools either (so the example in the original SAM paper is still not displayed correctly).
The last sequence is also displayed incorrectly. The insertion is GTTAAC not TTAACT. Basically it's off by one if the previous base was a deletion as, I think, it's incremented the sequence offset too early.