samtools / samtools

Tools (written in C using htslib) for manipulating next-generation sequencing data
http://htslib.org/
Other
1.61k stars 576 forks source link

Unexpected interaction of samtools mpileup -r option with read-pair overlap detection #986

Open wm75 opened 5 years ago

wm75 commented 5 years ago

I've observed in several cases that changing the region (-r option of smatools mpileup) in which a pileup is generated affects the depth reported at a fixed position. One example from real data:

$samtools mpileup -B -r chr1:248567537-248567541 -f hg19.fa mapped_data.bam

[mpileup] 1 samples in 1 input files
chr1    248567537   G   61  ..,,,..,,...,,,,.,,,..,........,,,,...,,,,,,.,-2ac,....,,,.,,.,,.   ?moJ=dmirlUmJssHsrssqqqssqrmoi@is?F?mHICIFjFmHFkJiIJDFJIFIDFC
chr1    248567538   A   62  ..,,,..,,...,,$,$,,.,,,..,......$.$.,,$,$,...,,,,,,.*,....,,,.,,.,,.    ?mqI=fmgrk`DJqIsIssrsqrqsroqmliBis9HCmIHCJEjFoHFlIgJJEFJICEEFG
chr1    248567539   C   58  ..g,,..,g...,ggGg,g..g.....,$g$,G..,g,,,,,$G*,G..G.g,,Gg,.,,.   AmqIEim>ri`DIsJssspsspIJop@iFG>JFID@JFjFoHFiIJcJJFFJJACEFG
chr1    248567540   A   57  ..,,,..,,...,,,.,,,..,......,.....,,,,$,.,,.....,,,.,,.,,.  @mnJEamhkmWDJsJpsrosroJJlrB;H3jDJGIBJFglHFiJJ_JJFHIJ?GEFG
chr1    248567541   G   54  .,,,..,$,...,,,.,,,..,.....,.....,,,,.,,.....,,,.,,.,,. moIEflimmDDJsJprrqsroJJorDH<iDJHJDJklHFfJIaJJFHJI>;EFG

now increase the width of the region by one base on each side:

$samtools mpileup -B -r chr1:248567536-248567542 -f hg19.fa mapped_data.bam

[mpileup] 1 samples in 1 input files
chr1    248567536   A   63  ..,,,..,,...,,,,..$,,,..,........,,,,....,,,,,,,$.,,...,,,.,,.,,.   7mmJ8gmjrj]mJssIsCHssrrqrsqsmnh@is?F@mJIJCIFjFooHFmJhJEFJJ<JDEI
chr1    248567537   G   62  ..,,,..,,...,,,,.,,,..,........,,,,....,,,,,,.,-2ac,....,,,.,,.,,.  ?moJ=dmirlUmJssHsIssqqqssqrmoi@is?F?mJHICIFjFmHFkJiIJDFJIFIDFC
chr1    248567538   A   63  ..,,,..,,...,,$,$,,.,,,..,......$.$.,,$,$,....,,,,,,.*,....,,,.,,.,,.   ?mqI=fmgrk`DJqIsIsJrsqrqsroqmliBis9HCmJIHCJEjFoHFlIgJJEFJICEEFG
chr1    248567539   C   59  ..g,,..,g...,ggGg,g..g.....,$g$,G.G.,g,,,,,$G*,G..G.g,,Gg,.,,.  AmqIEim>ri`DIsJsJspsspIJop@iFG>JJFID@JFjFoHFiIJcJJFFJJACEFG
chr1    248567540   A   58  ..,,,..,,...,,,.,,,..,......,......,,,,$,.,,.....,,,.,,.,,. @mnJEamhkmWDJsJpJrosroJJlrB;H3jDJJGIBJFglHFiJJ_JJFHIJ?GEFG
chr1    248567541   G   55  .,,,..,$,...,,,.,,,..,.....,......,,,,.,,.....,,,.,,.,,.    moIEflimmDDJsJpIrqsroJJorDH<iDJJHJDJklHFfJIaJJFHJI>;EFG
chr1    248567542   C   58  ..$,,,,.$.,,...,,,.,,,..,.....,......,,,,.,,......,,,.,,.,,.    =@JnJG<mDpmDDJsJqJqqssoJJnpDHAiDJJ<I9JioFHAiIJgJJFHJI>FEFH

and observe how an additional read gets included in the pileup at positions 37-41.

Playing around with the different samtools mpileup options to see what's causing these discrepancies I found that they disappear when you disable read-pair overlap detection with the -x option, e.g. same calls as above, but with overlap detection disabled (this particular dataset has lots of overlaps):

$samtools mpileup -B -x -r chr1:248567537-248567541 -f hg19.fa mapped_data.bam

[mpileup] 1 samples in 1 input files
chr1    248567537   G   93  ..,,,,..,,,,...,,,,,,.,.,,......,........,,,,,,,,,,......,,,,,,,,,..,,-2ac,,.......,,,,,,,.,,.,,.   ?DJHJ=DDFAJJDDDIJJJJHJIIJJJJJHJHJJJJJDJD@HIJFJF?FJF?FDJJHIC2HIFHFFGEJHFJDCJJCHIJGJIDGFJIFIDFC
chr1    248567538   A   93  ..,,,,..,,,,...,,,$,$,,.,.,,......,......$.$.,,,$,,$,,$,,,$......,,,,,,,,,..,$*,,$.......,,,,,,,.,,.,,. ?DJHI=CDDDJJDDDHJIIJIJJIIJJJJJJJJJJJJDIDBFHJFJF9HIDCFDJJIHC=HJEHFFIEJHFIDDIIDHJJGJIEDFJICEEFG
chr1    248567539   C   82  ..,g,,..,,g,...,,ggGgG,g.G.GG.g.....,,,$g$,,GG..G.,g,,,,,,,$G.*,GG..GG.gggg,g,Gg,.,,.   ADJHIEDDFFJJDDDFIJJJJIJJJJJJGJJIJJJ@FGFFGJ>EDJJFID=HJFJFFGEHFDCIJCGJJIJGFAFJJACEFG
chr1    248567540   A   80  ..,,,,..,,,,...,,,,.,.,,......,......,,,,......,,,$,,,$,,..,,.......,,,,,,,.,,.,,.  @DJFJE<DFFIJDDDJJJJIJCJJJJIIFIJJJIIB;DJHJ3EDJJGIB4FJFJFEBHFCAJJ9FJJHHIFGHIJ?GEFG
chr1    248567541   G   76  .,,,,..,$,,,...,,,,.,.,,......,.....,,,,......,,,,,,..,,.......,,,,,,,.,,.,,.   DJFIECDFDGIDDDJJJJJIGJJJJIJHIJJJJIDFJHJ<DDJJHJDFJJHHDHFDBJI@FJJEGEFBHJI>;EFG

and $samtools mpileup -B -x -r chr1:248567536-248567542 -f hg19.fa mapped_data.bam

[mpileup] 1 samples in 1 input files
chr1    248567536   A   95  ..,,,,..,,,,...,,,,,,..$,.,,......,........,,,,,,,,,,......,,,,,,,,,,$..,,,,.......,,,,,,,.,,.,,.   7DJHJ8DDHDJJDDDGJJJJIJCHIJJIJJFJIJIJJJDIC@HJJFJF?FJF@FDJJIJC:HIFJFFFGEJHFJCDJJAHJJIJJEHFJJ<JDEI
chr1    248567537   G   93  ..,,,,..,,,,...,,,,,,.,.,,......,........,,,,,,,,,,......,,,,,,,,,..,,-2ac,,.......,,,,,,,.,,.,,.   ?DJHJ=DDFAJJDDDIJJJJHJIIJJJJJHJHJJJJJDJD@HIJFJF?FJF?FDJJHIC2HIFHFFGEJHFJDCJJCHIJGJIDGFJIFIDFC
chr1    248567538   A   93  ..,,,,..,,,,...,,,$,$,,.,.,,......,......$.$.,,,$,,$,,$,,,$......,,,,,,,,,..,$*,,$.......,,,,,,,.,,.,,. ?DJHI=CDDDJJDDDHJIIJIJJIIJJJJJJJJJJJJDIDBFHJFJF9HIDCFDJJIHC=HJEHFFIEJHFIDDIIDHJJGJIEDFJICEEFG
chr1    248567539   C   82  ..,g,,..,,g,...,,ggGgG,g.G.GG.g.....,,,$g$,,GG..G.,g,,,,,,,$G.*,GG..GG.gggg,g,Gg,.,,.   ADJHIEDDFFJJDDDFIJJJJIJJJJJJGJJIJJJ@FGFFGJ>EDJJFID=HJFJFFGEHFDCIJCGJJIJGFAFJJACEFG
chr1    248567540   A   80  ..,,,,..,,,,...,,,,.,.,,......,......,,,,......,,,$,,,$,,..,,.......,,,,,,,.,,.,,.  @DJFJE<DFFIJDDDJJJJIJCJJJJIIFIJJJIIB;DJHJ3EDJJGIB4FJFJFEBHFCAJJ9FJJHHIFGHIJ?GEFG
chr1    248567541   G   76  .,,,,..,$,,,...,,,,.,.,,......,.....,,,,......,,,,,,..,,.......,,,,,,,.,,.,,.   DJFIECDFDGIDDDJJJJJIGJJJJIJHIJJJJIDFJHJ<DDJJHJDFJJHHDHFDBJI@FJJEGEFBHJI>;EFG
chr1    248567542   C   76  ..$,,,,.$.,,,...,,,,.,.,,......,.....,,,,......,,,,,,..,,.......,,,,,,,.,,.,,.  =@JFJG<DDJJDDDJJJJJJGIIJJIIIJJJJIHDFIHJADDJJ<I9FJJHGBFHABIJAFJJIHHFGHJI>FEFH

My interpretation of this behavior (not having studied the actual code) is that the overlap detection does not rely on the entire read-pairs, but somehow only on the part of them, which falls inside the requested region?

So first of all, I would like to understand what exactly the interaction between the -r and -x options is. Secondly, at least intuitively, this looks like a bug in samtools mpileup to me (after all, you wouldn't expect the depth reported for any particular base to be dependent on the window through which you look at it).

valeriuo commented 5 years ago

@wm75 I am having difficulties reproducing this behaviour. Could you provide a relevant anonymized subsample of your data?

wm75 commented 5 years ago

Sure, no problem. The data in mapped_data.zip is the relevant (for producing the reported output) snippet (chr1:248567000-248568000) from publicly available test data from the Texas Cancer Research Biobank (sample TCRBOA6-N). There is a rather extreme extent of read overlaps in this dataset so possibly it's only then that the issue becomes noticeable. Thanks for looking into this and let me know if there's anything else I can do to help you.

valeriuo commented 5 years ago

@wm75 thank you for the data! We would have struggled long and hard to reproduce the issue without it.

First, when you specify a mpileup window (with -r), reads that do not overlap it are discarded from the mpileup computation, which should not affect the output, as those reads had no influence over the selected region. Second, by default, mpileup discards the overlapping segment of pair-end reads. You can suppress this behaviour using -x flag.

Now, focusing on 3 particular reads from this BAM file:

HWI-ST896:350:D23CTACXX:7:2113:14780:50737      2115    chr1    248567484       7       13M6D34M        =       248567503       19
HWI-ST896:350:D23CTACXX:7:2113:14780:50737      147     chr1    248567484       22      25S13M6D44M16S  =       248567503       -44
HWI-ST896:350:D23CTACXX:7:2113:14780:50737      99      chr1    248567503       60      3S44M51S        =       248567484       44

we can see the first is a supplementary read (flag 2048 set), which also happens to end at coordinate 248567536. The other two are pair-end reads.

When you specified the mpileup region as chr1:248567536-248567542, the first read was included in the computation and also in the overlap removal logic, because it was seen as part of the same template as the other two reads (they all share the same template name). This was affecting how reads were paired and which overlap segment was being removed. With the -x flag on, the overlap removal was not being performed, so the supplementary read was not affecting the results.

On the other hand, when the selected region was chr1:248567537-248567541, the first read was discarded from the beginning, and the overlap removal algorithm only used the pair-end reads (2 and 3), as it was designed. A fix will follow soon.

wm75 commented 5 years ago

Thanks for the investigation and the detailed explanation @valeriuo ! Glad that I could help.