Closed joegeorgeson closed 3 months ago
As is so often the case with mpileup problems, the answer typically lies in disabling BAQ!
$ samtools mpileup -f example.fa example.bam|awk '{print length($NF)}'|head
[mpileup] 1 samples in 1 input files
26
2024
6177
6818
7489
7656
7669
7639
7662
7686
$ samtools mpileup -B -f example.fa example.bam|awk '{print length($NF)}'|head
[mpileup] 1 samples in 1 input files
7663
7698
7664
7643
7561
7668
7674
7643
7666
7686
It's setting the quality value for the first base to 0 for many sequences. You could also use -Q0 to show all data irrespective of base quality.
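One way to see this effect directly is to count, per position, how many characters in the quality column fall below a threshold. The following is a minimal sketch and not part of samtools: `count_lowq` is a hypothetical helper name, it assumes Phred+33 encoding, and it assumes the quality string is the last field, as in the one-liners above.

```shell
# Hypothetical helper (not a samtools command).
# Reads pileup lines on stdin and prints: position, number of quality
# characters, and how many of them decode to a value below MINQ.
# Assumes Phred+33 qualities in the last column.
count_lowq() {
    awk -v minq="${1:-13}" '
    BEGIN {
        # Build a character -> ASCII code lookup for printable characters.
        for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i
    }
    {
        q = $NF
        low = 0
        for (i = 1; i <= length(q); i++)
            if (ord[substr(q, i, 1)] - 33 < minq) low++
        print $2, length(q), low
    }'
}
```

For example, `samtools mpileup -Q0 -f example.fa example.bam | count_lowq 13 | head` would report how many of the listed bases sit below Q13 when nothing is filtered. Whether the printed qualities reflect the BAQ-adjusted values is worth checking on your samtools version.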
Edit: also using --max-depth 99999 to remove the 8000 depth cap, and disabling overlap removal, gives you the numbers you expect:
$ samtools mpileup -d 99999 -x -Q0 -f example.fa example.bam|awk '{print length($NF)}'|head
[mpileup] 1 samples in 1 input files
17928
18045
18270
18468
18712
18817
19103
19712
20025
20227
That's over-counting as it's counting reads rather than templates, but so was the naive count in your screenshot above.
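If a per-template count is wanted instead, one possible follow-on step is to count distinct read names at each position. This is only a sketch: `count_templates` is a hypothetical helper, and it assumes the pileup was produced with `--output-QNAME` so that the last column is a comma-separated list of read names, with both mates of a pair sharing one name.

```shell
# Hypothetical helper (not a samtools command).
# Expects pileup lines whose last column is a comma-separated list of
# read names, and prints: position, reads listed, distinct read names.
count_templates() {
    awk '{
        if ($4 == 0) { print $2, 0, 0; next }   # no coverage at this position
        n = split($NF, names, ",")
        split("", seen)                         # portable way to empty an array
        uniq = 0
        for (i = 1; i <= n; i++)
            if (!(names[i] in seen)) { seen[names[i]] = 1; uniq++ }
        print $2, n, uniq
    }'
}
```

A possible invocation would be `samtools mpileup -d 99999 -x -Q0 --output-QNAME -f example.fa example.bam | count_templates | head`; the third column then approximates templates rather than reads.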
Thank you, that clarifies most things. So one difference between using -B and -Q0 is also overlap removal? If I use the below, are the bases lost due to overlap removal, the Q13 quality threshold, or both?
Is there a combination of flags where read starts are counted, overlaps get removed and bases excluded based on quality? Or does that need to happen in follow-on processing?
$ samtools mpileup -d 99999 -x -B -f example.fa example.bam | awk '{print length($NF)}' | head
[mpileup] 1 samples in 1 input files
17622
17852
18028
18191
18210
18575
18868
19363
19736
20019
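The thread does not name a flag combination that counts read starts, so this would likely fall to follow-on processing. As a rough sketch of what that could look like: in the pileup bases column a `^` followed by one mapping-quality character marks the start of a read segment, so those markers can be counted per position. `count_read_starts` is a hypothetical helper, and it assumes the standard six-column layout with the bases in column 5.

```shell
# Hypothetical helper (not a samtools command).
# Prints: position, number of read-start markers in the bases column.
count_read_starts() {
    awk '{
        s = $5; len = length(s); n = 0; i = 1
        while (i <= len) {
            c = substr(s, i, 1)
            if (c == "^") {
                n++
                i += 2          # skip "^" and the mapping-quality character
            } else if (c == "+" || c == "-") {
                # Indel: sign, length digits, then that many inserted/deleted bases.
                i++
                num = ""
                while (substr(s, i, 1) ~ /[0-9]/) { num = num substr(s, i, 1); i++ }
                i += num + 0
            } else {
                i++
            }
        }
        print $2, n
    }'
}
```

Skipping the character after `^` matters because that mapping-quality character can itself be `^`. Which quality and overlap options to combine with this (for instance leaving overlap removal on and keeping the default -Q13) is a choice for the upstream mpileup command, and whether start markers survive when the first base is filtered is something to verify on real data.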
The -x option disables the overlap removal (so read pairs with overlapping bases from READ1 and READ2 are counted twice). BAQ changes quality values, but the minimum quality threshold is still high unless you lower it with -Q. The problem was that BAQ was setting a lot of bases to quality zero, so the default -Q13 removed a lot of data. Even with -x and -B you'll still have some bases that are low quality, and without specifying -Q0 you won't get them included.
(If I recall correctly, overlap removal just sets the overlapping bases to Q0, so -Q0 essentially already implies -x.)
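To put numbers on this, the two runs in this thread that both use -d 99999 -x can be compared position by position. If the explanation here holds, the gap between the -Q0 run and the -B run (default -Q13) is the number of bases whose original quality is below 13. The sketch below uses only the depths printed earlier; `depth_diff` is a hypothetical helper name.

```shell
# Depths copied from the two runs shown in this thread (first 10 positions).
q0_run="17928 18045 18270 18468 18712 18817 19103 19712 20025 20227"   # -d 99999 -x -Q0
b_run="17622 17852 18028 18191 18210 18575 18868 19363 19736 20019"    # -d 99999 -x -B

# Hypothetical helper: element-wise difference of two space-separated lists
# of equal length.
depth_diff() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = split(a, x, " ")
        split(b, y, " ")
        for (i = 1; i <= n; i++)
            printf "%d%s", x[i] - y[i], (i < n ? " " : "\n")
    }'
}

depth_diff "$q0_run" "$b_run"
# prints: 306 193 242 277 502 242 235 349 289 208
```

So at the first position 306 of the 17,928 bases are dropped by the quality threshold alone, since overlap removal is disabled in both runs.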
Thank you again, that clears it up! Closing :)
Edit: Actually wanted to add my 2 cents here... I think BAQ (which -B disables) is outdated (that paper is from 2011, after all). Sequencing tech and base calling have improved significantly. It might be best to have BAQ disabled by default, or maybe update the algorithm so it handles newer sequencing tech 'better'.
Hi samtools,
When running samtools mpileup I get a much lower number of bases "piled up" than expected. My best guess after investigating is that soft-clipped reads get counted from 1bp downstream of where they should be, and this somehow gets worse when passing a reference. This affects the beginning of transcripts the most, but can be seen throughout. I have attached a small BAM file and accompanying reference of the first 20bp of human 18S to demonstrate. Here I would expect 17,928 bases to be reported in the first position (based on IGV inspection, screenshot below), but only 7,663 get called without a reference and 26 with a reference! Can you tell me what's happening? Is this a bug? Or should I be doing something differently?
4samtools_troubleshoot.tar.gz