samtools / samtools

Tools (written in C using htslib) for manipulating next-generation sequencing data
http://htslib.org/
Other
1.62k stars 581 forks source link

mpileup acting funny, possibly due to soft-clipped reads #2082

Closed joegeorgeson closed 3 months ago

joegeorgeson commented 3 months ago

Hi samtools,

When running samtools mpileup I get a much lower number of bases "piled-up" than what is expected. My best guess after investigating is that soft-clipped reads get counted from 1bp downstream of where they should be, and this somehow gets worse when passing a reference. This affects the beginning of transcripts the most, but can be seen throughout. I have attached a small bam file and accompanying reference of the first 20bp of human 18s to demonstrate.

Here I would expect 17,928 bases to be reported in the first position (based on IGV inspection, screenshot below), but only 7,663 get called without a reference and 26 with a reference! Can you tell me what's happening?! Is this a bug? Or should I be doing something differently?

 $ samtools mpileup example.bam |head -n 1
[mpileup] 1 samples in 1 input files
Hs_18s  1   N   7663 ...
 $ samtools mpileup -f example.fa example.bam |head -n 1
[mpileup] 1 samples in 1 input files
Hs_18s  1   T   26  ^~.^~.^~.^~.^~,^~.^~.^~.^~.^$.^~,^~,^~,^~,^~.^~.^~.^~.^~,^~.^~,^~.^~.^~.^~.^~.  EEEEEEE1E:EEEE:EEEEEEE;EEE
image

4samtools_troubleshoot.tar.gz

(run samtools --version)

$ samtools --version
samtools 1.18
Using htslib 1.18
Copyright (C) 2023 Genome Research Ltd.

Samtools compilation details:
    Features:       build=configure curses=yes
    CC:             gcc
    CPPFLAGS:       -I/apps/easybd/easybuild/software/cURL/8.0.1-GCCcore-12.3.0/include -I/apps/easybd/easybuild/software/XZ/5.4.2-GCCcore-12.3.0/include -I/apps/easybd/easybuild/software/bzip2/1.0.8-GCCcore-12.3.0/include -I/apps/easybd/easybuild/software/zlib/1.2.13-GCCcore-12.3.0/include -I/apps/easybd/easybuild/software/ncurses/6.4-GCCcore-12.3.0/include
    CFLAGS:         -O2 -ftree-vectorize -march=x86-64 -mtune=generic -fno-math-errno -fPIC
    LDFLAGS:        -L/apps/easybd/easybuild/software/cURL/8.0.1-GCCcore-12.3.0/lib64 -L/apps/easybd/easybuild/software/cURL/8.0.1-GCCcore-12.3.0/lib -L/apps/easybd/easybuild/software/XZ/5.4.2-GCCcore-12.3.0/lib64 -L/apps/easybd/easybuild/software/XZ/5.4.2-GCCcore-12.3.0/lib -L/apps/easybd/easybuild/software/bzip2/1.0.8-GCCcore-12.3.0/lib64 -L/apps/easybd/easybuild/software/bzip2/1.0.8-GCCcore-12.3.0/lib -L/apps/easybd/easybuild/software/zlib/1.2.13-GCCcore-12.3.0/lib64 -L/apps/easybd/easybuild/software/zlib/1.2.13-GCCcore-12.3.0/lib -L/apps/easybd/easybuild/software/ncurses/6.4-GCCcore-12.3.0/lib64 -L/apps/easybd/easybuild/software/ncurses/6.4-GCCcore-12.3.0/lib -L/apps/easybd/easybuild/software/GCCcore/12.3.0/lib64 -L/apps/easybd/easybuild/software/GCCcore/12.3.0/lib
    HTSDIR:         htslib-1.18
    LIBS:           -lm -lpthread
    CURSES_LIB:     -lncursesw

HTSlib compilation details:
    Features:       build=configure libcurl=yes S3=yes GCS=yes libdeflate=no lzma=yes bzip2=yes plugins=no htscodecs=1.5.1
    CC:             gcc
    CPPFLAGS:       -I/apps/easybd/easybuild/software/cURL/8.0.1-GCCcore-12.3.0/include -I/apps/easybd/easybuild/software/XZ/5.4.2-GCCcore-12.3.0/include -I/apps/easybd/easybuild/software/bzip2/1.0.8-GCCcore-12.3.0/include -I/apps/easybd/easybuild/software/zlib/1.2.13-GCCcore-12.3.0/include -I/apps/easybd/easybuild/software/ncurses/6.4-GCCcore-12.3.0/include
    CFLAGS:         -O2 -ftree-vectorize -march=x86-64 -mtune=generic -fno-math-errno -fPIC
    LDFLAGS:        -L/apps/easybd/easybuild/software/cURL/8.0.1-GCCcore-12.3.0/lib64 -L/apps/easybd/easybuild/software/cURL/8.0.1-GCCcore-12.3.0/lib -L/apps/easybd/easybuild/software/XZ/5.4.2-GCCcore-12.3.0/lib64 -L/apps/easybd/easybuild/software/XZ/5.4.2-GCCcore-12.3.0/lib -L/apps/easybd/easybuild/software/bzip2/1.0.8-GCCcore-12.3.0/lib64 -L/apps/easybd/easybuild/software/bzip2/1.0.8-GCCcore-12.3.0/lib -L/apps/easybd/easybuild/software/zlib/1.2.13-GCCcore-12.3.0/lib64 -L/apps/easybd/easybuild/software/zlib/1.2.13-GCCcore-12.3.0/lib -L/apps/easybd/easybuild/software/ncurses/6.4-GCCcore-12.3.0/lib64 -L/apps/easybd/easybuild/software/ncurses/6.4-GCCcore-12.3.0/lib -L/apps/easybd/easybuild/software/GCCcore/12.3.0/lib64 -L/apps/easybd/easybuild/software/GCCcore/12.3.0/lib -fvisibility=hidden

HTSlib URL scheme handlers present:
    built-in:    preload, data, file
    S3 Multipart Upload:     s3w, s3w+https, s3w+http
    Amazon S3:   s3+https, s3+http, s3
    Google Cloud Storage:    gs+http, gs+https, gs
    libcurl:     imaps, pop3, gophers, http, smb, gopher, ftps, imap, smtp, smtps, rtsp, ftp, telnet, mqtt, https, smbs, tftp, pop3s, dict
    crypt4gh-needed:     crypt4gh
    mem:     mem

Please describe your environment. OS (run uname -sr on Linux/Mac OS or wmic os get Caption, Version on Windows)

$ uname -sr
Linux 3.10.0-1160.81.1.el7.x86_64
machine architecture (run uname -m on Linux/Mac OS or  wmic os get OSArchitecture on Windows)
$ uname -m
x86_64
compiler (run gcc --version or clang --version)
$ gcc --version
gcc (GCC) 12.3.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
jkbonfield commented 3 months ago

As is so often the case with mpileup problems, the answer typically lies in disabling BAQ!

$ samtools mpileup -f example.fa example.bam|awk '{print length($NF)}'|head
[mpileup] 1 samples in 1 input files
26
2024
6177
6818
7489
7656
7669
7639
7662
7686

$ samtools mpileup -B -f example.fa example.bam|awk '{print length($NF)}'|head
[mpileup] 1 samples in 1 input files
7663
7698
7664
7643
7561
7668
7674
7643
7666
7686

It's setting the quality value first the first base to 0 for many sequences. You could also use -Q0 to show all data irrespective of base quality.

Edit: also using --max-depth 99999 to remove the 8000 depth cap, and disabling overlap removal, gives you the numbers you expect:

$ samtools mpileup -d 99999 -x -Q0 -f example.fa example.bam|awk '{print length($NF)}'|head
[mpileup] 1 samples in 1 input files
17928
18045
18270
18468
18712
18817
19103
19712
20025
20227

That's over-counting as it's counting reads rather than templates, but so was the naive count in your screenshot above.

joegeorgeson commented 3 months ago

Thank you, that clarifies most things. So one difference between using -B and -Q0 is also overlap removal? If I use the below, are the bases lost due to overlap removal or Q>13 or both?

Is there a combination of flags where read starts are counted, overlaps get removed and bases excluded based on quality? Or does that need to happen in follow-on processing.

 $ samtools mpileup -d 99999 -x -B -f example.fa example.bam | awk '{print length($NF)}' | head
[mpileup] 1 samples in 1 input files
17622
17852
18028
18191
18210
18575
18868
19363
19736
20019
jkbonfield commented 3 months ago

The -x option disables the overlap removal (so read-pairs with overlapping bases from READ1 and READ2 are counted twice). BAQ changes quality values, but the minimum quality threshold is still high unless you use -Q. The problem was BAQ was setting a lot to quality zero so the default -Q13 removed a lot of data, but even with -x and -B you'll get some bases still that are low quality and without specifying -Q0 you'll get them included.

(If I recall correctly -x is just setting bases to Q0, so -Q0 is essentially already including -x)

joegeorgeson commented 3 months ago

Thank you again, that clears it up! Closing :)

Edit: Actually wanted to add my 2 cents here ...I think the -B flag is outdated (that paper is from 2011 after all). Sequencing tech/base calling has improved significantly. It might be best to have the default setting -B disabled, or maybe update the algorithm so it handles newer sequencing tech 'better'.