samtools / samtools

Tools (written in C using htslib) for manipulating next-generation sequencing data
http://htslib.org/
Other
1.62k stars 579 forks source link

mpileup output-QNAME #1562

Closed apallav closed 2 years ago

apallav commented 2 years ago

Are you using the latest version of samtools and HTSlib? If not, please specify.

(run samtools --version)

samtools 1.9 Using htslib 1.9 Copyright (C) 2018 Genome Research Ltd.

Please describe your environment.

Linux 3.10.0-514.6.2.el7.x86_64

x86_64

gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44.0.3)

Please specify the steps taken to generate the issue, the command you are running and the relevant output.

Hi,

I have a question regarding the correctness of output format given out in the context of insertion/ deletions when using mpileup with --output-QNAME option.

I use following command line :

samtools-1.9/bin/samtools mpileup --output-QNAME -d 10000000 -x -A -s -B -a -Q 0 -C 0 -q 0 -r chr17:56435159-56435159 -f reference.fa bamfile.sorted.bam

Issue is two fold:

  1. The number of read Ids do not match with number of entries in the bases field ( field 5) when there are indels

for example:

for position in the commanline, I get 197576 in DP filed ( 4th) and 197576 read ids. But I encounter 198018 bases in column#5 with 442 insertion or deletion bases.

  1. the read Id entries for the insertion or deletions for the next base do not get encoded in the current position, but # bases do. This leads to mis-alignment of the readIDs with the bases in the 5 th field.

is there a way you could also incorporate the readIDs with indels in the current position as well as next position so the read depth and number of read IDs match?

Or Could you give me any pointers to edit the code to make the correction? I am vaguely familiar with C++.

Or do you have any other suggestions for me.

apallav commented 2 years ago

Any comment?

jkbonfield commented 2 years ago

I cannot reproduce this. Do you have any example data?

Eg with the test data supplied in samtools source tree:

@ [samtools.../samtools]; more test/mpileup/mp_DI.sam
@SQ SN:z    LN:13
@CO TAG  CTTAGCA  GGT ref
@CO  AG  CTTAG**TTGGC s1
@CO  AG  CTTAG**TT    s2
@CO  AG  CTTAG***T    s4
@CO  **AACTTAGCA  GG  s3
@CO  ***ACTTAGCA  GG  s5
@CO  01AB2345678AB901 qual
s1  0   z   2   0   7M2D2I3M    *   0   0   
AGCTTAGTTGGC    0123456AB901
s2  0   z   2   0   7M2D2I  *   0   0   AGCTTAGT
T   0123456AB
s4  0   z   2   0   7M2D1P1I    *   0   0   
AGCTTAGT    0123456B
s3  0   z   2   0   2D2I9M  *   0   0   AACTTAGC
AGG AB234567890
s5  0   z   2   0   2D1P1I9M    *   0   0   
ACTTAGCAGG  B234567890

@ [samtools.../samtools]; samtools mpileup --output-QNAME -d 10000000 -x -A -s -B -a -Q 0 -C 0 -q 0 test/mpileup/mp_DI.sam 
[mpileup] 1 samples in 1 input files
[mpileup] Combined max depth is above 1M. Potential memory hog!
z   1   N   0   *   *   *   *
z   2   N   5   ^!A^!A^!A^!*^!* 000AB   !!!!!   s1,s2,s4,s3,s5
z   3   N   5   GGG*+2AA*+2*A   111AB   !!!!!   s1,s2,s4,s3,s5
z   4   N   5   CCCCC   22222   !!!!!   s1,s2,s4,s3,s5
z   5   N   5   TTTTT   33333   !!!!!   s1,s2,s4,s3,s5
z   6   N   5   TTTTT   44444   !!!!!   s1,s2,s4,s3,s5
z   7   N   5   AAAAA   55555   !!!!!   s1,s2,s4,s3,s5
z   8   N   5   G-2NNG-2NNG-2NNGG   66666   !!!!!   s1,s2,s4,s3,s5
z   9   N   5   ***CC   AAB77   !!!!!   s1,s2,s4,s3,s5
z   10  N   5   *+2TT*+2TT$*+2*T$AA AAB88   !!!!!   s1,s2,s4,s3,s5
z   11  N   3   GGG 999 !!! s1,s3,s5
z   12  N   3   GG$G$   000 !!! s1,s3,s5
z   13  N   1   C$  1   !   s1

Every position has s1 to s5 included, regardless of whether there is an insertion or deletion.

I wonder if this problem is related to ectreme depth? It would help if you could try various samtools view -s SEED.1 type reductions, with differing seeds, to try and reduce your data set down to see if you still get the problem. If so, keep reducing it to try and get the smallest example that triggers the bug. That then may explain the specifics of what's going on. Eg is it reads that start with an indel? End with one? Use "P" cigar op? etc.

daviesrob commented 2 years ago

Do you see this problem with the latest version of samtools?

apallav commented 2 years ago

HI, Thanks for getting back.

so for the entry: N 5 G-2NNG-2NNG-2NNGG 66666 !!!!! s1,s2,s4,s3,s5

G, -2NN, G,-2NN,G , -2NN,G,G = 8 reads. but you got only 5 reads in Q name column. Am I interpreting it right?

I have seeded the bam files down to 0.1 . I have corresponding bam/mpielup files. Do you have a place where I can upload them for your review?

Aparna

apallav commented 2 years ago

Do you see this problem with the latest version of samtools?

Not yet. I will test and get back.

daviesrob commented 2 years ago

The line at position 8 should be interpreted as:

(G-2NN) (G-2NN) (G-2NN) (G) (G)

So five reads. The first three have a G followed by two deleted bases, the other two are G.

jkbonfield commented 2 years ago

Maybe this is just a misunderstanding of the mpileup format. The sequence string is annotated with inserted and deleted characters (not just "*", but for the start of the indel it'll be +/- and the sequence. It also gets ^ and $ markers too.

With a modern samtools you can use samtools mpileup --no-output-ins --no-output-ins --no-output-del --no-output-del --no-output-ends in.bam. Note some of those options appear twice, to indicate complete removal. This leaves just the raw sequence, so the sequence length should then match the depth, the length of qualities, and the number of QNAME entries.

Also, regarding producing a smaller example. My suggestion is to try different -s "seed.fraction" parameters. If .1 still shows the problem, then try .01, .001, etc. Once you've got the smallest problematic file you can get, try starting from that new file using another -s option to further reduce it (and vary the seed to remove a different random selection). This way you can gradually whittle down a large data set to the smallest one that demonstrates a problem. It's a useful technique for understanding what's actually going on with your data as you can sift out the "normal" reads until you're just left with the problem.

apallav commented 2 years ago

The line at position 8 should be interpreted as:

(G-2NN) (G-2NN) (G-2NN) (G) (G)

So five reads. The first three have a G followed by two deleted bases, the other two are G.

With v1.9 I am seeing this format:

-1A.-1A.......-1A............-1A..-1A......-1A.-1A................................................-1A.....-1A....-1A.-1A.-1A......,,,,,,,$,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,$,$,,,,,,,,,,,,,,,,,,,,$,

does .-1A mean it is wild type base followed by A deletion?

1.9 version documentation here does not mention indeed context ( base followed by indel format)

A pattern \\+[0-9]+[ACGTNacgtn]+' indicates there is an insertion between this reference position and the next reference position. The length of the insertion is given by the integer in the pattern, followed by the inserted sequence. Similarly, a pattern-[0-9]+[ACGTNacgtn]+' represents a deletion from the reference. The deleted bases will be presented as `*' in the following lines.

So given the format, I think I am right in interpreting:

The number of read Ids do not match with number of entries in the bases field ( field 5) when there are indels for example:

for position in the commanline, I get 197576 in DP filed ( 4th) and 197576 read ids. But I encounter 198018 bases in column#5 with 442 insertion or deletion bases.

the read Id entries for the insertion or deletions for the next base do not get encoded in the current position, but # bases do. This leads to mis-alignment of the readIDs with the bases in the 5 th field.

apallav commented 2 years ago

Do you see this problem with the latest version of samtools?

Not yet. I will test and get back.

I tested using latest version (v1.14) . Right off the bat, it is strange that output is not reporting all the read data.

$cat test.chr17-56435159.v14.mp|awk -F "\t" '{print $1,$2,$3,$4}'

chr17 56435159 G 197576

where as with v1.9 :

$cat test.chr17-56435159.v9.mp|awk -F "\t" '{print $1,$2,$3,$4}' chr17 56435159 G 197576

jkbonfield commented 2 years ago

The "." is because you've supplied a reference, so "." and "," indicate it matched the reference. I'm not sure if that's always "wildtype" as sometimes the reference has the rare allele, but close enough.

I don't understand your following sentence though:

1.9 version documentation here does not mention indeed context ( base followed by indel format)

I believe the tool is working as intended and as documented.

jkbonfield commented 2 years ago

I tested using latest version (v1.14) . Right off the bat, it is strange that output is not reporting all the read data.

$cat test.chr17-56435159.v14.mp|awk -F "\t" '{print $1,$2,$3,$4}' chr17 56435159 G 197576

where as with v1.9 :

$cat test.chr17-56435159.v9.mp|awk -F "\t" '{print $1,$2,$3,$4}' chr17 56435159 G 197576

These are identical. Did you intend to post something else?

Mpileup will report all the data that passes the default filtering criteria, which includes depth, mapping quality, base quality and overlap removal. (Maybe more.) You seem to have adjusted for those. What makes you believe it's not reporting everything?

apallav commented 2 years ago

The "." is because you've supplied a reference, so "." and "," indicate it matched the reference. I'm not sure if that's always "wildtype" as sometimes the reference has the rare allele, but close enough.

I don't understand your following sentence though:

v1.9 documentation does not accurately describe the format Reference base followed by -\d+[Aa-Zz]+ and so on.

Ref: (G-2NN) (G-2NN) (G-2NN) (G) (G) is the correct way to interpret,

v1.9 Documentation describes as follows:

"A pattern `\+[0-9]+[ACGTNacgtn]+' indicates there is an insertion between this reference position and the next reference position."

Ref: (G-2NN) (G-2NN) (G-2NN) (G) (G) is the correct way to interpret,

means the document should have stated:

"A pattern (.|,)\+[0-9]+[ACGTNacgtn]+' indicates there is an insertion between this reference position and the next reference position."

If this is correct,i.e., (.|,)\+[0-9]+[ACGTNacgtn]+' or (.|,)\-[0-9]+[ACGTNacgtn]+', it will account for the reads Ids in the QNAME column ( 8th)

1.9 version documentation here does not mention indeed context ( base followed by indel format)

I believe the tool is working as intended and as documented.

I am sure . I think there is mis representation either in the documentation or in the output that I am seeing descrepant count between read bases ( column 5) and readIds ( column 8).

I would like to make sure output format described in the manual is clear for v1.9 at least.

Please also take a look at my comment regarding testing v1.14.

apallav commented 2 years ago

I tested using latest version (v1.14) . Right off the bat, it is strange that output is not reporting all the read data. $cat test.chr17-56435159.v14.mp|awk -F "\t" '{print $1,$2,$3,$4}' chr17 56435159 G 197576 where as with v1.9 : $cat test.chr17-56435159.v9.mp|awk -F "\t" '{print $1,$2,$3,$4}' chr17 56435159 G 197576

These are identical. Did you intend to post something else?

Mpileup will report all the data that passes the default filtering criteria, which includes depth, mapping quality, base quality and overlap removal. (Maybe more.) You seem to have adjusted for those. What makes you believe it's not reporting everything?

Here is my command line:

/production/samtools/samtools-1.14/bin/samtools mpileup -d 10000000 -x -A -s -B -a -Q 0 -C 0 -q 0 -r chr17:56435159-56435159 -f reference.fa bamfile.sorted.bam

I am specifying not to filter any data. Then why the delta in DP column between v1.9 and v1.14?

apallav commented 2 years ago

I tested using latest version (v1.14) . Right off the bat, it is strange that output is not reporting all the read data. $cat test.chr17-56435159.v14.mp|awk -F "\t" '{print $1,$2,$3,$4}' chr17 56435159 G 197576 where as with v1.9 : $cat test.chr17-56435159.v9.mp|awk -F "\t" '{print $1,$2,$3,$4}' chr17 56435159 G 197576

These are identical. Did you intend to post something else? Mpileup will report all the data that passes the default filtering criteria, which includes depth, mapping quality, base quality and overlap removal. (Maybe more.) You seem to have adjusted for those. What makes you believe it's not reporting everything?

Here is my command line:

/production/samtools/samtools-1.14/bin/samtools mpileup -d 10000000 -x -A -s -B -a -Q 0 -C 0 -q 0 -r chr17:56435159-56435159 -f reference.fa bamfile.sorted.bam

I am specifying not to filter any data. Then why the delta in DP column between v1.9 and v1.14?

Sorry, copy paste error:

with v1.14 I am seeing:

chr17 56435159 G 17227

and with v1.9 I am noticing:

chr17 56435159 G 197576

jmarshall commented 2 years ago

I am sure . I think there is mis representation either in the documentation or in the output that I am seeing descrepant count between read bases ( column 5) and readIds ( column 8).

You are misunderstanding the old documentation, which is understandable as it is not very clear. You should use the current mpileup format documentation, which describes exactly the same format as your samtools 1.9 is emitting but describes it more clearly.

jkbonfield commented 2 years ago

v1.9 Documentation describes as follows:

"A pattern `+[0-9]+[ACGTNacgtn]+' indicates there is an insertion between this reference position and the next reference position."

Ref: (G-2NN) (G-2NN) (G-2NN) (G) (G) is the correct way to interpret,

means the document should have stated:

"A pattern (.|,)+[0-9]+[ACGTNacgtn]+' indicates there is an insertion between this reference position and the next reference position."

It's describing the regular expression that describes an indel, not the regular expression for an entire base entry. You're stating a regular expression that describes a base-call followed by an indel, which inevitably includes the former as a component of it. I think you're rather "splitting hairs" here, and it's not even complete as you it could include base-calls if not using a reference. It's too complex to try and describe the entire line as a regexp, so it just describes the component it is talking about: an indel. The first anchor base is not part of an insertion or deletion, so the documentation is correct.

If this is correct,i.e., (.|,)+[0-9]+[ACGTNacgtn]+' or (.|,)-[0-9]+[ACGTNacgtn]+', it will account for the reads Ids in the QNAME column ( 8th)

The read ids in the QNAME column refer to the number of reads aligning to that position. The change in regular expression has nothing to do with that.

Please also take a look at my comment regarding testing v1.14.

There have been numerous bug fixes made between 1.9 and 1.14. I think I'd need to see the data to be able to say which value is the correct one and to pinpoint why the numbers differ. How many reads actually overlap that point? You should be able to do:

samtools view bamfile.sorted.bam chr17:56435159-56435159 |wc -l

to get a figure for the unfiltered score.

It could be to do with deletions though, or possibly something related to extreme depth. I checked all the way back to samtools 1.2 and it matched 1.14 for depth across 1 million positions, but my data wasn't 200K fold deep.

apallav commented 2 years ago

$ /production/samtools/samtools-1.9/bin/samtools view test.sorted.bam chr17:56435159-56435159 |wc -l 197579 $ /production/samtools/samtools-1.14/bin/samtools view test.sorted.bam chr17:56435159-56435159 |wc -l 197579 $ /production/samtools/samtools-1.9/bin/samtools mpileup -r chr17:56435159-56435159 -d 10000000 -x -A -s -B -a -Q 0 -C 0 -q 0 -f /production/References/bwa_indexes/bwa-0.7.17/hg19.viralIndices/hg19.Viral.fa test.sorted.bam|awk -F "\t" '{print $1,$2,$3,$4}' [mpileup] 1 samples in 1 input files [mpileup] Combined max depth is above 1M. Potential memory hog! chr17 56435159 G 197576 $ /production/samtools/samtools-1.14/bin/samtools mpileup -r chr17:56435159-56435159 -d 10000000 -x -A -s -B -a -Q 0 -C 0 -q 0 -f /production/References/bwa_indexes/bwa-0.7.17/hg19.viralIndices/hg19.Viral.fa test.sorted.bam|awk -F "\t" '{print $1,$2,$3,$4}' [mpileup] 1 samples in 1 input files [mpileup] Combined max depth is above 1M. Potential memory hog! chr17 56435159 G 17227 $

apallav commented 2 years ago

$ /production/samtools/samtools-1.9/bin/samtools view test.sorted.bam chr17:56435159-56435159 |wc -l 197579 $ /production/samtools/samtools-1.14/bin/samtools view test.sorted.bam chr17:56435159-56435159 |wc -l 197579 $ /production/samtools/samtools-1.9/bin/samtools mpileup -r chr17:56435159-56435159 -d 10000000 -x -A -s -B -a -Q 0 -C 0 -q 0 -f /production/References/bwa_indexes/bwa-0.7.17/hg19.viralIndices/hg19.Viral.fa test.sorted.bam|awk -F "\t" '{print $1,$2,$3,$4}' [mpileup] 1 samples in 1 input files [mpileup] Combined max depth is above 1M. Potential memory hog! chr17 56435159 G 197576 $ /production/samtools/samtools-1.14/bin/samtools mpileup -r chr17:56435159-56435159 -d 10000000 -x -A -s -B -a -Q 0 -C 0 -q 0 -f /production/References/bwa_indexes/bwa-0.7.17/hg19.viralIndices/hg19.Viral.fa test.sorted.bam|awk -F "\t" '{print $1,$2,$3,$4}' [mpileup] 1 samples in 1 input files [mpileup] Combined max depth is above 1M. Potential memory hog! chr17 56435159 G 17227 $

$ /production/samtools/samtools-1.14/bin/samtools --version samtools 1.14 Using htslib 1.14 Copyright (C) 2021 Genome Research Ltd.

Samtools compilation details: Features: build=configure curses=yes CC: gcc CPPFLAGS:
CFLAGS: -Wall -g -O2 LDFLAGS:
HTSDIR: htslib-1.14 LIBS:
CURSES_LIB: -lncursesw

HTSlib compilation details: Features: build=configure plugins=no libcurl=yes S3=yes GCS=yes libdeflate=no lzma=yes bzip2=yes htscodecs=1.1.1-1-ged325d7 CC: gcc CPPFLAGS:
CFLAGS: -Wall -g -O2 -fvisibility=hidden LDFLAGS: -fvisibility=hidden

HTSlib URL scheme handlers present: built-in: preload, data, file S3 Multipart Upload: s3w, s3w+https, s3w+http Amazon S3: s3+https, s3+http, s3 Google Cloud Storage: gs+http, gs+https, gs libcurl: imaps, pop3, http, gopher, sftp, ftps, imap, smtp, smtps, rtsp, scp, ftp, telnet, ldap, https, ldaps, tftp, pop3s, dict crypt4gh-needed: crypt4gh mem: mem

apallav commented 2 years ago

It's describing the regular expression that describes an indel, not the regular expression for an entire base entry. You're stating a regular expression that describes a base-call followed by an indel, which inevitably includes the former as a component of it. I think you're rather "splitting hairs" here, and it's not even complete as you it could include base-calls if not using a reference. It's too complex to try and describe the entire line as a regexp, so it just describes the component it is talking about: an indel. The first anchor base is not part of an insertion or deletion, so the documentation is correct.

Unfortunately I have to split hair :) literally. I am aligning the base change ( Ref,mismatch or indwells) to corresponding entry in the 8 column to get the read ID that has that base change. It is extremely easy to parse this and not at all time consuming via mpileup rather than venture out my own code mapping CIGAR to the T of each read @ 200K read depth at any given position. I am later using the read tags to calculate UMI families. I know there are tools out there that do this kind of calculation, but using pileup format gets me much more detailed calculations than the other tools will give me.

If this is correct,i.e., (.|,)+[0-9]+[ACGTNacgtn]+' or (.|,)-[0-9]+[ACGTNacgtn]+', it will account for the reads Ids in the QNAME column ( 8th)

The read ids in the QNAME column refer to the number of reads aligning to that position. The change in regular expression has nothing to do with that.

It does makes difference to my application as I am trying to get the index of the base position to associate that to read at the index in the 8th column.

So far mpileup has been really consistent when at mis-matches in identifying the read IDs with particular mismatch.

jkbonfield commented 2 years ago

It does makes difference to my application as I am trying to get the index of the base position to associate that to read at the index in the 8th column.

So far mpileup has been really consistent when at mis-matches in identifying the read IDs with particular mismatch.

I'm still not understanding your point.

The mpileup sequence column is reporting multiple things.

The names only apply to the first of those categories - bases. We shouldn't expect extra read IDs for any other symbol appearing in the sequence column. It just makes no sense to do that. If your parsing is getting confused, then use the options I outlined above for removal of that extra markup so you only see the reference position specific bases.

As for why 1.14 is giving very different depth figures to 1.9, I'm baffled. I tried it on a sample with 260k depth and it worked fine. We'd need to see your data, or for you to filter it down to lower depth and investigate the cause of the problem.

apallav commented 2 years ago

Ok. Sorry for the confusion. How do I figure out what read has deletion or insertion at the given base from mpilup output when using output-Qname option.

apallav commented 2 years ago

Ok. Sorry for the confusion. How do I figure out what read has deletion or insertion at the given base from mpilup output when using output-Qname option.

apallav commented 2 years ago

“As for why 1.14 is giving very different depth figures to 1.9, I'm baffled. I tried it on a sample with 260k depth and it worked fine. We'd need to see your data, or for you to filter it down to lower depth and investigate the cause of the problem.”

I am using 1.9 in my production and happy with it for using pileup option. I will evaluate when I try to upgrade to 1.14 when I am ready.

But for your testing, I can upload my data for the coordinate in question. Just let me know where.

apallav commented 2 years ago

“As for why 1.14 is giving very different depth figures to 1.9, I'm baffled. I tried it on a sample with 260k depth and it worked fine. We'd need to see your data, or for you to filter it down to lower depth and investigate the cause of the problem.”

I am using 1.9 in my production and happy with it for using pileup option. I will evaluate when I try to upgrade to 1.14 when I am ready.

But for your testing, I can upload my data for the coordinate in question. Just let me know where.

jkbonfield commented 2 years ago

Sorry, I don't have access to a server for you to upload things to. I suggest using something like dropbox or host it at your end somewhere.

How do I figure out what read has deletion or insertion at the given base from mpilup output when using output-Qname option.

That'll be the markup bit I mentioned.

The man page states:

     o A single character indicating the read base and the strand to which
       the read has been mapped:

       Forward   Reverse                    Meaning
       ---------------------------------------------------------------
        . dot    , comma   Base matches the reference base
        ACGTN     acgtn    Base is a mismatch to the reference base

          >         <      Reference skip (due to CIGAR "N")
          *        */#     Deletion of the reference base (CIGAR "D")

       Deleted bases are shown as "*" on both strands unless --reverse-del
       is used, in which case they are shown as "#" on the reverse strand.

     o If  there  is  an  insertion  after  this  read base, text matching
       "\+[0-9]+[ACGTNacgtn*#]+": a "+" character followed by  an  integer
       giving  the length of the insertion and then the inserted sequence.
       Pads are shown as "*" unless --reverse-del is used, in  which  case
       pads on the reverse strand will be shown as "#".

So you see it lists the base (dot/comma, ACGTN, >/< or */#) followed by the markup indicating inserted and deleted bases. The QNAME option is irrelevant and doesn't affect the sequence column. Although if you're trying to marry the two together, it's strictly one QNAME per sequence aligned to the reference base (regardless of indels).

apallav commented 2 years ago

is there an email that I can use to share the data using "box"?

jkbonfield commented 2 years ago

Try samtools-core at sanger dot ac.uk

apallav commented 2 years ago

Hi,

Sorry for the delay. I shared the bam file over one drive. Please let me know how it goes.

I have a followup question. is it possible / how easy it is to generate a semicolon delimited base string ( field # 5) in pileup output?

jkbonfield commented 2 years ago

Thanks for the test data.

Firstly, I see identical behaviour between samtools 1.9 and samtools 1.14, even down to the same md5sum.

Secondly, the number of QNAME entries matches the depth column, both showing 17227. This is also matched by the number of (non-markup) characters in the sequence column:

samtools mpileup --no-output-ins --no-output-ins --no-output-del --no-output-del --no-output-ends --output-QNAME -d 10000000 -x -A -s -B -a -Q 0 -C 0 -q 0 -r chr17:56435159-56435159 test.563.chr17-56435159.sorted.bam|cut -f 5|tr -d '\012'|wc -c

Therefore I believe the software is working as intended. If you are getting a different number of bases, I suspect it's because you're using the fully marked up sequence column and not correctly accounting for the number of additional characters added. (This can be challenging as it's not something a POSIX regexp can remove due to the counting required.)

apallav commented 2 years ago

Hi,

Thanks for spending time to test. However I disagree that the output is same between the versions.

Could you report what is DP using both versions?
Could you also use -f flag with hg19 reference reference file?

My data is generated using deep sequencing from targeted capture. I am using bwa-mem aligner and Picard tools mark duplicates with mate CIGAR to mark the duplicates. I will have lot of overlapping reads, so using -x is essential.

Here is what I am finding:

$ samtools view test.563.chr17-56435159.sorted.bam|wc -l 197579

$ samtools view test.563.chr17-56435159.sorted.bam chr17:56435159-56435159 |wc -l 197579

1.9 pileup with default settings: DP=7860

$ /production/samtools/samtools-1.9/bin/samtools mpileup -r chr17:56435159-56435159 -f /production/References/bwa_indexes/bwa-0.7.17/hg19.viralIndices/hg19.Viral.fa test.563.chr17-56435159.sorted.bam|awk -F "\t" '{print $1,$2,$3,$4}' [mpileup] 1 samples in 1 input files chr17 56435159 G 7860 $

1.9 pileup with -d 0 -x -A -s -B -Q 0 -C 0 -q 0 , DP = 197576

$ /production/samtools/samtools-1.9/bin/samtools mpileup -r chr17:56435159-56435159 -d 0 -x -A -s -B -Q 0 -C 0 -q 0 -f /production/References/bwa_indexes/bwa-0.7.17/hg19.viralIndices/hg19.Viral.fa test.563.chr17-56435159.sorted.bam|awk -F "\t" '{print $1,$2,$3,$4}' [mpileup] 1 samples in 1 input files [mpileup] Max depth set to maximum value (2147483647) chr17 56435159 G 197576 $


1.14 pileup with default settings: DP=7758

$ /production/samtools/samtools-1.14/samtools mpileup -r chr17:56435159-56435159 -f /production/References/bwa_indexes/bwa-0.7.17/hg19.viralIndices/hg19.Viral.fa test.563.chr17-56435159.sorted.bam|awk -F "\t" '{print $1,$2,$3,$4}' [mpileup] 1 samples in 1 input files chr17 56435159 G 7758 $

1.9 pileup with -d 0 -x -A -s -B -Q 0 -C 0 -q 0 , DP = 17227

$ /production/samtools/samtools-1.14/samtools mpileup -r chr17:56435159-56435159 -d 0 -x -A -s -B -Q 0 -C 0 -q 0 -f /production/References/bwa_indexes/bwa-0.7.17/hg19.viralIndices/hg19.Viral.fa test.563.chr17-56435159.sorted.bam|awk -F "\t" '{print $1,$2,$3,$4}' [mpileup] 1 samples in 1 input files [mpileup] Max depth set to maximum value (2147483647) chr17 56435159 G 17227 $

apallav commented 2 years ago

Also loding test bam in IGV show following coverage track metrics:

chr17:56,435,159 Total count: 197579 A : 20 (0%, 12+, 8- ) C : 61 (0%, 27+, 34- ) G : 196441 (99%, 94909+, 101532- ) T : 121 (0%, 34+, 87- ) N : 936 (0%, 463+, 473- )

jkbonfield commented 2 years ago

Could you report what is DP using both versions? Could you also use -f flag with hg19 reference reference file?

DP is a VCF term and we're not using VCF format, but I'm assuming you mean the equivalent which is depth in column 4. I already gave this figure - it's 17227.

My data is generated using deep sequencing from targeted capture. I am using bwa-mem aligner and Picard tools mark duplicates with mate CIGAR to mark the duplicates. I will have lot of overlapping reads, so using -x is essential.

I am using -x: see the command line I quoted above.

Here is what I am finding:

$ samtools view test.563.chr17-56435159.sorted.bam chr17:56435159-56435159 |wc -l

197579

I get the same. However you are maybe being confused by the duplicates:

$ samtools view -h test.563.chr17-56435159.sorted.bam chr17:56435159-56435159 | samtools flagstat -
197579 + 0 in total (QC-passed reads + QC-failed reads)
197576 + 0 primary
3 + 0 secondary
0 + 0 supplementary
180349 + 0 duplicates
180349 + 0 primary duplicates
197579 + 0 mapped (100.00% : N/A)
197576 + 0 primary mapped (100.00% : N/A)
197576 + 0 paired in sequencing
99538 + 0 read1
98038 + 0 read2
196914 + 0 properly paired (99.66% : N/A)
197359 + 0 with itself and mate mapped
217 + 0 singletons (0.11% : N/A)
245 + 0 with mate mapped to a different chr
245 + 0 with mate mapped to a different chr (mapQ>=5)

So that's 197576 records of which 180349 have been marked as duplicate, leaving 17227, which is what I see. The software is working absolutely as it should.

It's possible you wanted to include the duplicates as with extreme depth most duplicates will be fake, although I wonder in that case why you used the Picard MarkDuplicates command? If you've got this data already and you wish to ignore the duplicate flag, then you can use the mpileup --ff option. This gets you the 197579 figure you desire.

$ samtools mpileup --ff UNMAP,SECONDARY,QCFAIL -d0 -x -A -Q0 -q0 -r chr17:56435159-56435159 test.563.chr17-56435159.sorted.bam|cut -f 4
[mpileup] 1 samples in 1 input files
[mpileup] Max depth set to maximum value (2147483647)
197576

I get the same figure using a copy of samtools 1.9 too with that option enabled. I still cannot explain why your copy of 1.9 differs to my copy of 1.9, but I think it's a red herring here. We should concentrate on getting the current release to work for you.

My instinct is it's simply a case of inappropriate application of a best-practices guide for your data, which is causing the problems you are seeing. Specifically requesting that PCR duplicates are marked on data where, by chance, we'll have many sequences starting and stopping at the same location due to the sheer depth.

apallav commented 2 years ago

Could you report what is DP using both versions? Could you also use -f flag with hg19 reference reference file?

DP is a VCF term and we're not using VCF format, but I'm assuming you mean the equivalent which is depth in column 4. I already gave this figure - it's 17227.

Yes. Yes. Sorry for the confusion.

My data is generated using deep sequencing from targeted capture. I am using bwa-mem aligner and Picard tools mark duplicates with mate CIGAR to mark the duplicates. I will have lot of overlapping reads, so using -x is essential.

I am using -x: see the command line I quoted above.

Here is what I am finding:

$ samtools view test.563.chr17-56435159.sorted.bam chr17:56435159-56435159 |wc -l

197579

I get the same. However you are maybe being confused by the duplicates:

I would like to see the pileup of all reads involved ( same as I see visually in IGV) so yes I am counting duplicates at any given region.

$ samtools view -h test.563.chr17-56435159.sorted.bam chr17:56435159-56435159 | samtools flagstat - 197579 + 0 in total (QC-passed reads + QC-failed reads) 197576 + 0 primary 3 + 0 secondary 0 + 0 supplementary 180349 + 0 duplicates 180349 + 0 primary duplicates 197579 + 0 mapped (100.00% : N/A) 197576 + 0 primary mapped (100.00% : N/A) 197576 + 0 paired in sequencing 99538 + 0 read1 98038 + 0 read2 196914 + 0 properly paired (99.66% : N/A) 197359 + 0 with itself and mate mapped 217 + 0 singletons (0.11% : N/A) 245 + 0 with mate mapped to a different chr 245 + 0 with mate mapped to a different chr (mapQ>=5)


So that's 197576 records of which 180349 have been marked as duplicate, leaving 17227, which is what I see. The software is working absolutely as it should.

It's possible you wanted to include the duplicates as with extreme depth most duplicates will be fake, although I wonder in that case why you used the Picard MarkDuplicates command? If you've got this data already and you wish to ignore the duplicate flag, then you can use the mpileup `--ff` option. This gets you the 197579 figure you desire.

This makes sense. Yes, we expect lot of reads with same start-stops ( as we do have bunch of PCR cycles pre/post hybridization) , it does not matter how many because we use to collapse them to one or more UMI family in the end.

$ samtools mpileup --ff UNMAP,SECONDARY,QCFAIL -d0 -x -A -Q0 -q0 -r chr17:56435159-56435159 test.563.chr17-56435159.sorted.bam|cut -f 4 [mpileup] 1 samples in 1 input files [mpileup] Max depth set to maximum value (2147483647) 197576

I get the same figure using a copy of samtools 1.9 too with that option enabled. I still cannot explain why your copy of 1.9 differs to my copy of 1.9, but I think it's a red herring here. We should concentrate on getting the current release to work for you.

Agree.

1.9 version gets 197576 with or without --ff UNMAP,SECONDARY,QCFAIL flag

but 1.14 gets 197576 with --ff UNMAP,SECONDARY,QCFAIL flag and 17227 without --ff UNMAP,SECONDARY,QCFAIL flag.

--ff, --excl-flags STR|INT filter flags: skip reads with mask bits set - I guess it means to INCLUDE reads with mask bits set in the output. Noted.

Thank you very much for explanation.

My instinct is it's simply a case of inappropriate application of a best-practices guide for your data, which is causing the problems you are seeing. Specifically requesting that PCR duplicates are marked on data where, by chance, we'll have many sequences starting and stopping at the same location due to the sheer depth.

Correct. I do have lot of reads starting - stopping at same location and average of @150K reads at any given coordinate in targeted regions. I would prefer to use latest version for my testing, now that I am clear.

It will be immensely useful to use a delimiting character, such as semicolon, in the base string of mpileup output. Is it possible to add this as a feature request?

jkbonfield commented 2 years ago

Closing as we believe this to be an issue of inappropriate duplicate removal, and workarounds have been provided.

I still don't understand why you see differences in 1.9 to me, but it's somewhat irrelevant given the current version behaves the same for both of us.

kaanokay commented 6 months ago

Hi @jkbonfield

I have been working on mpileup output for long time. I'm trying to develop a method by using mpileup output.

I want to match base symbols to read ids but I haven't figured that out.

Is there any way to do that easily? any suggestion would be super helpful for me.

Screenshot from 2024-03-28 22-02-12

All the best.

jkbonfield commented 6 months ago

Please don't add new questions to old closed issues. Best to create new ones, or for general usage queries try something like BioStars which will reach a larger audience.

For this issue though, it's simply a matter of parsing the sequence string and matching those bases to qualities to IDs, as they're all in the same order. (I don't understand where your second table comes from. It appears to have no relationship to the mpileup above.)

So in your example ,$ is 1st read (now terminating, hence $) which will have qual J and read name 4af... Then . (b6d...), and ,-1g (3ec) and so on.

It can be a horrible format to parse. IF you don't care about begin / end markers (^ and $) and don't care about indels, then you can use samtools mpileup --no-output-ends --no-output-ins --no-output-del to reduce the clutter and make it easier to process.

However ideally you'd not use this method at all and just go direct to the C API and code something up that way. It will be considerably faster than converting to string form and attempting to parse it back again. That will give you access to the raw BAM objects, allowing you to query whatever you wish directly. It's basically just an API which rotates the data, giving you columns of bases at a time instead of alignment by alignment. See https://github.com/samtools/htslib/blob/develop/samples/pileup.c for a basic example.

kaanokay commented 6 months ago

Dear @jkbonfield,

Thank you for quick response. I'd open another issue for this sure, that's my bad.

, should correspond to first read (4a..), $. should correspond to second read (b6..), and ,-1g should correspond to third read. Am I wrong?

I created a table containing symbols and corresponding reads. In my experience in IGV, symbols should be matched with reads as shown in table below.

Screenshot from 2024-04-03 10-53-33.

Unfortunately, I need to keep all indels and deletions.

Do you know to way to use this API to get a symbol and read-id match from bam files? An example would be super helpful.

All the best.