pysam-developers / pysam

Pysam is a Python package for reading, manipulating, and writing genomics data such as SAM/BAM/CRAM and VCF/BCF files. It's a lightweight wrapper of the HTSlib API, the same one that powers samtools, bcftools, and tabix.
https://pysam.readthedocs.io/en/latest/
MIT License
776 stars 275 forks source link

Overlapping paired-end read quality scores #713

Open lowandrew opened 6 years ago

lowandrew commented 6 years ago

When getting query base qualities out of a pileup, I'm often getting numbers that are way higher than I would expect qualities to be (>70).

As far as I can tell, this happens when paired end reads overlap - in the overlap regions, quality looks like it's getting reported as the sum of the forward read and reverse read quality, whereas I would have expected to get the two qualities out individually.

Not entirely sure if this is intended behaviour or a bug.

Code to get base qualities out is below, this occurs using samtools 1.6 and pysam 0.14.1.

Thanks!

import pysam
samfile = pysam.AlignmentFile("ex1.bam", "rb" )
for pileupcolumn in samfile.pileup():
    for pileupread in pileupcolumn.pileups:
        if not pileupread.is_del and not pileupread.is_refskip:
            # query position is None if is_del or is_refskip is set.
            print(pileupread.alignment.query_qualities[pileupread.query_position])
            # Gives sum of forward + reverse read quality at that base in overlap locations?
samfile.close()
rpadmanabhan commented 4 years ago

I see this as well, i think it can lead to ambiguity when filtering reads by their base quality during variant calling if the qualities are getting summed.

dejonggr commented 4 years ago

I'm also seeing this behaviour. Has anyone resolved this without ditching the pileup class?

tinyheero commented 3 years ago

Running into the same issue. It seems to be a bug because samfile.pileup() has an argument ignore_overlaps with the following description:

ignore_overlaps: bool

  If set to True, detect if read pairs overlap and only take
  the higher quality base. This is the default.

So it should be taking the higher quality base, but it's not and instead summing them up.

tinyheero commented 3 years ago

Dug into the issue a bit more and turns out that this behaviour is associated with mpileup:

https://github.com/pysam-developers/pysam/blob/c6275315a8086858dae67f94c5caaf79a020fc46/pysam/libchtslib.pxd#L1221

This post was also helpful (https://github.com/samtools/samtools/issues/1146#issuecomment-559756496).

My understanding is that the sum of the base qualities is the expected behaviour when the overlapping reads have the same base at that position. You can turn off the handling of overlapping reads by using setting ignore_overlaps=False, which is equivalent to --ignore-overlaps in mpileup, and then handle it the reads independently.