pysam-developers / pysam

Pysam is a Python package for reading, manipulating, and writing genomics data such as SAM/BAM/CRAM and VCF/BCF files. It's a lightweight wrapper of the HTSlib API, the same one that powers samtools, bcftools, and tabix.
https://pysam.readthedocs.io/en/latest/
MIT License
782 stars 274 forks source link

in pileup, discrepancy between stepper all and stepper samtools reads #1197

Closed Pagey closed 1 year ago

Pagey commented 1 year ago

Dear all,

Thanks dearly for this invaluable library.

I've not been able to understand from the documentation the precise differences in behavior between the steppers 'all' and 'samtools' in their default settings, which by the docs seem to indicate that they have the same filters engaged (BAM_FUNMAP | BAM_FSECONDARY | BAM_FQCFAIL | BAM_FDUP). Printed also base quality which is above samtools default minimum of 13

Here's a sample code in which different results are obtained from both: (pysam version 0.21.0, Python 3.6.9, Ubuntu 18.04)

import pysam

# from http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00120/alignment/HG00120.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam
# in ubuntu executed:
# wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00120/alignment/HG00120.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam
# wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00120/alignment/HG00120.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.bai

BAM_FILENAME = 'HG00120.chrom11.ILLUMINA.bwa.GBR.exome.20120522.bam'

chr = 11
coord = 76999

def print_pileup(pileupcolumn):
    for pileupread in pileupcolumn.pileups:
        query_name = pileupread.alignment.query_name
        base_now = pileupread.alignment.query_sequence[ pileupread.query_position]
        mq_now = pileupread.alignment.mapping_quality
        bq_now = pileupread.alignment.query_qualities[ pileupread.query_position]

        print('query_name: ' + query_name +' base_now: '+ base_now + ' mq_now: ' + str(mq_now) + ' bq_now: ' + str(bq_now))

print('stepper all')
samfile = pysam.AlignmentFile(BAM_FILENAME, "rb" )
for pileupcolumn in samfile.pileup(contig=str(chr),start=coord,stop=coord+1,truncate=True, stepper = 'all'):
    print_pileup(pileupcolumn)

print('stepper samtools')
samfile = pysam.AlignmentFile(BAM_FILENAME, "rb" )
for pileupcolumn in samfile.pileup(contig=str(chr),start=coord,stop=coord+1,truncate=True, stepper = 'samtools'):
    print_pileup(pileupcolumn)

output:

ubuntu@ip-XXX~$ python3 stepper.py stepper all query_name: SRR099961.15186690 base_now: A mq_now: 18 bq_now: 21 query_name: SRR099961.45087351 base_now: A mq_now: 36 bq_now: 67 query_name: SRR099961.113296055 base_now: A mq_now: 14 bq_now: 31 stepper samtools query_name: SRR099961.15186690 base_now: A mq_now: 18 bq_now: 21 query_name: SRR099961.45087351 base_now: A mq_now: 36 bq_now: 67 ubuntu@ip-XXX~$

**

PS another question- the wording of the documentation is that for stepper samtools, min_base_quality is defined as all greater or equal and min_mapping_quality is defined as only greater- is this the correct behavior? from the docs:

min_base_quality (int) – Minimum base quality. Bases below the minimum quality will not be output. The default is 13. adjust_capq_threshold (int) – adjust mapping quality. The default is 0 for no adjustment. The recommended value for adjustment is 50. min_mapping_quality (int) – only use reads above a minimum mapping quality. The default is 0.

Pagey commented 1 year ago

adding ignore_orphans = False to samtools and ignore_overlaps = False to both makes outputs mostly identical