pysam-developers / pysam

Pysam is a Python package for reading, manipulating, and writing genomics data such as SAM/BAM/CRAM and VCF/BCF files. It's a lightweight wrapper of the HTSlib API, the same one that powers samtools, bcftools, and tabix.
https://pysam.readthedocs.io/en/latest/
MIT License
773 stars 274 forks source link

about read.get_aligned_pairs(matches_only=True, with_seq=True) #1283

Open bioinformatica opened 4 months ago

bioinformatica commented 4 months ago

I want to count the number of reads for the mutation T>C, I used the following code:

        query_seq = read.query_sequence
        ref_seq = reference.fetch(read.reference_name, read.reference_start, read.reference_end)

        alignments = read.get_aligned_pairs(matches_only=True, with_seq=True)

        for query_pos, ref_pos, ref_base in alignments:
            if ref_base.upper() == 'T' and query_seq[query_pos].upper() == 'C':
                tc_count += 1
                print(query_seq)
                print(ref_seq)
                break

Output: CATTTCTTAGTCCTACACACTAAAAACAGTAATGTGCTGCTTTGAGTGTGTAGGACTAAGAAATGGGATTCAGAGTAGTAAAGAGAAAAGTGGAATTTCCAAGCACTATGAATTACTGTTCTTTAAAAAACAGCAAAAATC

ACTTTAAAAACAGTAATGTGCTGCTTTGAGTGTGTAGGACTAAGAAATGGGATTCAGAGTAGTAAAGAGAAAAGTGGAATTTCCAAGCACTATGAATTACTGTTCTTTAAAAAACAGCAAAAATC

I set matches_only=True. Normally, tccount is always equal to 0. Why are their lengths different? causing tccount to increase

bioinformatica commented 4 months ago

If matches_only=True is set, will only the exact same base pairs be returned? How should I change the code to count the number of reads with t>C?