ness-lab / recombinant-reads

1 stars 0 forks source link

Expand phase change detection script to incorporate mate pairs #6

Closed aays closed 4 years ago

aays commented 4 years ago

Our scripts currently just look at one read at a time, and it means that we're encountering lots of reads where there aren't enough SNPs to call phase changes. Scanning for phase changes across mate pairs will involve creating a lookup table and then using that to create tuples of bam records, from which haplotype phases can be individually determined and then combined (since phase change detection currently just boils down to looking for more than one parental phase in a read). The length of sequence between them can be ignored for now.

The ids in the qname column alongside rnext and pnext can be used to detect mate reads. See SAM specification.

jiyvliu commented 4 years ago

Simple function to cache reads so that mate pairs with the same query name are stored together in a list

def cache_pairs(bam_file_obj):

    cache = {}

    paired = 0
    total = 0

    for record in bam_file_obj:
        name = record.query_name
        total += 1

        if name not in cache:
            cache[name] = [record,None]
        else:
            cache[name][1] = record
            paired += 1

    print('Paired: {paired}, Unpaired: {unpaired}'.format(paired=paired, unpaired=total-paired))    
    return cache

Also prints the number of paired and unpaired reads to console