Closed aays closed 4 years ago
Simple function to cache reads so that mate pairs with the same query name are stored together in a list
def cache_pairs(bam_file_obj):
cache = {}
paired = 0
total = 0
for record in bam_file_obj:
name = record.query_name
total += 1
if name not in cache:
cache[name] = [record,None]
else:
cache[name][1] = record
paired += 1
print('Paired: {paired}, Unpaired: {unpaired}'.format(paired=paired, unpaired=total-paired))
return cache
Also prints the number of paired and unpaired reads to console
Our scripts currently just look at one read at a time, and it means that we're encountering lots of reads where there aren't enough SNPs to call phase changes. Scanning for phase changes across mate pairs will involve creating a lookup table and then using that to create tuples of bam records, from which haplotype phases can be individually determined and then combined (since phase change detection currently just boils down to looking for more than one parental phase in a read). The length of sequence between them can be ignored for now.
The ids in the qname column alongside rnext and pnext can be used to detect mate reads. See SAM specification.