philres / ngmlr

NGMLR is a long-read mapper designed to align PacBio or Oxford Nanopore reads (standard and ultra-long) to a reference genome, with a focus on reads that span structural variations.
MIT License

calculate the identity of alignments #63

Closed LinXialab closed 5 years ago

LinXialab commented 5 years ago

Hi,

We want to compare the identity of alignments produced by minimap2 and NGMLR. Could you please tell us how to calculate the identity from the SAM file generated by NGMLR with default parameters?

Looking forward to your reply!

wdecoster commented 5 years ago

You will have to parse the MD string for that because, if I remember correctly, NGMLR doesn't write the NM tag.

In Python:

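import re
import pysam

# Per-read edit distance divided by the aligned query length.
# Subtract each value from 1 to get an identity-like fraction.
edit_distances = []
for read in pysam.AlignmentFile("yourfile.bam", "rb"):
    # Unmapped reads have no CIGAR or MD tag, so skip them
    if read.is_unmapped or not read.has_tag("MD"):
        continue
    # Letters in the MD string are mismatched or deleted reference bases
    mismatches_deletions = sum(len(item) for item in re.split("[0-9^]", read.get_tag("MD")))
    # CIGAR operation 1 is an insertion; item[1] is its length
    insertions = sum(item[1] for item in read.cigartuples if item[0] == 1)
    edit_distances.append((mismatches_deletions + insertions) / read.query_alignment_length)

LinXialab commented 5 years ago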

Thanks for your reply. The Python script you provided will help us a lot. Another question: could you please tell us what the tag "XI:f:" means? From the source, it looks like the number following "XI:f:" represents identity (https://github.com/philres/ngmlr/blob/master/src/SAMWriter.cpp).

wdecoster commented 5 years ago

Based on that you can get the alignment identity from the XI tag then. Cool, was not aware that this tag was there.
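For anyone who wants to pull that value out without extra dependencies, here is a minimal sketch that reads the XI:f: field straight from a SAM text line. It assumes the tag is written as a float in the optional columns, as the SAMWriter.cpp link above suggests. The function name `xi_from_sam_line` and the example record are hypothetical and made up for illustration.

```python
def xi_from_sam_line(sam_line):
    """Return the XI:f: value of one SAM record as a float, or None if absent."""
    fields = sam_line.rstrip("\n").split("\t")
    # The 11 mandatory SAM columns come first; optional TAG:TYPE:VALUE fields follow.
    for field in fields[11:]:
        if field.startswith("XI:f:"):
            return float(field[len("XI:f:"):])
    return None


# Hypothetical record, invented for illustration (not real NGMLR output).
example = "\t".join([
    "read1", "0", "chr1", "100", "60", "8M", "*", "0", "0",
    "ACGTACGT", "*", "XI:f:0.875",
])
print(xi_from_sam_line(example))
```

If you are already iterating over a BAM file with pysam, `read.get_tag("XI")` should give you the same value.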

fritzsedlazeck commented 5 years ago

Yes, the XI tag gives the alignment identity. Thanks, Fritz

LinXialab commented 5 years ago

Excuse me, there are many kinds of identity, such as BLAST identity. Could you please tell me which kind of identity the 'XI' tag represents?

fritzsedlazeck commented 5 years ago

It's the number of differences in the alignment divided by the alignment length. Cheers, Fritz
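To make that ratio concrete, here is a rough sketch that computes it from a CIGAR string and an MD string. It rests on two assumptions that are not confirmed in this thread: that "differences" means mismatches plus inserted plus deleted bases, and that "alignment length" means the number of alignment columns (M/=/X + I + D). The function name `difference_fraction` and the example values are hypothetical.

```python
import re


def difference_fraction(cigar, md):
    """Differences divided by alignment columns (assumed definition, see above)."""
    ops = [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]
    insertions = sum(n for n, op in ops if op == "I")
    deletions = sum(n for n, op in ops if op == "D")
    aligned = sum(n for n, op in ops if op in "M=X")
    # In MD, letters after '^' are deleted bases; the remaining letters are mismatches.
    md_without_deletions = re.sub(r"\^[A-Za-z]+", "", md)
    mismatches = len(re.findall(r"[A-Za-z]", md_without_deletions))
    columns = aligned + insertions + deletions
    return (mismatches + insertions + deletions) / columns


# Made-up alignment: 1 mismatch, 2 inserted bases, 2 deleted bases, 26 columns.
fraction = difference_fraction("11M2I5M2D6M", "10A5^AC6")
print(fraction)      # 5 differences / 26 columns
print(1 - fraction)  # the corresponding identity under the same assumptions
```

Other definitions change the denominator (for example, aligned query length instead of alignment columns), so values from different tools are only comparable when computed the same way.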