Incorrect mismatch ratio

songbowang125 / SVision-pro

GNU General Public License v3.0

33 stars 3 forks source link

https://github.com/songbowang125/SVision-pro/blob/02c935480e869adc92ec94603669aa4466699269/src/cigar_op.py#L469

Currently it is the ratio of mismatch or indel occurence time to aligned reference length.

It should be the ratio of mismatch or indel total size to aligned reference length:

def cal_align_mismatch_ratio(cigar_tuples, align_ref_start, align_ref_end):
    op_counter = {8: 0, 1: 0, 2: 0}
    # 8 denotes "X"; See https://pysam.readthedocs.io/en/latest/api.html#pysam.AlignedSegment.cigartuples
    for ele_type, ele_size in cigar_tuples:
        if ele_type not in op_counter:
            op_counter[ele_type] = 0
        op_counter[ele_type] += ele_size
    return (op_counter[8] + op_counter[1] + op_counter[2]) / (align_ref_end - align_ref_start)

align_mismatch_ratio = cal_align_mismatch_ratio(align.cigartuples, align.reference_start, align.reference_end)

More importantly, many aligners cannot output "X" in CIGAR for mismatches. So it is better to incorporate the MD tag to calculate mismatch ratios accurately.

Actually, this mismatch ratio reflects the degree of impact of alignment artifacts (e.g. perhaps caused by segDups). Therefore, we used the ‘total_count / ref_span’ as the mismatch ratio. If you visualize an artifact alignment region using IGV, you might find there were massive mismatches or short indels (1-2bp). Moreover, the denominator is the reference span of the read alignment, not the read length. Therefore, this kind of ‘total_count / ref_span’ works to find alignment artifacts.

It's true ‘total_size / ref_span’ also seems to be right, while what if there is a real but large deletion (or insertion) in a read? That will make the ‘total_size / ref_span’ high, and we would mistakenly considered this read as a alignment artifact.

It true that not all aligners output 'X'. For example, among the three most frequently used aligners, minimap2 and pbmm2 output 'X' while ngmlr dosen't. We will fix this in the new version.

Many thanks to your advices.

songbowang125 / SVision-pro

Incorrect mismatch ratio #6