songbowang125 / SVision-pro

GNU General Public License v3.0
33 stars 3 forks source link

Incorrect mismatch ratio #6

Closed ftian9 closed 4 months ago

ftian9 commented 4 months ago

https://github.com/songbowang125/SVision-pro/blob/02c935480e869adc92ec94603669aa4466699269/src/cigar_op.py#L469

Currently it is the ratio of mismatch or indel occurence time to aligned reference length.

It should be the ratio of mismatch or indel total size to aligned reference length:

def cal_align_mismatch_ratio(cigar_tuples, align_ref_start, align_ref_end):
    op_counter = {8: 0, 1: 0, 2: 0}
    # 8 denotes "X"; See https://pysam.readthedocs.io/en/latest/api.html#pysam.AlignedSegment.cigartuples
    for ele_type, ele_size in cigar_tuples:
        if ele_type not in op_counter:
            op_counter[ele_type] = 0
        op_counter[ele_type] += ele_size
    return (op_counter[8] + op_counter[1] + op_counter[2]) / (align_ref_end - align_ref_start)

align_mismatch_ratio = cal_align_mismatch_ratio(align.cigartuples, align.reference_start, align.reference_end)

More importantly, many aligners cannot output "X" in CIGAR for mismatches. So it is better to incorporate the MD tag to calculate mismatch ratios accurately.

songbowang125 commented 4 months ago

Actually, this mismatch ratio reflects the degree of impact of alignment artifacts (e.g. perhaps caused by segDups). Therefore, we used the ‘total_count / ref_span’ as the mismatch ratio. If you visualize an artifact alignment region using IGV, you might find there were massive mismatches or short indels (1-2bp). Moreover, the denominator is the reference span of the read alignment, not the read length. Therefore, this kind of ‘total_count / ref_span’ works to find alignment artifacts.

It's true ‘total_size / ref_span’ also seems to be right, while what if there is a real but large deletion (or insertion) in a read? That will make the ‘total_size / ref_span’ high, and we would mistakenly considered this read as a alignment artifact.

It true that not all aligners output 'X'. For example, among the three most frequently used aligners, minimap2 and pbmm2 output 'X' while ngmlr dosen't. We will fix this in the new version.

Many thanks to your advices.