Closed ftian9 closed 4 months ago
Actually, this mismatch ratio reflects the degree of impact of alignment artifacts (e.g. perhaps caused by segDups). Therefore, we used the ‘total_count / ref_span’ as the mismatch ratio. If you visualize an artifact alignment region using IGV, you might find there were massive mismatches or short indels (1-2bp). Moreover, the denominator is the reference span of the read alignment, not the read length. Therefore, this kind of ‘total_count / ref_span’ works to find alignment artifacts.
It's true ‘total_size / ref_span’ also seems to be right, while what if there is a real but large deletion (or insertion) in a read? That will make the ‘total_size / ref_span’ high, and we would mistakenly considered this read as a alignment artifact.
It true that not all aligners output 'X'. For example, among the three most frequently used aligners, minimap2 and pbmm2 output 'X' while ngmlr dosen't. We will fix this in the new version.
Many thanks to your advices.
https://github.com/songbowang125/SVision-pro/blob/02c935480e869adc92ec94603669aa4466699269/src/cigar_op.py#L469
Currently it is the ratio of mismatch or indel occurence time to aligned reference length.
It should be the ratio of mismatch or indel total size to aligned reference length:
More importantly, many aligners cannot output "X" in CIGAR for mismatches. So it is better to incorporate the MD tag to calculate mismatch ratios accurately.