schneebergerlab / syri

Synteny and Rearrangement Identifier
https://schneebergerlab.github.io/syri/
MIT License
303 stars 36 forks

microhomology at ends of insertions #249

Closed mparker2 closed 1 month ago

mparker2 commented 1 month ago

Hi @mnshgl0110,

I have noticed that minimap2 often breaks alignments at insertions that have microhomology on either side of them. This results in two alignments that overlap at the homologous ends. Syri seems to handle these overlapping alignments differently in different places. Sometimes the insertion is rescued despite not being in the alignment; other times the insertion is not rescued and the microhomologous overlap of the two alignments is called as a CPG. Could you please explain how the decision is made either way? Is it just about the length of the overlap?

mnshgl0110 commented 1 month ago

> Is it just about the length of the overlap?

Yes. The length of overlap that is tolerated before the region is annotated as CPG/CPL can be adjusted using the --allow-offset parameter.
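As an illustration only (this is not syri's code), the length-based rule described in this reply can be sketched as follows. The function names `overlap_length` and `classify_junction` are hypothetical, the coordinates are assumed to be 1-based and inclusive, and the strict `>` comparison against the offset is an assumption based on the "default offset parameter of >5" wording used later in this thread.

```python
def overlap_length(first_end: int, second_start: int) -> int:
    """Number of bases shared by two adjacent alignments on one genome.

    Assumes 1-based, inclusive coordinates (hypothetical convention).
    Returns 0 when the alignments do not overlap.
    """
    return max(0, first_end - second_start + 1)


def classify_junction(first_end: int, second_start: int, allow_offset: int = 5) -> str:
    """Label the junction between two adjacent alignments.

    Overlaps longer than `allow_offset` are labelled as a copy-number
    annotation; shorter overlaps are tolerated. The strict '>' is an assumption.
    """
    if overlap_length(first_end, second_start) > allow_offset:
        return "CPG/CPL"
    return "overlap tolerated"


# Two alignments sharing 10 bases versus two sharing 5 bases.
long_overlap = classify_junction(first_end=1010, second_start=1001)
short_overlap = classify_junction(first_end=1005, second_start=1001)
```

Under these assumptions, raising `allow_offset` moves more junctions from the first category into the second, which is the behaviour the reply describes.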

mparker2 commented 1 month ago

aha ok, I understand now, thanks. So now I am also wondering why these insertions and CPGs are mutually exclusive. For example, by fiddling with the Z-drop parameters of minimap2 I can get it to align through some of these breaks, and turn the overlapping broken alignments into single alignments with insertions:

image

In the above example, I've run syri on the first WGA using the default offset parameter of >5. Syri annotates the overlapping region as a CPG but does not include the insertion. By increasing the Z-drop threshold of minimap2 I get a (probably slightly better) alignment where the microhomologous duplication and the unaligned insertion are included as one left-aligned insertion.

image

In the second example, because the microhomology is shorter, syri annotates an insertion and ignores the overlap. But the insertion does not match the one produced by minimap2 when I increase the Z-drop threshold, for two reasons: the 5 bp of duplicated overlap are missing from the insertion, and the insertion is not left-aligned.

mnshgl0110 commented 1 month ago

The idea is to analyse the overlap between adjacent alignments to find the SV in the region. Syri reports one SV per pair of alignments; as such, CPGs and insertions are mutually exclusive. Indeed, for such regions the alignments and annotations are easily influenced by the selected parameters, and all your observations and inferences are correct. I guess you would need to figure out the set of parameters that works best for your downstream analysis. Though I would predict that no parameter set would give a "perfect" result throughout the genome, and there would always be some sub-optimal results.

mparker2 commented 1 month ago

I understand your logic, but making CPGs and insertions mutually exclusive does not make sense to me from a biological point of view, since it results in information loss. I would suggest, at a minimum, that the implementation of insertions and deletions be altered so that the positions are left-aligned and the longer allele includes the duplicated microhomologous region. Then one could set the offset parameter to an arbitrarily high number to capture all these overlapping alignments as indels that properly represent the underlying genomic sequences without information loss. Personally, I do not find the CPG/CPL annotations very useful, so I would be happy to forgo them for more accurate indels.
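To make the left-alignment suggestion concrete, here is a minimal sketch of the standard way to shift an insertion leftwards through repeated sequence. It is not taken from syri or minimap2; `left_align_insertion` and `apply_insertion` are hypothetical helpers, positions are assumed to be 0-based, and the toy sequences are invented for illustration.

```python
from typing import Tuple


def apply_insertion(ref: str, pos: int, ins: str) -> str:
    """Return the sequence obtained by inserting `ins` before ref[pos] (0-based)."""
    return ref[:pos] + ins + ref[pos:]


def left_align_insertion(ref: str, pos: int, ins: str) -> Tuple[int, str]:
    """Shift an insertion as far left as possible without changing the result.

    While the reference base just left of the insertion point equals the last
    inserted base, rotate the inserted sequence by one base and move the
    insertion point one position to the left.
    """
    while pos > 0 and ins and ref[pos - 1] == ins[-1]:
        ins = ins[-1] + ins[:-1]
        pos -= 1
    return pos, ins


# Toy example: "TTACGT" inserted after "AAACGT", with "ACGT" microhomology.
ref = "AAACGTGGG"
right_pos, right_ins = 6, "TTACGT"
left_pos, left_ins = left_align_insertion(ref, right_pos, right_ins)
```

Both representations give the same longer allele, but the left-aligned one starts at the leftmost equivalent position and its inserted sequence begins with the duplicated microhomology.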

Z-drop parameter tuning of minimap2 seems to help but is not a perfect solution because setting it too high causes minimap2 to incorrectly force poor syntenic alignments of small inversions.

I will close this issue now and add this suggestion to the discussion page.