schneebergerlab / syri

Synteny and Rearrangement Identifier
https://schneebergerlab.github.io/syri/
MIT License
303 stars 36 forks source link

nearly identical HDRs #236

Closed mparker2 closed 1 month ago

mparker2 commented 4 months ago

Hey @mnshgl0110 ,

I ran syri to compare the Ishikawa accession from this dataset against the TAIR10 Col-0 genome. For whatever reason there is a ~80 bp break in the minimap2 alignment on Chromosome 2 inside a transposable element AT2G06245. However, the unaligned region actually only differs by a single SNP (4th base from the end of the HDR sequence): ishikawa_hdr

Chr2 2448336 HDR429 ATCCGTCCCAATACTTCGTCGTTTTAAGTATTAGGGTTTCGGAATATTTGGCTATAAGTAGCATGTACTTCACATTTTCGCAA ATCCGTCCCAATACTTCGTCGTTTTAAGTATTAGGGTTTCGGAATATTTGGCTATAAGTAGCATGTACTTCACATTTTCACAA . PASS END=2448418;ChrB=Chr2;StartB=2707475;EndB=2707557;Parent=SYN41;VarType=ShV;DupType=.

This leads to some unexpected results with downstream tools (e.g. bcftools treats it as a SNP but cellsnp-lite does not).

Although the issue really stems from the minimap2 alignment, it would be quite easy for syri to catch it by performing an alignment of the HDR sequences. I think many HDRs could be resolved this way. Is this something you've considered?

all the best Matt

mnshgl0110 commented 4 months ago

Hi Matt, we have seen cases where the HDRs are caused by sequence divergence towards the end of the annotated HDR, while the central regions are mostly similar, but this one with only 1bp difference is indeed the most extreme example I have seen. This would be a question for the minimap2 developers though. Syri, currently, do not align/realign genomic regions so it would be difficult to handle the issue using it.

mparker2 commented 4 months ago

Agreed its a minimap2 issue really.

I understand that you do not want to add any sequence alignment functionality to syri, which makes sense. I think in this case the reason why it causes unexpected downstream results is because the two HDRs are of identical length, so some tools treat it as an "multi-nucleotide polymorphism". It seems like bcftools for example is actually performing a comparison on the two HDR seqs and deciding it is actually a SNP, hence why it does not get filtered out by bcftools view -i 'TYPE="snp"'.

Perhaps for identical length HDRs syri could perform a comparison and resolve any examples like this one, that only differ by one base, to a SNP. This wouldn't require any alignment, just an edit distance comparison.

BTW this is the script I wrote that does the alignment of HDR seqs. In a lot of cases it actually does resolve the HDRs into SNPs and Indels that seem reasonable.

mparker2 commented 1 month ago

fyi I have found that many of the HDRs can be closed to produce insertions or deletions by increasing the -z parameter of minimap2 (with a cost to alignment speed).