Closed mparker2 closed 1 month ago
Hi Matt, we have seen cases where the HDRs are caused by sequence divergence towards the end of the annotated HDR, while the central regions are mostly similar, but this one with only 1bp difference is indeed the most extreme example I have seen. This would be a question for the minimap2 developers though. Syri, currently, do not align/realign genomic regions so it would be difficult to handle the issue using it.
Agreed its a minimap2 issue really.
I understand that you do not want to add any sequence alignment functionality to syri, which makes sense. I think in this case the reason why it causes unexpected downstream results is because the two HDRs are of identical length, so some tools treat it as an "multi-nucleotide polymorphism". It seems like bcftools for example is actually performing a comparison on the two HDR seqs and deciding it is actually a SNP, hence why it does not get filtered out by bcftools view -i 'TYPE="snp"'
.
Perhaps for identical length HDRs syri could perform a comparison and resolve any examples like this one, that only differ by one base, to a SNP. This wouldn't require any alignment, just an edit distance comparison.
BTW this is the script I wrote that does the alignment of HDR seqs. In a lot of cases it actually does resolve the HDRs into SNPs and Indels that seem reasonable.
fyi I have found that many of the HDRs can be closed to produce insertions or deletions by increasing the -z
parameter of minimap2 (with a cost to alignment speed).
Hey @mnshgl0110 ,
I ran syri to compare the Ishikawa accession from this dataset against the TAIR10 Col-0 genome. For whatever reason there is a ~80 bp break in the minimap2 alignment on Chromosome 2 inside a transposable element AT2G06245. However, the unaligned region actually only differs by a single SNP (4th base from the end of the HDR sequence):![ishikawa_hdr](https://github.com/schneebergerlab/syri/assets/15102604/8ecea9db-8207-421a-bb7b-aa071b530904)
Chr2 2448336 HDR429 ATCCGTCCCAATACTTCGTCGTTTTAAGTATTAGGGTTTCGGAATATTTGGCTATAAGTAGCATGTACTTCACATTTTCGCAA ATCCGTCCCAATACTTCGTCGTTTTAAGTATTAGGGTTTCGGAATATTTGGCTATAAGTAGCATGTACTTCACATTTTCACAA . PASS END=2448418;ChrB=Chr2;StartB=2707475;EndB=2707557;Parent=SYN41;VarType=ShV;DupType=.
This leads to some unexpected results with downstream tools (e.g. bcftools treats it as a SNP but cellsnp-lite does not).
Although the issue really stems from the minimap2 alignment, it would be quite easy for syri to catch it by performing an alignment of the HDR sequences. I think many HDRs could be resolved this way. Is this something you've considered?
all the best Matt