schneebergerlab / syri

Synteny and Rearrangement Identifier
https://schneebergerlab.github.io/syri/
MIT License

REF and ALT start positions do not match #186

Closed by zzbbf123 1 year ago

zzbbf123 commented 1 year ago

I'm confused about a record I found in the VCF file, where the REF column is N and the ALT column is <DEL>:

DeChr1 169713 DEL5 N <DEL> . PASS END=169920;ChrB=AlChr1;StartB=796674;EndB=796674;Parent=SYN13;VarType=ShV;DupType=.

I know this can happen when the length of the DEL/INS is greater than 100, as discussed in issue 132: https://github.com/schneebergerlab/syri/issues/132

However, when I extracted the sequences from the genomes, things got weird: the base at the start position in genome A did not match the base at the start position in genome B (C ≠ A).

DeChr1:169713-169920 
CGTTTAACTTTATTATTGTATGGATCAGGCCGTGTTTTGGCCACTGAGTTGTAAAACAATTTTTCCTTGGCTTTTATCTCCCCCCCCCCCTTTCTTTTTATTTAATATCATTAATGCTCTCTCTTTTCTTTTCTTTTCTTTTTATATACTTGTTTACGATGGTTCTTGGTAAATTTATGATTAATTTGACAAGTATTCACAAGTGTTA
AlChr1: 796674
A

This is just one small example. I don't know what to make of this situation. Should this type of SV be deleted, or how do I fix the problem?

mnshgl0110 commented 1 year ago

The coordinates in the VCF are 1-based and inclusive. If you are using a BED-based sequence extraction method (BED start positions are 0-based), you can see this mismatch. In that case, try getting the sequence for "DeChr1:169712-169920" and "AlChr1:796673-796674".
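
The off-by-one described above can be illustrated with a minimal Python sketch. The helper names `vcf_to_bed` and `extract_1based` are hypothetical and are not part of SyRI; the sketch assumes the extraction tool expects BED-style intervals (0-based start, end-exclusive) and uses a short made-up sequence rather than the real genome.

```python
def vcf_to_bed(chrom, pos, end):
    """Convert a 1-based, inclusive VCF interval to a 0-based, half-open BED interval.

    Hypothetical helper for illustration; not part of SyRI.
    """
    if pos < 1 or end < pos:
        raise ValueError("expected 1 <= pos <= end")
    # Only the start shifts: the inclusive 1-based end equals the exclusive 0-based end.
    return chrom, pos - 1, end


def extract_1based(seq, pos, end):
    """Return bases pos..end (1-based, inclusive) from an in-memory sequence string."""
    return seq[pos - 1:end]


# The record from the question: POS=169713, END=169920 on DeChr1.
bed_interval = vcf_to_bed("DeChr1", 169713, 169920)

# Toy sequence (made up) showing what happens when the shift is forgotten.
toy = "ACGTACGT"
correct = extract_1based(toy, 3, 5)   # bases 3, 4 and 5
shifted = toy[3:5]                    # 1-based start misread as 0-based: first base is lost

print(bed_interval)
print(correct, shifted)
```

Passing a 1-based start directly to a 0-based tool moves the extracted window one base to the right, so the first base returned is not the one at POS.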