schneebergerlab / syri

Synteny and Rearrangement Identifier
https://schneebergerlab.github.io/syri/
MIT License

Consult DUP/INVDP for details #248

Open Gene2love opened 2 months ago

Gene2love commented 2 months ago

Hi, @mnshgl0110

I have some questions regarding DUP/INVDP. In the syri.out file, DUP/INVDP are described as copyloss or copygain, but I find that their sizes are the same in both the reference and query genomes, as shown:

Chr1 1291592 1292064 - - Chr1 1030293 1030767 DUP21289 - DUP copygain ref_length=472 ; query_length=474.

Regarding copygain, I have looked at issue 102. Based on the position information provided by DUP/INVDP, I can locate the duplicated region in the query genome (i.e., Chr1 1030293 1030767), but I did not find another copy in the query genome, because the query sequence extracted from the reference genome's position (i.e., Chr1 1291592 1292064) is completely different from the duplicated region's sequence. Here are the details:

Chr1 1291592 1292064 - - Chr1 1030293 1030767 DUP21289 - DUP copygain ref_length=472 ; query_length=474

For copyloss, can it also be understood that there are two duplicated regions in the reference genome, while there is only one in the query genome (assuming the sequences of the duplicated regions are consistent)? If so, how can I determine the position of the other duplicated region in the reference genome? As follows:

Chr1 275130 275326 - - Chr1 242876 243072 DUP21277 - DUP copyloss

Lastly, I noticed that DUP/INVDP types appear on two different chromosomes, as follows:

Chr1 233383 234314 - - Chr11 25410508 25411440 INVDP27781 - INVDP copygain
Chr1 233383 234316 - - Chr10 12158992 12159906 DUP27782 - DUP copygain
Chr1 1000 1113 - - Chr10 22719799 22719912 INVDP27766 - INVDP copyloss
Chr1 43171 43541 - - Chr9 4856993 4857364 DUP27768 - DUP copyloss

Why are these not classified as translocations instead of duplications?

Looking forward to your reply and wishing you all the best in your work! Best regards, Moon

mnshgl0110 commented 2 months ago

Hi Moon,

> but I find that their sizes are the same in both the reference and query genomes

Yes, that is the expected behavior. The DUP:copygain annotation means that the described reference region is duplicated to the described query region. The reference region would have another copy in the query genome, and that copy could be syntenic, inverted, or translocated. Further, one reference region could have multiple duplicated copies in the query genome.

> the reference genome's position (i.e., Chr1 1291592 1292064) is completely different from the duplicated region's sequence

Do you mean that the sequences ref:Chr1:1291592-1292064 and qry:Chr1:1030293-1030767 are completely different? Have you checked the input alignment file to see whether these regions align?
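One way to check the alignment file is sketched below. It is only an illustration, and it assumes the alignments are available in PAF format with the reference genome as the PAF target and the query genome as the PAF query. The function name `paf_hits` and the two example PAF lines are hypothetical. PAF coordinates are 0-based and end-exclusive, so the start is shifted by one before comparing with the 1-based coordinates from syri.out.

```python
from typing import Iterable, List, Tuple

Region = Tuple[str, int, int]  # (chromosome, start, end), 1-based inclusive


def paf_hits(paf_lines: Iterable[str], ref: Region, qry: Region) -> List[List[str]]:
    """Return PAF records that overlap both the reference and the query region.

    Assumes the reference genome is the PAF target (columns 6-9) and the
    query genome is the PAF query (columns 1-4).
    """
    hits = []
    for line in paf_lines:
        f = line.rstrip("\n").split("\t")
        if len(f) < 12:  # a PAF record has 12 mandatory columns
            continue
        qname, qstart, qend = f[0], int(f[2]), int(f[3])
        tname, tstart, tend = f[5], int(f[7]), int(f[8])
        # Convert 0-based half-open PAF intervals to 1-based inclusive.
        ref_ok = tname == ref[0] and tstart + 1 <= ref[2] and tend >= ref[1]
        qry_ok = qname == qry[0] and qstart + 1 <= qry[2] and qend >= qry[1]
        if ref_ok and qry_ok:
            hits.append(f)
    return hits


# Hypothetical PAF records, made up for illustration only.
example_paf = [
    "Chr1\t30000000\t1030200\t1030800\t+\tChr1\t30000000\t1291500\t1292100\t580\t600\t60",
    "Chr1\t30000000\t1030200\t1030800\t+\tChr1\t30000000\t500000\t500600\t580\t600\t60",
]

hits = paf_hits(
    example_paf,
    ref=("Chr1", 1291592, 1292064),  # reference coordinates from the DUP row
    qry=("Chr1", 1030293, 1030767),  # query coordinates from the DUP row
)
for h in hits:
    print("ref", h[5], h[7], h[8], "qry", h[0], h[2], h[3], "strand", h[4])
```

If no record overlaps both regions, the two regions are not linked by any alignment in the file, which would be worth reporting together with the alignment command used.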

> For copyloss, can it also be understood that there are two duplicated regions in the reference genome, while there is only one in the query genome (assuming the sequences of the duplicated regions are consistent)?

Yes.

> If so, how can I determine the position of the other duplicated region in the reference genome?

You can search for the rows that overlap the query coordinates. SyRI has a reganno script that can be used for this.
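The overlap search can also be done by hand. The following is a minimal standalone sketch, not SyRI's own script. It assumes the 12-column syri.out layout seen in the rows quoted in this thread (reference chromosome, start, end, two sequence columns, query chromosome, start, end, ID, parent ID, type, copy status). The names `Row`, `parse_rows` and `overlapping` are hypothetical, and the two SYN rows in the example are made up for illustration; only the DUP21277 row comes from this issue.

```python
from typing import Iterable, List, NamedTuple


class Row(NamedTuple):
    ref_chr: str
    ref_start: int
    ref_end: int
    qry_chr: str
    qry_start: int
    qry_end: int
    uid: str
    parent: str
    sv_type: str
    copy_status: str


def parse_rows(lines: Iterable[str]) -> List[Row]:
    """Parse syri.out-style lines, skipping rows without numeric coordinates."""
    rows = []
    for line in lines:
        f = line.split()
        if len(f) < 12:
            continue
        try:
            rs, re_, qs, qe = (int(f[i]) for i in (1, 2, 6, 7))
        except ValueError:  # coordinates given as "-" on one side
            continue
        rows.append(Row(f[0], rs, re_, f[5], qs, qe, f[8], f[9], f[10], f[11]))
    return rows


def overlapping(rows: List[Row], chrom: str, start: int, end: int,
                side: str = "qry") -> List[Row]:
    """Return rows whose query (or reference) interval overlaps the region."""
    lo, hi = min(start, end), max(start, end)
    hits = []
    for r in rows:
        if side == "qry":
            c, a, b = r.qry_chr, r.qry_start, r.qry_end
        else:
            c, a, b = r.ref_chr, r.ref_start, r.ref_end
        a, b = min(a, b), max(a, b)  # inverted rows may list start > end
        if c == chrom and a <= hi and b >= lo:
            hits.append(r)
    return hits


example = [
    "Chr1 275130 275326 - - Chr1 242876 243072 DUP21277 - DUP copyloss",  # from this issue
    "Chr1 240000 246000 - - Chr1 241000 247000 SYN999 - SYN -",           # hypothetical
    "Chr1 500000 500500 - - Chr1 900000 900500 SYN1000 - SYN -",          # hypothetical
]

rows = parse_rows(example)
dup = rows[0]
# For copyloss, look for other rows covering the same query interval.
others = [r for r in overlapping(rows, dup.qry_chr, dup.qry_start, dup.qry_end)
          if r.uid != dup.uid]
for r in others:
    print(r.uid, r.sv_type, r.ref_chr, r.ref_start, r.ref_end)
```

The reference coordinates of the rows returned this way point to the other copy in the reference genome. For copygain rows the same search can be run with `side="ref"` on the reference coordinates to find where else the region maps in the query genome.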

> Why are these not classified as translocations instead of duplications?

There would be another copy that is syntenic, inverted, or translocated. You can also check supplementary figures S4-S7 for an explanation of how duplication identification works.

I hope this helps.

Gene2love commented 1 month ago

Sorry for the late reply. Thank you for your reply and I will read your suggestions carefully. With best wishes