heche-psb / wgd

wgd v2: a suite of tools to uncover and date ancient polyploidy and whole-genome duplication
https://wgdv2.readthedocs.io/en/latest/
GNU General Public License v3.0

Can Orthologous Isoforms be identified? #26

Closed Tang-pro closed 3 months ago

Tang-pro commented 3 months ago

Hi, @heche-psb

Homologous gene pairs are usually identified from comparative genomic data, but here I would like to use full-length transcriptomes to identify homologous isoforms between two species. Is it possible to do this?

Best wishes!

heche-psb commented 3 months ago

Hi, "homologous isoforms" usually refers to different transcripts of the same gene that originate from alternative splicing. Such isoforms are expected to be highly similar in sequence, which typically shows up as the bar near Ks ≈ 0 in a conventional Ks plot. For transcriptome assemblies, I usually use CD-HIT to drop isoforms before building the Ks distribution, and you can also try different cut-offs there to identify isoforms. Note that this question is about data preparation rather than the software wgd v2 itself. Another way to identify isoforms is to use the clustering results from wgd dmd together with the resulting diamond hit table and apply a filter similar to what CD-HIT does. For instance, within each cluster (i.e., the deduced gene family), you can filter out transcripts whose normalized similarity score to another member is higher than 0.95 and retain only the longest one. This achieves the same job as CD-HIT while using similarity scores that are less biased by gene length.
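A minimal sketch of the within-cluster filtering described above, assuming the normalized similarity scores have already been computed from the diamond hit table. The function name `drop_redundant_isoforms` and the inputs `lengths` and `scores` are hypothetical and not part of wgd v2. The greedy longest-first order imitates CD-HIT and is an assumption, not the only way to do it.

```python
def drop_redundant_isoforms(cluster, lengths, scores, threshold=0.95):
    """Greedy, CD-HIT-like filtering inside one cluster (deduced gene family).

    cluster   : list of transcript IDs in the same cluster
    lengths   : dict, transcript ID -> sequence length
    scores    : dict, (id_a, id_b) -> normalized similarity score
                (assumed symmetric; a missing pair is treated as 0.0)
    threshold : transcripts scoring above this against a retained,
                longer transcript are dropped as likely isoforms
    """
    def score(a, b):
        return scores.get((a, b), scores.get((b, a), 0.0))

    kept = []
    # Visit the longest transcripts first so the longest isoform is retained;
    # ties are broken by ID to keep the result deterministic.
    for tid in sorted(cluster, key=lambda t: (-lengths[t], t)):
        if all(score(tid, rep) <= threshold for rep in kept):
            kept.append(tid)
    return kept


if __name__ == "__main__":
    cluster = ["t1", "t2", "t3"]
    lengths = {"t1": 900, "t2": 1200, "t3": 800}
    scores = {("t1", "t2"): 0.98, ("t1", "t3"): 0.60, ("t2", "t3"): 0.55}
    print(drop_redundant_isoforms(cluster, lengths, scores))
```

Here `t1` is dropped because it scores above 0.95 against the longer `t2`, while `t3` is kept as a distinct member of the family. Lowering or raising `threshold` plays the same role as trying different CD-HIT cut-offs.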

Tang-pro commented 3 months ago

Hi, @heche-psb Here I want to identify the conserved isoforms between two species. If CD-HIT clustering is used, the differences between isoforms are lost. So I want to use software specifically designed to detect alternative splicing to extract the different isoform sequences of each gene. I have a question here: isn't wgd itself also based on aligning gene sequences? If I align these isoform sequences instead, is this approach feasible? It is difficult to compare isoforms within a species, but what about between species? Is it feasible to compare isoforms between two species? Thank you!

heche-psb commented 3 months ago

Hi, regarding "the conserved isoforms of two species", you can achieve it in two ways. The first takes two steps: 1) identify the isoforms per species, then 2) compare the obtained isoforms between the two species. The second is to jointly identify the isoforms per species and the conserved isoforms between the two species, based on the sequence clustering result and the similarity matrix. wgd v2 is not specifically designed for this purpose, but you can try calculating the gene length-normalized similarity scores first and then writing some custom scripts to retrieve the conserved isoforms.
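As a rough illustration of what such a custom script could look like for step 2 of the first approach, the sketch below pairs isoforms across species by reciprocal best score. Reciprocal best hit is an assumption chosen here for simplicity, not something wgd v2 provides, and the function name `reciprocal_best_pairs` and the `scores` input (cross-species normalized similarity scores) are hypothetical.

```python
def reciprocal_best_pairs(scores, min_score=0.0):
    """Pair isoforms between two species by reciprocal best score.

    scores    : dict, (isoform_in_species_A, isoform_in_species_B) -> score
    min_score : pairs scoring below this are ignored
    Returns a sorted list of (isoform_a, isoform_b, score) tuples.
    """
    best_for_a = {}  # isoform in A -> (best partner in B, score)
    best_for_b = {}  # isoform in B -> (best partner in A, score)
    for (a, b), s in scores.items():
        if s < min_score:
            continue
        if a not in best_for_a or s > best_for_a[a][1]:
            best_for_a[a] = (b, s)
        if b not in best_for_b or s > best_for_b[b][1]:
            best_for_b[b] = (a, s)
    # Keep a pair only when each isoform is the other's best partner.
    return sorted(
        (a, b, s)
        for a, (b, s) in best_for_a.items()
        if best_for_b[b][0] == a
    )


if __name__ == "__main__":
    scores = {
        ("A1.1", "B1.1"): 0.90,
        ("A1.1", "B1.2"): 0.70,
        ("A1.2", "B1.2"): 0.85,
        ("A1.2", "B1.1"): 0.60,
    }
    for pair in reciprocal_best_pairs(scores, min_score=0.5):
        print(pair)
```

In this toy input, `A1.1` pairs with `B1.1` and `A1.2` pairs with `B1.2`. Whether a reciprocal best pair really reflects a conserved splicing pattern would still need to be checked against the alignments themselves.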

Tang-pro commented 3 months ago

Hi, @heche-psb

Thank you so much for taking the time to reply, it means a lot to me.