hillerlab / TOGA

TOGA (Tool to infer Orthologs from Genome Alignments): implements a novel paradigm to infer orthologous genes. TOGA integrates gene annotation, inferring orthologs and classifying genes as intact or lost.
MIT License

Questions about the outputs of TOGA "loss_summ_data.tsv" and "proteinAlignments.fa" #170

Open SWei2333 opened 1 month ago

SWei2333 commented 1 month ago

Hi, I want to extract a GFF file for subsequent analysis. I previously used GTF files from Zoonomia, but the BUSCO score of the protein (PEP) sequences extracted from that GTF file was only 76%. I discussed this with you in another issue, and you suggested extracting the QUERY sequences from the proteinAlignments.fa file. This did improve the BUSCO score, but I have run into some new issues. I extracted nearly 50,000 QUERY sequences, whereas I would expect this species to have only 20,000-30,000 genes.

If I want to extract only I or PI entries based on loss_summ_data.tsv, which category should I choose? I tried extracting "PROJECTION", "GENE", and "TRANSCRIPT", but the numbers of I and PI entries were inconsistent between them, which confused me. If I want to extract genes from TOGA's output files to use as an annotation for subsequent analysis, which method should I choose? Should I extract the QUERY sequences from alignment.pep.fa and then intersect them with the I and PI entries in the "PROJECTION" category of loss_summ_data.tsv? (The overlap is 35,271 for the species above, which is still too many.)

MichaelHiller commented 1 month ago

Hi,

~50,000 likely refers to the transcripts we annotate in the query, not genes. Most genes have several transcripts, and some transcripts have several orthologous loci (called projections).
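
To see why the I and PI counts differ between "PROJECTION", "TRANSCRIPT" and "GENE", it can help to tally the classes per level. The following is only a minimal sketch: it assumes loss_summ_data.tsv has three tab-separated columns (level, ID, class), and the function name and toy rows are made up for illustration.

```python
from collections import Counter, defaultdict

def count_classes_by_level(lines):
    """Tally class labels separately for each level (PROJECTION, TRANSCRIPT, GENE).

    Assumption: each line is "<level>\t<id>\t<class>".
    """
    counts = defaultdict(Counter)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        level, label = fields[0], fields[2]
        counts[level][label] += 1
    return counts

# Toy rows, invented for illustration only: one gene with two transcripts,
# one of which has two projections.
example = [
    "PROJECTION\tT1.10\tI",
    "PROJECTION\tT1.25\tPI",
    "PROJECTION\tT2.10\tI",
    "TRANSCRIPT\tT1\tI",
    "TRANSCRIPT\tT2\tI",
    "GENE\tG1\tI",
]

counts = count_classes_by_level(example)
for level in ("PROJECTION", "TRANSCRIPT", "GENE"):
    print(level, dict(counts[level]))
```

With a real file, the same function could be called as `count_classes_by_level(open("loss_summ_data.tsv"))`. In the toy data, three projections collapse to two transcripts and one gene, which is the kind of difference described above.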

We often consider I, PI and UL, but using only I and PI is also fine, depending on how strict you want to be. Please have a look at whether https://github.com/alejandrogzi/postoga is helpful for processing and filtering the TOGA output.
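
If a quick manual filter is enough, something along these lines might work. It is a rough sketch, not TOGA's or postoga's own method, and it rests on several assumptions: that loss_summ_data.tsv has three tab-separated columns (level, ID, class), that FASTA headers in proteinAlignments.fa look like `>ID | PROT | QUERY`, and that alignment gaps are written as `-`. All function names are hypothetical.

```python
def select_projections(loss_lines, keep=("I", "PI", "UL")):
    """Return the IDs of PROJECTION rows whose class is in `keep`."""
    keep = set(keep)
    selected = set()
    for line in loss_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[0] == "PROJECTION" and fields[2] in keep:
            selected.add(fields[1])
    return selected

def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def filter_query_records(fasta_lines, selected):
    """Keep QUERY records of selected projections, with gap characters removed.

    Assumption: header fields are separated by "|", the first field is the
    projection ID and the last field is QUERY or REFERENCE.
    """
    kept = []
    for header, seq in read_fasta(fasta_lines):
        parts = [p.strip() for p in header.split("|")]
        if parts[-1] == "QUERY" and parts[0] in selected:
            kept.append((parts[0], seq.replace("-", "")))
    return kept

# Toy inputs, invented for illustration only
loss = [
    "PROJECTION\tT1.10\tI",
    "PROJECTION\tT1.25\tL",
    "PROJECTION\tT2.10\tUL",
    "TRANSCRIPT\tT1\tI",
]
fasta = [
    ">T1.10 | PROT | REFERENCE", "MKV",
    ">T1.10 | PROT | QUERY", "MK-V",
    ">T1.25 | PROT | QUERY", "MAA",
    ">T2.10 | PROT | QUERY", "MLL",
]

selected = select_projections(loss)                     # I, PI and UL
strict = select_projections(loss, keep=("I", "PI"))     # stricter choice
kept = filter_query_records(fasta, selected)
```

Passing `keep=("I", "PI")` gives the stricter selection. Note that this still works per projection, so several transcripts of the same gene can remain; reducing to one entry per gene would need an additional step.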