Questions about the outputs of TOGA "loss_summ_data.tsv" and "proteinAlignments.fa"

Hi, I want to extract a GFF file for downstream analysis. I previously used GTF files from Zoonomia, but the proteins (PEP) extracted from that GTF had a BUSCO score of only 76%. I raised this in another issue, and you suggested extracting the QUERY sequences from the proteinAlignments.fa file. That did improve the BUSCO score, but I have run into new problems. I extracted nearly 50,000 QUERY sequences, whereas from what I know of this species it should have only 20,000-30,000 genes.

If I want to keep only I or PI entries based on loss_summ_data.tsv, which category should I use? I tried extracting the I and PI entries from the "PROJECTION", "GENE", and "TRANSCRIPT" categories, but the counts differed between categories, which confused me. If I want to extract genes from TOGA's output files to use as an annotation for downstream analysis, what method should I choose? Should I extract the QUERY sequences from alignment.pep.fa and then intersect them with the I and PI entries in the "PROJECTION" category of loss_summ_data.tsv? (For the species above, the overlap is 35,271, which is still too many.)
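For reference, the intersection I describe can be sketched roughly as below. This is a minimal sketch, not TOGA's own tooling. It assumes loss_summ_data.tsv has three tab-separated columns (category, identifier, class) and that the alignment FASTA headers look like `>ID | PROT | QUERY`, with gaps written as `-`. The function names are hypothetical. It also tallies the classes per category, which is how I noticed that the I and PI counts differ between "PROJECTION", "TRANSCRIPT", and "GENE".

```python
from collections import Counter


def read_loss_summary(lines):
    """Yield (category, identifier, cls) from 3-column tab-separated lines.

    Assumption: each row is CATEGORY<TAB>ID<TAB>CLASS; other rows are skipped.
    """
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 3:
            yield tuple(parts)


def count_by_category(lines):
    """Count rows per (category, class), e.g. ("PROJECTION", "I")."""
    return Counter((cat, cls) for cat, _ident, cls in read_loss_summary(lines))


def intact_projection_ids(lines, keep=("I", "PI")):
    """Collect PROJECTION identifiers whose class is in `keep`."""
    return {
        ident
        for cat, ident, cls in read_loss_summary(lines)
        if cat == "PROJECTION" and cls in keep
    }


def filter_query_records(fasta_lines, keep_ids):
    """Return [(id, ungapped_sequence)] for QUERY records whose ID is kept.

    Assumption: headers have the form '>ID | PROT | QUERY'.
    """
    records, current_id, chunks, keep = [], None, [], False
    for raw in fasta_lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if keep:
                records.append((current_id, "".join(chunks)))
            fields = [f.strip() for f in line[1:].split("|")]
            current_id = fields[0]
            keep = fields[-1] == "QUERY" and current_id in keep_ids
            chunks = []
        elif keep:
            # Drop alignment gap characters to get the plain protein sequence.
            chunks.append(line.replace("-", ""))
    if keep:
        records.append((current_id, "".join(chunks)))
    return records


# Usage with real files (paths as in my run):
# with open("loss_summ_data.tsv") as fh:
#     ids = intact_projection_ids(fh)
# with open("proteinAlignments.fa") as fh:
#     records = filter_query_records(fh, ids)
```

Note that this keeps every I or PI projection, so a gene with several projections would still be counted more than once, which may be part of why the overlap stays so large.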

hillerlab / TOGA, issue #170