nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

Gene products labeled as "hypothetical proteins" even though intermediate files have product names #765

Open ceanothus opened 2 years ago

ceanothus commented 2 years ago

Hi there,

I am using funannotate v.1.8.11 with conda.

I have a set of gene models for a draft fungal genome that I received from BRAKER2 and refined with PASA; I decided to use funannotate for the functional annotation with InterProScan, EggNOG Mapper, antiSMASH, and SignalP results. Here is my command:

funannotate annotate \
    --gff GH37_braker_pasa_refined.gff3 \
    --fasta GH37_assembly_final_smasked.fasta \
    -s "Marasmius tenuissimus" \
    --isolate GH37 \
    -o funannotate_annotate2 \
    --eggnog GH37_assembly_final_emapper.annotations_copy.tsv \
    --antismash GH37_assembly_final_smasked_antismash_copy.gbk \
    --iprscan GH37_braker_pasa_refined_interproscan_copy.xml \
    --signalp GH37_braker_pasa_refined_protein_signalp_copy.txt \
    --cpus 16

Unfortunately when I ran it, the vast majority of the proteins were labeled as "hypothetical proteins" even though in the annotate_misc folder, names for many gene products were found. I've taken a look at #364 but I'm not sure how to fix the issue.

My input gff looks like this:

scaffold04_8 PASA gene 2282 3144 . - . ID=Mten.g00001;Name=Mten.g00001;
scaffold04_8 PASA mRNA 2282 3144 . - . ID=Mten.g00001.m01;Name=Mten.g00001.m01;Parent=Mten.g00001;
scaffold04_8 PASA three_prime_UTR 2282 2471 . - . ID=Mten.g00001.m01.3UTR01;Name=Mten.g00001.m01.3UTR01;Parent=Mten.g00001.m01;
scaffold04_8 PASA exon 2282 2815 . - . ID=Mten.g00001.m01.exon01;Name=Mten.g00001.m01.exon01;Parent=Mten.g00001.m01;
scaffold04_8 PASA CDS 2472 2815 . - 2 ID=Mten.g00001.m01.CDS01;Name=Mten.g00001.m01.CDS01;Parent=Mten.g00001.m01;
scaffold04_8 PASA exon 2893 3043 . - . ID=Mten.g00001.m01.exon02;Name=Mten.g00001.m01.exon02;Parent=Mten.g00001.m01;
scaffold04_8 PASA CDS 2893 3016 . - 0 ID=Mten.g00001.m01.CDS02;Name=Mten.g00001.m01.CDS02;Parent=Mten.g00001.m01;
scaffold04_8 PASA five_prime_UTR 3017 3043 . - . ID=Mten.g00001.m01.5UTR01;Name=Mten.g00001.m01.5UTR01;Parent=Mten.g00001.m01;
scaffold04_8 PASA exon 3095 3144 . - . ID=Mten.g00001.m01.exon03;Name=Mten.g00001.m01.exon03;Parent=Mten.g00001.m01;
scaffold04_8 PASA five_prime_UTR 3095 3144 . - . ID=Mten.g00001.m01.5UTR02;Name=Mten.g00001.m01.5UTR02;Parent=Mten.g00001.m01;

EggNOG Mapper tsv:

image

antiSMASH gbk:

image

InterProScan xml:

image

funannotate annotate GFF3 output:

image

Do you have any solutions for the problem? Please let me know what other files you would need to see. I really appreciate the help.

hyphaltip commented 2 years ago

This mostly has to do with the level of trust one should give to functional predictions made by homology inference. Jon has commented on this before, but it's problematic to over-assign when there is only a distant similarity hit, and InterPro domains alone are insufficient to make a product description.

Generally I think it's stringent similarity matches to Swiss-Prot proteins that get promoted to product? Though I definitely have to chase poorly formatted product names that get transferred from EggNOG, so I'm not totally sure of the order of preference lately.

hyphaltip commented 2 years ago

Unless this is a problem with how the IDs come in from BRAKER and are then processed for the functional prediction. Your gene/locus IDs do not conform to the prefix_number format that funannotate uses to match NCBI expectations.

Are all of your products hypothetical or just a lot of them?
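One way to test the ID hypothesis is to check whether the transcript IDs in the GFF3 actually match the query IDs in the EggNOG Mapper output. The sketch below is only a diagnostic under stated assumptions: the GFF3 is tab-separated, transcripts are typed `mRNA`, and the emapper annotations file has the query ID in its first tab-separated column with `#`-prefixed header and comment lines. The function names are hypothetical, and whether funannotate matches on exactly these IDs is not confirmed in this thread.

```python
def gff3_transcript_ids(lines):
    """Collect the ID attribute of every mRNA feature in a tab-separated GFF3."""
    ids = set()
    for line in lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9 and cols[2] == "mRNA":
            for field in cols[8].split(";"):
                if field.startswith("ID="):
                    ids.add(field[3:])
    return ids


def emapper_query_ids(lines):
    """Collect query IDs (assumed to be column 1) from emapper annotations."""
    ids = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        ids.add(line.split("\t", 1)[0].strip())
    return ids


def id_overlap(gff_lines, emapper_lines):
    """Count IDs shared between the two files and IDs unique to each."""
    gff_ids = gff3_transcript_ids(gff_lines)
    emapper_ids = emapper_query_ids(emapper_lines)
    return {
        "shared": len(gff_ids & emapper_ids),
        "gff_only": len(gff_ids - emapper_ids),
        "emapper_only": len(emapper_ids - gff_ids),
    }
```

If `shared` is close to zero while both files are large, the annotation tables and the gene models are probably keyed on different identifiers.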

ceanothus commented 2 years ago

Thanks for the quick response. The vast majority of them are labelled as hypothetical (roughly 23500 of 24150). I was thinking it may have to do with the locus IDs, but I'm not sure how to fix them if so. Right now their format is Mten.g00001 for genes, Mten.g00001.m01 for mRNAs and their proteins, and Mten.g00001.m01.exon01/CDS01 for exons and CDS features. Do you have any suggestions about what I should do?

pooranis commented 2 years ago

I have the same issue, and I think it is related to the Braker IDs. I have a locus tag prefix from NCBI, and I can write a script to modify the inputs so the IDs conform properly to what funannotate/NCBI uses. @hyphaltip could you give an example format of what the funannotate code expects?

I have tried just doing locustagprefix_XXX where XXX is a combo of the scaffold id and a unique number, as this is what I see in GenBank files downloaded from NCBI, but that didn't work (maybe my scaffold IDs are not a good format?).

nextgenusfs commented 2 years ago

This is defined in https://www.ncbi.nlm.nih.gov/genomes/locustag/Proposal.pdf.

What you download from NCBI is not what is used for submission. But there were also a whole bunch of legacy genomes in NCBI before they made these rules. It's really simple, i.e. the default in funannotate is FUN_000001, etc.
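As a rough illustration of that format, here is a small check for IDs shaped like FUN_000001. The pattern encodes the NCBI locus_tag rules as commonly summarised (a prefix of 3 to 12 alphanumeric characters starting with a letter, an underscore, then an alphanumeric identifier); treat that summary as an assumption and check the linked proposal for the authoritative wording. The name `is_valid_locus_tag` is hypothetical.

```python
import re

# Assumed rule: letter-initial alphanumeric prefix of 3-12 characters,
# one underscore, then an alphanumeric identifier.
LOCUS_TAG_RE = re.compile(r"^[A-Za-z][A-Za-z0-9]{2,11}_[A-Za-z0-9]+$")


def is_valid_locus_tag(tag):
    """Return True if tag looks like prefix_identifier, e.g. FUN_000001."""
    return bool(LOCUS_TAG_RE.match(tag))
```

With this pattern, `FUN_000001` passes, while `Mten.g00001` fails because it contains dots and no underscore.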

hyphaltip commented 2 years ago

I'm not sure. Usually I feed input models into the predict step, not the annotate step, but the expected format is locusprefix_xxxx, where the numbers are unique throughout the genome and carry no indication of contig name. Do you know how to renumber your GFF for this? Otherwise, why not import this GFF into the predict step?
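A minimal sketch of such a renumbering, assuming a tab-separated GFF3 in which every gene has an ID, every mRNA has a single Parent gene, and sub-feature IDs are the transcript ID plus a dotted suffix (as in the GFF shown earlier in the thread). The `-T1` transcript suffix and the function names `renumber_gff3` and `rename` are illustrative choices, not something this thread confirms funannotate requires.

```python
def parse_attrs(col9):
    """Split a GFF3 attribute column into a dict, keeping field order."""
    attrs = {}
    for field in col9.strip().strip(";").split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def rename(value, id_map):
    """Map an old ID to its new ID, keeping dotted sub-feature suffixes."""
    if value in id_map:
        return id_map[value]
    head = value
    while "." in head:
        head = head.rsplit(".", 1)[0]
        if head in id_map:
            return id_map[head] + value[len(head):]
    return value


def renumber_gff3(lines, prefix="FUN", width=6):
    """Return (rewritten_lines, id_map) with genes renamed to prefix_NNNNNN."""
    rows = [line.rstrip("\n") for line in lines]
    feats = [r.split("\t") for r in rows]
    id_map, n_genes, n_tx = {}, 0, {}
    # Pass 1: number genes in file order.
    for cols in feats:
        if len(cols) == 9 and cols[2] == "gene":
            n_genes += 1
            id_map[parse_attrs(cols[8])["ID"]] = f"{prefix}_{n_genes:0{width}d}"
    # Pass 2: number transcripts per gene (assumed suffix: -T1, -T2, ...).
    for cols in feats:
        if len(cols) == 9 and cols[2] == "mRNA":
            attrs = parse_attrs(cols[8])
            gene = id_map[attrs["Parent"]]
            n_tx[gene] = n_tx.get(gene, 0) + 1
            id_map[attrs["ID"]] = f"{gene}-T{n_tx[gene]}"
    # Pass 3: rewrite ID, Name and Parent on every feature line.
    out = []
    for row, cols in zip(rows, feats):
        if row.startswith("#") or len(cols) != 9:
            out.append(row)
            continue
        attrs = parse_attrs(cols[8])
        for key in attrs:
            if key == "Parent":
                parents = attrs[key].split(",")
                attrs[key] = ",".join(id_map.get(p, p) for p in parents)
            elif key in ("ID", "Name"):
                attrs[key] = rename(attrs[key], id_map)
        cols[8] = ";".join(f"{k}={v}" for k, v in attrs.items())
        out.append("\t".join(cols))
    return out, id_map
```

The returned `id_map` matters as much as the rewritten GFF: the EggNOG, InterProScan and SignalP results were computed against the old protein IDs, so they would need the same mapping applied (or the tools rerun) before being passed to the annotate step.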