nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

Annotations not being added properly in annotate step, and proteins throwing errors #840

Closed mmotoc closed 1 year ago

mmotoc commented 1 year ago

Are you using the latest release? Yes, w/ python 3.8.12

Describe the bug: Annotations do not seem to be added (e.g. diptera BUSCO -> 0 annotations added), and some proteins are throwing errors. Not sure how to proceed.

What command did you issue?

funannotate annotate --sbt template.sbt --gff Braula_annotation.gff3 --fasta BraulaGenome.fasta -s "Braula coeca" --eggnog eggNOG_annotations.tabular --iprscan Interproscan_braula.xml --busco_db diptera -o Functional_Annotation


[Nov 15 10:05 AM]: OS: Debian GNU/Linux 10, 5 cores, ~ 10 GB RAM. Python: 3.8.12
[Nov 15 10:05 AM]: Running 1.8.13
[Nov 15 10:05 AM]: Found existing output directory Functional_Annotation. Warning, will re-use any intermediate files found.
[Nov 15 10:05 AM]: Parsing annotation and preparing annotation files.
[Nov 15 10:06 AM]: Found 13,521 gene models from GFF3 annotation
[Nov 15 10:18 AM]: Adding Functional Annotation to Braula coeca, NCBI accession: None
[Nov 15 10:18 AM]: Annotation consists of: 13,521 gene models
[Nov 15 10:18 AM]: 11,093 protein records loaded
[Nov 15 10:18 AM]: Existing Pfam-A results found: Functional_Annotation/annotate_misc/annotations.pfam.txt
[Nov 15 10:18 AM]: 3 annotations added
[Nov 15 10:18 AM]: Running Diamond blastp search of UniProt DB version 2022_04
[Nov 15 10:18 AM]: 0 valid gene/product annotations from 1 total
[Nov 15 10:18 AM]: Existing Eggnog-mapper results found: Functional_Annotation/annotate_misc/eggnog.emapper.annotations
[Nov 15 10:18 AM]: Parsing EggNog Annotations
[Nov 15 10:18 AM]: EggNog version parsed as 2.1.8
[Nov 15 10:18 AM]: 30,895 COG and EggNog annotations added
[Nov 15 10:18 AM]: Combining UniProt/EggNog gene and product names using Gene2Product version 1.84
[Nov 15 10:18 AM]: 6,958 gene name and product description annotations added
[Nov 15 10:18 AM]: Running Diamond blastp search of MEROPS version 12.0
[Nov 15 10:18 AM]: 0 annotations added
[Nov 15 10:18 AM]: Annotating CAZYmes using HMMer search of dbCAN version 11.0
[Nov 15 10:25 AM]: 0 annotations added
[Nov 15 10:25 AM]: Annotating proteins with BUSCO diptera models
[Nov 15 10:42 AM]: 0 annotations added
[Nov 15 10:42 AM]: Skipping phobius predictions, try funannotate remote -m phobius
[Nov 15 10:42 AM]: Skipping secretome: neither SignalP nor Phobius searches were run
[Nov 15 10:42 AM]: 0 secretome and 0 transmembane annotations added
[Nov 15 10:42 AM]: Parsing InterProScan5 XML file
[Nov 15 10:43 AM]: Found 0 duplicated annotations, adding 109,297 valid annotations
[Nov 15 10:44 AM]: Converting to final Genbank format, good luck!
('ERROR', 'FUN_000767', {'name': None, 'type': 'mRNA', 'transcript': [''], 'cds_transcript': [''], 'protein': [], '5UTR': [[]], '3UTR': [[]], 'codon_start': [1], 'ids': ['FUN_000767-T1'], 'CDS': [[(235753, 235846), (239343, 239397), (240277, 240342), (240423, 240522)]], 'mRNA': [[(235753, 235846), (239343, 239397), (240277, 240342), (240423, 240522)]], 'strand': '+', 'gene_synonym': [], 'location': (235753, 240522), 'contig': 'contig_2', 'product': ['hypothetical protein'], 'source': 'funannotate', 'phase': [], 'db_xref': [[]], 'go_terms': [[]], 'EC_number': [[]], 'note': [[]], 'partialStart': [False], 'partialStop': [False], 'pseudo': False})

OS/Install Information

You are running Perl v b'5.026002'.
Now checking perl modules...
Carp: 1.38
Clone: 0.42
DBD::SQLite: 1.64
DBD::mysql: 4.046
DBI: 1.642
DB_File: 1.855
Data::Dumper: 2.173
File::Basename: 2.85
File::Which: 1.23
Getopt::Long: 2.5
Hash::Merge: 0.300
JSON: 4.02
LWP::UserAgent: 6.39
Logger::Simple: 2.0
POSIX: 1.76
Parallel::ForkManager: 2.02
Pod::Usage: 1.69
Scalar::Util::Numeric: 0.40
Storable: 3.15
Text::Soundex: 3.05
Thread::Queue: 3.12
Tie::File: 1.02
URI::Escape: 3.31
YAML: 1.29
local::lib: 2.000029
threads: 2.15
threads::shared: 1.56
All 27 Perl modules installed

Checking Environmental Variables...
$FUNANNOTATE_DB=/data/funannotate_db
$PASAHOME=/opt/conda/opt/pasa-2.5.2
$TRINITYHOME=/opt/conda/opt/trinity-2.8.5
$EVM_HOME=/opt/conda/opt/evidencemodeler-1.1.1
$AUGUSTUS_CONFIG_PATH=/opt/conda/config
$GENEMARK_PATH=/data/external/gm_et_linux_64
All 6 environmental variables are set

hyphaltip commented 1 year ago

Can you share the annotations.all.txt file from annotate_misc, or maybe an archive of your annotate_misc folder? Do the BUSCO results in annotate_misc actually contain any hits to promote to product annotations?

mmotoc commented 1 year ago

annotations.busco.txt

This is from a previous run, but I suspect today's annotation run produced the same empty file.

My files are clearly being read, but annotations do not seem to be added, which is troubling because I am annotating a relative of Drosophila that should easily find diptera analogs.

hyphaltip commented 1 year ago

OK. Attaching an empty file doesn't give much info. The raw BUSCO output, rather than the failed parsing, would be more informative, for example the contents of the BUSCO folder in annotate_misc. Without error messages or details it is hard to give suggestions. Did you look at the logfiles folder too? It sounds like maybe BUSCO is not running at all? Do you have a

nextgenusfs commented 1 year ago

You can check your install with funannotate test -t annotate --debug.

But this might be caused by some locus_tag names that are incompatible. What do the protein FASTA headers look like that it's using to run these steps, i.e. grep '^>' annotate_misc/genome.proteins.fa | head -n 10

This doesn't make any sense, i.e. you should have many more annotations than that.

[Nov 15 10:18 AM]: 11,093 protein records loaded
[Nov 15 10:18 AM]: Existing Pfam-A results found: Functional_Annotation/annotate_misc/annotations.pfam.txt
[Nov 15 10:18 AM]: 3 annotations added
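
One way to probe the ID-mismatch idea is to compare the protein FASTA header IDs with the IDs in one of the intermediate annotation tables. The following is a minimal sketch, not part of funannotate: the function names are hypothetical, and it assumes the table is tab-separated with the gene or transcript ID in its first column, which should be verified against the real annotate_misc files.

```python
def fasta_ids(fasta_text):
    """Return the first whitespace-delimited token of every FASTA header."""
    ids = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def annotation_ids(table_text):
    """Return column 1 of every non-empty, non-comment row.

    Assumption: tab-separated rows with the ID in the first column.
    """
    ids = set()
    for line in table_text.splitlines():
        if line.strip() and not line.startswith("#"):
            ids.add(line.split("\t")[0].strip())
    return ids


def id_overlap(fasta_text, table_text):
    """Report how many annotation IDs match a protein ID exactly."""
    proteins = fasta_ids(fasta_text)
    annotated = annotation_ids(table_text)
    return {
        "proteins": len(proteins),
        "annotated_ids": len(annotated),
        "matched": len(annotated & proteins),
        "unmatched": sorted(annotated - proteins),
    }


if __name__ == "__main__":
    # Toy data for illustration only; not taken from the run in this issue.
    fasta = ">FUN_000747-T1 FUN_000747\nMKT\n>FUN_000748-T1 FUN_000748\nMAA\n"
    table = "FUN_000747-T1\tEXAMPLE\tvalue_a\nother_gene-T1\tEXAMPLE\tvalue_b\n"
    print(id_overlap(fasta, table))
```

A large `unmatched` list would point toward the precomputed results having been generated against a different set of protein IDs than the ones in genome.proteins.fa.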

mmotoc commented 1 year ago

Hello,

Thanks for the response. The output of grep '^>' annotate_misc/genome.proteins.fa | head -n 10 is:

(base) michael@Michaels-MacBook-Pro-2 funannotate % grep '^>' /Users/michael/funannotate/annotation_files/Functional_Annotation/annotate_misc/genome.proteins.fa | head -n 10

FUN_000747-T1 FUN_000747
FUN_000748-T1 FUN_000748
FUN_000749-T1 FUN_000749
FUN_000750-T1 FUN_000750
FUN_000751-T1 FUN_000751
FUN_000753-T1 FUN_000753
FUN_000754-T1 FUN_000754
FUN_000755-T1 FUN_000755
FUN_000758-T1 FUN_000758
FUN_000759-T1 FUN_000759

It does seem that BUSCO is not running at all: the hmmer_output folder is filled with empty files with names along the lines of EOG09150029.out.1.

Any recommendations?
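
The empty-output observation above can be quantified before digging further. The sketch below counts empty and non-empty files in a directory. `summarize_outputs` is a hypothetical helper, not a funannotate or BUSCO function, and the directory to pass in (the hmmer_output folder mentioned above) depends on the local layout.

```python
import tempfile
from pathlib import Path


def summarize_outputs(directory):
    """Count empty and non-empty regular files directly inside `directory`."""
    files = [p for p in Path(directory).iterdir() if p.is_file()]
    empty = sorted(p.name for p in files if p.stat().st_size == 0)
    return {
        "total": len(files),
        "empty": len(empty),
        "non_empty": len(files) - len(empty),
        "empty_names": empty,
    }


if __name__ == "__main__":
    # Toy directory for illustration; point summarize_outputs() at the real
    # hmmer_output folder instead.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "EXAMPLE_A.out.1").write_text("")
        (Path(tmp) / "EXAMPLE_B.out.1").write_text("placeholder hit line\n")
        print(summarize_outputs(tmp))
```

If every file is empty, the search step probably produced no output at all, which is a different problem from hits being found and then dropped during parsing.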

mmotoc commented 1 year ago

As an update:

I switched from using the gff3 file for the annotation to the gtf, and the results seem a bit better.

[Nov 29 11:06 AM]: OS: Debian GNU/Linux 10, 5 cores, ~ 10 GB RAM. Python: 3.8.12
[Nov 29 11:06 AM]: Running 1.8.13
[Nov 29 11:06 AM]: Found existing output directory Functional_Annotation. Warning, will re-use any intermediate files found.
[Nov 29 11:06 AM]: Checking GenBank file for annotation
[Nov 29 11:22 AM]: Adding Functional Annotation to Braula coeca, NCBI accession: None
[Nov 29 11:22 AM]: Annotation consists of: 25,344 gene models
[Nov 29 11:22 AM]: 20,861 protein records loaded
[Nov 29 11:22 AM]: Existing Pfam-A results found: Functional_Annotation/annotate_misc/annotations.pfam.txt
[Nov 29 11:22 AM]: 3 annotations added
[Nov 29 11:22 AM]: Running Diamond blastp search of UniProt DB version 2022_04
[Nov 29 11:22 AM]: 0 valid gene/product annotations from 1 total
[Nov 29 11:22 AM]: Existing Eggnog-mapper results found: Functional_Annotation/annotate_misc/eggnog.emapper.annotations
[Nov 29 11:22 AM]: Parsing EggNog Annotations
[Nov 29 11:22 AM]: EggNog version parsed as 2.1.8
[Nov 29 11:22 AM]: 30,895 COG and EggNog annotations added
[Nov 29 11:22 AM]: Combining UniProt/EggNog gene and product names using Gene2Product version 1.84
[Nov 29 11:22 AM]: 6,958 gene name and product description annotations added
[Nov 29 11:22 AM]: Running Diamond blastp search of MEROPS version 12.0
[Nov 29 11:22 AM]: 0 annotations added
[Nov 29 11:22 AM]: Existing CAZYme results found: Functional_Annotation/annotate_misc/annotations.dbCAN.txt
[Nov 29 11:22 AM]: 211 annotations added
[Nov 29 11:22 AM]: Annotating proteins with BUSCO diptera models
[Nov 30 12:24 AM]: 2,463 annotations added
[Nov 30 12:24 AM]: Skipping phobius predictions, try funannotate remote -m phobius
[Nov 30 12:24 AM]: Skipping secretome: neither SignalP nor Phobius searches were run
[Nov 30 12:24 AM]: 0 secretome and 0 transmembane annotations added
[Nov 30 12:24 AM]: Parsing InterProScan5 XML file
[Nov 30 12:25 AM]: Found 0 duplicated annotations, adding 111,971 valid annotations
[Nov 30 12:25 AM]: Converting to final Genbank format, good luck!
[Nov 30 01:01 AM]: Creating AGP file and corresponding contigs file
[Nov 30 01:04 AM]: Writing genome annotation table.
[Nov 30 01:14 AM]: Funannotate annotate has completed successfully!

    We need YOUR help to improve gene names/product descriptions:
       0 gene/products names MUST be fixed, see Functional_Annotation/annotate_results/Gene2Products.must-fix.txt
       147 gene/product names need to be curated, see Functional_Annotation/annotate_results/Gene2Products.need-curating.txt
       751 gene/product names passed but are not in Database, see Functional_Annotation/annotate_results/Gene2Products.new-names-passed.txt

    Please consider contributing a PR at https://github.com/nextgenusfs/gene2product

NOTE: while it seems some of the BUSCO annotations are being added, 2k is much less than I would have expected. Not sure how to improve these numbers for another run.
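
To see which steps changed between the two runs, the per-step counts can be pulled out of the pasted logs. This is a rough sketch that relies only on the log text shown in this thread: it pairs each plain "N annotations added" message with the message immediately before it, and it deliberately ignores differently worded lines such as the EggNog and InterProScan totals. The function name is hypothetical.

```python
import re

# Matches timestamps such as "[Nov 15 10:18 AM]:" that prefix each log entry.
STAMP = re.compile(r"\[[A-Z][a-z]{2} \d{1,2} \d{1,2}:\d{2} [AP]M\]:")
ADDED = re.compile(r"([\d,]+) annotations added")


def added_counts(log_text):
    """Return (preceding message, count) for each plain 'N annotations added' entry."""
    # Splitting on the timestamp also handles logs flattened onto one line.
    messages = [m.strip() for m in STAMP.split(log_text) if m.strip()]
    results = []
    for previous, current in zip(messages, messages[1:]):
        match = ADDED.fullmatch(current)
        if match:
            results.append((previous, int(match.group(1).replace(",", ""))))
    return results


if __name__ == "__main__":
    # Two entries copied from the second log in this thread.
    sample = (
        "[Nov 29 11:22 AM]: Annotating proteins with BUSCO diptera models "
        "[Nov 30 12:24 AM]: 2,463 annotations added"
    )
    for step, count in added_counts(sample):
        print(f"{count:>8,}  {step}")
```

Running it on both logs and comparing the lists side by side makes it easier to spot steps, such as Pfam with 3 annotations in both runs, that did not change after switching input files.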

hyphaltip commented 1 year ago

Closing this, but reopen if you still have a specific issue to report or fix.