nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

COG annotations not fetching from Eggnog web version results #551

Closed athulmenon closed 3 years ago

athulmenon commented 3 years ago

Hi Jon,

Hope you are doing well.

We ran funannotate compare and were not able to generate COG annotations in the report. I went through the Eggnog annotations, which were generated using the Eggnog mapper web server, and I can see that the column headers for the COG annotation are missing from the file, which may be why the tool cannot fetch those results. Can you please fix it, or can we fix this by adding the column headers manually? (I am not sure which headers the tool looks for in the file.)

I am attaching a sample Eggnog result file. https://drive.google.com/file/d/1eKGjViwXYklJ_FwrxSw0wmeX7jBtRzEP/view?usp=sharing

We can also see that GO term enrichments, InterProScan plots and fungal TFs are missing from the comparison report.

Only one species was run through the funannotate prediction and annotation pipeline. For all other species, the .gbk and protein files were downloaded from NCBI and annotated using the funannotate annotate module. The species that we predicted and annotated ourselves has GO terms and fungal TFs, but in all the other species they are missing.

I am attaching the log file below. Kindly let us know if we can fix this. https://drive.google.com/file/d/1SqXZUoOBIgobCC44_Qsv01znq1N7TAAZ/view?usp=sharing

Funannotate Compare
$ ./funannotate-docker compare -i FEquiseti_Assembled/ FG_annotate_out/ FPsuedograminearum_annotation/ F Fuj_annotate/ FOxysporum_annotate/ FProliferatum_annotation/ FCoffeatum_annotation/ FSub_annotate_updated/ -o compare-fusariums_updated/ --cpus 10
logname: no login name
logname: no login name

[Feb 15 05:56 AM]: OS: Debian GNU/Linux 10, 12 cores, ~ 74 GB RAM. Python: 3.7.9
[Feb 15 05:56 AM]: Running 1.8.4
[Feb 15 05:56 AM]: Now parsing 8 genomes
[Feb 15 05:56 AM]: working on Fusarium equiseti
[Feb 15 05:57 AM]: working on Fusarium graminearum PH-1
[Feb 15 05:57 AM]: working on Fusarium pseudograminearum CS3096
[Feb 15 05:58 AM]: working on Fusarium fujikuroi IMI 58289
[Feb 15 05:58 AM]: working on Fusarium oxysporum f. sp. lycopersici 4287
[Feb 15 05:59 AM]: working on Fusarium proliferatum ET1
[Feb 15 06:00 AM]: working on Fusarium coffeatum
[Feb 15 06:00 AM]: working on Fusarium subglutinans
[Feb 15 06:00 AM]: Summarizing secondary metabolism gene clusters
[Feb 15 06:00 AM]: Summarizing PFAM domain results
[Feb 15 06:00 AM]: Summarizing InterProScan results
[Feb 15 06:00 AM]: Loading InterPro descriptions
[Feb 15 06:00 AM]: Summarizing MEROPS protease results
[Feb 15 06:00 AM]: found 33/104 MEROPS familes with stdev >= 1.000000
/venv/lib/python3.7/site-packages/funannotate/library.py:7865: MatplotlibDeprecationWarning: Calling add_axes() without argument is deprecated since 3.3 and will be removed two minor releases later. You may want to use add_subplot() instead.
  cbar_ax = fig.add_axes(shrink=0.4)
[Feb 15 06:01 AM]: Summarizing CAZyme results
[Feb 15 06:01 AM]: found 48/152 CAZy familes with stdev >= 1.000000
/venv/lib/python3.7/site-packages/funannotate/library.py:7865: MatplotlibDeprecationWarning: Calling add_axes() without argument is deprecated since 3.3 and will be removed two minor releases later. You may want to use add_subplot() instead.
  cbar_ax = fig.add_axes(shrink=0.4)
[Feb 15 06:01 AM]: No COG annotations found
[Feb 15 06:01 AM]: Summarizing secreted protein results
[Feb 15 06:01 AM]: Summarizing fungal transcription factors
/venv/lib/python3.7/site-packages/funannotate/library.py:7865: MatplotlibDeprecationWarning: Calling add_axes() without argument is deprecated since 3.3 and will be removed two minor releases later. You may want to use add_subplot() instead.
  cbar_ax = fig.add_axes(shrink=0.4)
[Feb 15 06:01 AM]: Running GO enrichment for each genome
WARNING: skipping Fusarium_coffeatum.txt as no GO terms
WARNING: skipping Fusarium_fujikuroi_IMI_58289.txt as no GO terms
WARNING: skipping Fusarium_graminearum_PH-1.txt as no GO terms
WARNING: skipping Fusarium_oxysporum_f._sp._lycopersici_4287.txt as no GO terms
WARNING: skipping Fusarium_proliferatum_ET1.txt as no GO terms
WARNING: skipping Fusarium_pseudograminearum_CS3096.txt as no GO terms
WARNING: skipping Fusarium_subglutinans.txt as no GO terms
/venv/lib/python3.7/site-packages/funannotate/compare.py:803: FutureWarning: The default value of regex will change from True to False in a future version.
  df.columns = df.columns.str.replace(r'^# ', '')
[Feb 15 06:03 AM]: Running orthologous clustering tool, ProteinOrtho. This may take awhile...
[Feb 15 06:26 AM]: Compiling all annotations for each genome
[Feb 15 06:27 AM]: Inferring phylogeny using RAxML
[Feb 15 06:27 AM]: Found 1056 single copy BUSCO orthologs, will randomly select 500 to infer phylogeny
[Feb 15 06:48 AM]: Compressing results to output file: compare-fusariums_updated.tar.gz
[Feb 15 06:52 AM]: Funannotate compare completed successfully!
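
For anyone triaging a similar run, the relevant lines of the log above can be pulled out with a few lines of Python. This is only a sketch: it assumes the warning wording is exactly as printed in this log (funannotate 1.8.4), and the function names are made up for illustration.

```python
import re

# Matches the GO warnings exactly as they appear in the compare log above.
SKIP_RE = re.compile(r"WARNING: skipping (\S+?)\.txt as no GO terms")


def genomes_without_go(log_text):
    """Return the genome names that the GO enrichment step skipped."""
    return SKIP_RE.findall(log_text)


def cog_missing(log_text):
    """True if the log reports that no COG annotations were found."""
    return "No COG annotations found" in log_text


# Small in-memory excerpt of the log above; works the same on the full text.
excerpt = (
    "[Feb 15 06:01 AM]: No COG annotations found "
    "WARNING: skipping Fusarium_coffeatum.txt as no GO terms "
    "WARNING: skipping Fusarium_subglutinans.txt as no GO terms"
)
print(genomes_without_go(excerpt))
print(cog_missing(excerpt))
```

In this run the only genome not listed in the warnings is Fusarium equiseti, the one that went through the full predict and annotate pipeline.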

Thanks & Regards, Athul

OS/Install Information

Checking dependencies for 1.8.4

You are running Python v 3.7.9. Now checking python packages...
biopython: 1.78
goatools: 1.0.15
matplotlib: 3.3.4
natsort: 7.1.1
numpy: 1.19.5
pandas: 1.2.1
psutil: 5.8.0
requests: 2.25.1
scikit-learn: 0.24.1
scipy: 1.5.3
seaborn: 0.11.1
All 11 python packages installed

You are running Perl v b'5.026002'. Now checking perl modules...
Bio::Perl: 1.007002
Carp: 1.38
Clone: 0.42
DBD::SQLite: 1.64
DBD::mysql: 4.046
DBI: 1.642
DB_File: 1.855
Data::Dumper: 2.173
File::Basename: 2.85
File::Which: 1.23
Getopt::Long: 2.5
Hash::Merge: 0.300
JSON: 4.02
LWP::UserAgent: 6.15
Logger::Simple: 2.0
POSIX: 1.76
Parallel::ForkManager: 2.02
Pod::Usage: 1.69
Scalar::Util::Numeric: 0.40
Storable: 3.15
Text::Soundex: 3.05
Thread::Queue: 3.12
Tie::File: 1.02
URI::Escape: 3.31
YAML: 1.29
threads: 2.15
threads::shared: 1.56
All 27 Perl modules installed

Checking Environmental Variables...
$FUNANNOTATE_DB=/opt/databases
$PASAHOME=/venv/opt/pasa-2.4.1
$TRINITYHOME=/venv/opt/trinity-2.8.5
$EVM_HOME=/venv/opt/evidencemodeler-1.1.1
$AUGUSTUS_CONFIG_PATH=/venv/config
ERROR: GENEMARK_PATH not set. export GENEMARK_PATH=/path/to/dir

Checking external dependencies...
Traceback (most recent call last):
  File "/venv/bin/ete3", line 6, in <module>
    from ete3.tools.ete import main
  File "/venv/lib/python3.7/site-packages/ete3/tools/ete.py", line 55, in <module>
    from . import (ete_split, ete_expand, ete_annotate, ete_ncbiquery, ete_view,
  File "/venv/lib/python3.7/site-packages/ete3/tools/ete_view.py", line 48, in <module>
    from .. import (Tree, PhyloTree, TextFace, RectFace, faces, TreeStyle, CircleFace, AttrFace,
ImportError: cannot import name 'TextFace' from 'ete3' (/venv/lib/python3.7/site-packages/ete3/__init__.py)
PASA: 2.4.1
CodingQuarry: 2.0
Trinity: 2.8.5
augustus: 3.3.3
bamtools: bamtools 2.5.1
bedtools: bedtools v2.30.0
blat: BLAT v36
diamond: 2.0.6
exonerate: exonerate 2.4.0
fasta: no way to determine
glimmerhmm: 3.0.4
gmap: 2017-11-15
hisat2: 2.2.1
hmmscan: HMMER 3.3.1 (Jul 2020)
hmmsearch: HMMER 3.3.1 (Jul 2020)
java: 11.0.8-internal
kallisto: 0.46.1
mafft: v7.475 (2020/Nov/23)
makeblastdb: makeblastdb 2.2.31+
minimap2: 2.17-r941
proteinortho: 6.0.16
pslCDnaFilter: no way to determine
salmon: salmon 0.14.1
samtools: samtools 1.10
snap: 2006-07-28
stringtie: 2.1.4
tRNAscan-SE: 2.0.7 (Oct 2020)
tantan: tantan 13
tbl2asn: no way to determine, likely 25.X
tblastn: tblastn 2.2.31+
trimal: trimAl v1.4.rev15 build[2013-12-17]
trimmomatic: 0.39
ERROR: emapper.py not installed
ERROR: ete3 not installed
ERROR: gmes_petap.pl not installed
ERROR: signalp not installed

athulmenon commented 3 years ago

Update

I went to the Eggnog GitHub and found that the issue of truncated headers has already been reported. I think they are trying to fix this; maybe it will get updated in today's maintenance. I tried naming the headers using the names from their manual and then ran the annotate and compare modules, but funannotate still does not fetch those columns. Can you help with a fix?
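
A quick way to confirm a truncated header is to compare the number of header columns with the number of fields in the first data row. The sketch below assumes a tab-separated annotations file in which comment and header lines start with "#"; it is not funannotate's parser, and the column names in the example are placeholders.

```python
def find_header(lines):
    """Return column names from the last '#' line before the first data row."""
    header = None
    for line in lines:
        if line.startswith("#"):
            header = line
        elif line.strip():
            break  # first data row reached
    if header is None:
        return None
    return header.lstrip("#").strip().split("\t")


def header_matches_rows(lines):
    """True if the header has as many columns as the first data row."""
    lines = list(lines)
    cols = find_header(lines)
    first = next((l for l in lines if l.strip() and not l.startswith("#")), None)
    if cols is None or first is None:
        return False
    return len(cols) == len(first.rstrip("\n").split("\t"))


# Placeholder rows: the second file has fewer header columns than data fields.
complete = ["# comment line\n", "#query_name\tseed\tevalue\n", "g1\tx\t1e-5\n"]
truncated = ["#query_name\tseed\n", "g1\tx\t1e-5\tS\n"]
print(header_matches_rows(complete), header_matches_rows(truncated))
```

To check a real file, pass an open file handle instead of the lists, for example `header_matches_rows(open(path))` with `path` pointing at the downloaded annotations file.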

Regarding the GO, InterPro and TF results, I went through the annotation files and found that the protein sequences taken directly from NCBI have IDs starting with "XP_", whereas funannotate needs IDs that match the locus IDs. In the annotate folder I found a protein FASTA file with such IDs, produced by the annotate module; I am now running it through InterProScan and waiting for the results. Thanks, Athul
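
Before submitting a protein FASTA to InterProScan or Eggnog mapper, it may help to check which kind of IDs it carries. A minimal sketch, assuming plain FASTA with the ID as the first token of each header; the locus-style ID in the example is invented.

```python
def fasta_ids(lines):
    """Yield the first whitespace-delimited token of each FASTA header."""
    for line in lines:
        if line.startswith(">"):
            header = line[1:].strip()
            yield header.split()[0] if header else ""


def count_prefixed_ids(lines, prefix="XP_"):
    """Return (number of IDs starting with prefix, total number of IDs)."""
    ids = list(fasta_ids(lines))
    hits = sum(1 for seq_id in ids if seq_id.startswith(prefix))
    return hits, len(ids)


# Invented records: one NCBI-style accession, one locus-style ID.
records = [">XP_000000001.1 hypothetical protein\n", "MKV\n",
           ">LOCUS_00001-T1\n", "MAA\n"]
print(count_prefixed_ids(records))
```

If most IDs start with "XP_", the file probably came straight from NCBI rather than from the annotate output folder.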

nextgenusfs commented 3 years ago

The error here is probably at least twofold with eggnog mapper. The only version I can officially support right now is still v1.0.3, because they are running some ~v2.0 on the website and the local version has never stabilized its output, so I haven't built the parser to support it. The other issue is that v1.0.3 is python2 only, which makes it hard/impossible to run in the same Conda environment. So the short answer is that right now the only supported way to get eggnog mapper results in is to use v1.0.3 via a local installation. When the eggnog mapper developers release a functioning v2.x version that has a stable file output format, I'll update the parser to support it, but I'm not going to waste my time again trying to deal with changing output files.

Per the NCBI genome issue -- let me pull one of those genomes you had and run it through locally to see if I can find the problem -- I sort of knew this was an issue but I thought it was corrected....

nextgenusfs commented 3 years ago

Okay, I think I see the problem, or at least one of them. These genomes from NCBI in GenBank format have both mRNA and exon/intron fields (which is not the way that default tbl2asn writes them...), and the parser was not reading this format properly, causing the mRNA coordinates to be wrong (essentially getting overwritten by the last 'exon' field for each gene). Eventually this caused an error in the script, but the problem is pretty bad, as all of the output files would be non-functional with this format, i.e. the mRNA features would not match up with the coding features.
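
To make the overwrite failure mode concrete, here is a toy illustration. It is not funannotate's parser or the actual fix; the feature tuples and function names are hypothetical and only show how storing mRNA and exon features under one key lets the last exon win.

```python
def coords_overwritten(features):
    """Buggy pattern: mRNA and exon share one slot, so the last feature wins."""
    coords = {}
    for gene_id, ftype, start, end in features:
        if ftype in ("mRNA", "exon"):
            coords[gene_id] = (start, end)
    return coords


def coords_separated(features):
    """Keep the mRNA span and the exon list in separate slots."""
    coords = {}
    for gene_id, ftype, start, end in features:
        entry = coords.setdefault(gene_id, {"mRNA": None, "exons": []})
        if ftype == "mRNA":
            entry["mRNA"] = (start, end)
        elif ftype == "exon":
            entry["exons"].append((start, end))
    return coords


# Hypothetical gene with one mRNA feature followed by two exon features.
features = [("g1", "mRNA", 100, 900),
            ("g1", "exon", 100, 300),
            ("g1", "exon", 500, 900)]
print(coords_overwritten(features))
print(coords_separated(features))
```

With the first function the gene ends up with only the span of its last exon, which is the kind of mismatch between mRNA and coding features described above.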

The other issue is perhaps how you are running IPRscan and eggnog, in other words which protein FASTA file would you be passing to those tools. So I'll push this update right now (it will take a few hours to build) but then I'd recommend the following:

  1. Run funannotate annotate on each NCBI genome first, then use the resulting protein FASTA file to run IPRScan and Eggnog mapper, then re-run the same funannotate annotate command and pass it your IPRScan xml file and your eggnog annotations file to --iprscan and --eggnog.
$ funannotate annotate --genbank GCF_900079805.1_Fusarium_fujikuroi_IMI58289_V2_genomic.gbff -o test-annotate-fixed --cpus 4
-------------------------------------------------------
[Feb 15 11:48 AM]: OS: MacOSX 10.14.6, 8 cores, ~ 17 GB RAM. Python: 3.7.6
[Feb 15 11:48 AM]: Running 1.8.4
[Feb 15 11:48 AM]: No NCBI SBT file given, will use default, however if you plan to submit to NCBI, create one and pass it here '--sbt'
[Feb 15 11:48 AM]: Checking GenBank file for annotation
[Feb 15 11:49 AM]: Adding Functional Annotation to Fusarium fujikuroi IMI 58289, NCBI accession: WGS:HF67
[Feb 15 11:49 AM]: Annotation consists of: 14,943 gene models
[Feb 15 11:49 AM]: 14,810 protein records loaded
[Feb 15 11:49 AM]: Running HMMer search of PFAM version 33.1
[Feb 15 11:59 AM]: 16,643 annotations added
[Feb 15 11:59 AM]: Running Diamond blastp search of UniProt DB version 2020_03
[Feb 15 12:01 PM]: 1,018 valid gene/product annotations from 1,388 total
[Feb 15 12:01 PM]: Install eggnog-mapper or use webserver to improve functional annotation: https://github.com/jhcepas/eggnog-mapper
[Feb 15 12:01 PM]: No Eggnog-mapper results found.
[Feb 15 12:01 PM]: Combining UniProt/EggNog gene and product names using Gene2Product version 1.62
[Feb 15 12:01 PM]: 1,018 gene name and product description annotations added
[Feb 15 12:01 PM]: Running Diamond blastp search of MEROPS version 12.0
[Feb 15 12:01 PM]: 466 annotations added
[Feb 15 12:01 PM]: Annotating CAZYmes using HMMer search of dbCAN version 8.0
[Feb 15 12:03 PM]: 623 annotations added
[Feb 15 12:03 PM]: Annotating proteins with BUSCO dikarya models
[Feb 15 12:04 PM]: 1,323 annotations added
[Feb 15 12:04 PM]: Skipping phobius predictions, try funannotate remote -m phobius
[Feb 15 12:04 PM]: Predicting secreted proteins with SignalP
[Feb 15 12:11 PM]: 1,465 secretome and 0 transmembane annotations added
[Feb 15 12:11 PM]: InterProScan error, test-annotate-fixed/annotate_misc/iprscan.xml is empty, or no XML file passed via --iprscan. Functional annotation will be lacking.
[Feb 15 12:11 PM]: Found 0 duplicated annotations, adding 22,556 valid annotations
[Feb 15 12:11 PM]: Detected NCBI reannotation, but couldn't locate p2g file, please pass via --p2g
[Feb 15 12:11 PM]: Converting to final Genbank format, good luck!
[Feb 15 12:13 PM]: Creating AGP file and corresponding contigs file
[Feb 15 12:13 PM]: Writing genome annotation table.
[Feb 15 12:15 PM]: Funannotate annotate has completed successfully!

        We need YOUR help to improve gene names/product descriptions:
           0 gene/products names MUST be fixed, see test-annotate-fixed/annotate_results/Gene2Products.must-fix.txt
           0 gene/product names need to be curated, see test-annotate-fixed/annotate_results/Gene2Products.need-curating.txt
           45 gene/product names passed but are not in Database, see test-annotate-fixed/annotate_results/Gene2Products.new-names-passed.txt

        Please consider contributing a PR at https://github.com/nextgenusfs/gene2product

-------------------------------------------------------
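
The two-pass workflow recommended above can be sketched as a small command builder. It uses only the options named in this thread (--genbank, -o, --cpus, --iprscan, --eggnog); the file and folder names are placeholders, and the commands are printed rather than executed.

```python
import shlex


def annotate_cmd(genbank, outdir, cpus=4, iprscan=None, eggnog=None):
    """Build a `funannotate annotate` argument list for one genome."""
    cmd = ["funannotate", "annotate", "--genbank", genbank,
           "-o", outdir, "--cpus", str(cpus)]
    if iprscan:
        cmd += ["--iprscan", iprscan]  # InterProScan XML from the first-pass proteins
    if eggnog:
        cmd += ["--eggnog", eggnog]    # eggnog mapper annotations file
    return cmd


def as_shell(cmd):
    """Render an argument list as a copy-pasteable shell line."""
    return " ".join(shlex.quote(arg) for arg in cmd)


# Pass 1: annotate the NCBI GenBank file to obtain the protein FASTA.
first_pass = annotate_cmd("genome.gbff", "annotate_out")
# Pass 2: same command, plus the results computed on that protein FASTA.
second_pass = annotate_cmd("genome.gbff", "annotate_out",
                           iprscan="iprscan.xml",
                           eggnog="proteins.emapper.annotations")
print(as_shell(first_pass))
print(as_shell(second_pass))
```

Between the two passes, InterProScan and eggnog mapper are run on the protein FASTA written by the first pass, not on the protein file downloaded from NCBI.
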
athulmenon commented 3 years ago

Sorry for the late reply. Yes, I tested it and it is working. I downgraded the eggnog version and ran the protein sequences that were generated in the annotation folder, and it worked! Thanks for the support.

Regards, Athul

hutchinsonmiri commented 3 years ago

Athul, can I ask what version of EggNog you downgraded to? I tried 1.0.3 but it doesn't work with the version of Diamond that Funannotate uses. Thanks!!

Miriam

nextgenusfs commented 3 years ago

You need to rebuild the diamond database for eggnog - the default format is too old. This is described a few times in eggnog mapper GitHub issues.

athulmenon commented 3 years ago

Hi Miriam,

I installed eggnog mapper 1.0.3 and ran it independently to generate the result files. Those outputs were then fed into funannotate.

Athul

hutchinsonmiri commented 3 years ago

Thanks Jon and Athul! I did this (below) and it appears to be working:

# download protein models
wget http://eggnogdb.embl.de/download/eggnog_4.5/eggnog-mapper-data/eggnog4.clustered_proteins.fa.gz

# make diamond database
diamond makedb --in eggnog4.clustered_proteins.fa.gz --db eggnog_proteins