nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

annotate not parsing COGs if input is gbk #808

Closed xvazquezc closed 1 year ago

xvazquezc commented 2 years ago

Are you using the latest release? Currently on 1.8.11.

Describe the bug

funannotate annotate doesn't write the COG annotations into the gbk output file if the input is a gbk file, but it works without problems if given the "predict" folder. I'm reannotating some genomes from GenBank to compare with my own, and I only noticed because running funannotate compare gave the same error as reported in #682; funannotate annotate itself doesn't report any error. The eggnog-mapper input seems to be parsed without problems: the COG annotation entries in annotate_misc/annotations.eggnog.txt are there and look normal.

What command did you issue?

funannotate annotate --genbank genome.gbk --out annotate --eggnog emapper.emapper.annotations \
--antismash antismash/genome.gbk --iprscan iprs.xml --signalp signalp/prediction_results.txt --cpus 12 --no-progress

Logfiles

funannotate annotate doesn't throw any error, and based on the log everything looks OK. The problem only shows up if you examine the output files or run funannotate compare, as mentioned above.
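One way to examine the output without running compare is to count COG tokens in the annotated GenBank file. The sketch below is a minimal check and not part of funannotate. It assumes that COG assignments appear in /note qualifiers as tokens such as COG:S; the function name count_cog_notes and the path in the usage comment are hypothetical.

```python
import re

# Assumption: COG categories are written into /note qualifiers as "COG:<letters>".
COG_TOKEN = re.compile(r"\bCOG:[A-Z]+")


def count_cog_notes(gbk_text):
    """Return the number of COG tokens found in GenBank-formatted text."""
    return len(COG_TOKEN.findall(gbk_text))


# Usage (hypothetical path):
# with open("annotate/annotate_results/genome.gbk") as handle:
#     print(count_cog_notes(handle.read()))
```

A count of zero for a genome whose annotations.eggnog.txt table is populated would point to the annotations being dropped before the GenBank file is written.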

OS/Install Information

-------------------------------------------------------
Checking dependencies for 1.8.11
-------------------------------------------------------
You are running Python v 3.8.12. Now checking python packages...
biopython: 1.77
goatools: 1.2.3
matplotlib: 3.4.3
natsort: 8.1.0
numpy: 1.22.3
pandas: 1.4.2
psutil: 5.9.0
requests: 2.27.1
scikit-learn: 1.0.2
scipy: 1.8.0
seaborn: 0.11.2
All 11 python packages installed

You are running Perl v b'5.026002'. Now checking perl modules...
Carp: 1.38
Clone: 0.42
DBD::SQLite: 1.64
DBD::mysql: 4.046
DBI: 1.642
DB_File: 1.855
Data::Dumper: 2.173
File::Basename: 2.85
File::Which: 1.23
Getopt::Long: 2.5
Hash::Merge: 0.300
JSON: 4.02
LWP::UserAgent: 6.39
Logger::Simple: 2.0
POSIX: 1.76
Parallel::ForkManager: 2.02
Pod::Usage: 1.69
Scalar::Util::Numeric: 0.40
Storable: 3.15
Text::Soundex: 3.05
Thread::Queue: 3.12
Tie::File: 1.02
URI::Escape: 3.31
YAML: 1.29
local::lib: 2.000024
threads: 2.15
threads::shared: 1.56
All 27 Perl modules installed

Checking Environmental Variables...
$FUNANNOTATE_DB=/share/bioinfo/z3382651/funannotate_db
$PASAHOME=/share/bioinfo/z3382651/miniconda3/envs/funannotate-master/opt/pasa-2.4.1
$TRINITY_HOME=/share/bioinfo/z3382651/miniconda3/envs/funannotate-master/opt/trinity-2.8.5
$EVM_HOME=/share/bioinfo/z3382651/miniconda3/envs/funannotate-master/opt/evidencemodeler-1.1.1
$AUGUSTUS_CONFIG_PATH=/share/bioinfo/z3382651/miniconda3/envs/funannotate-master/config/
$GENEMARK_PATH=/share/bioinfo/z3382651/gmes_linux_64
All 6 environmental variables are set
-------------------------------------------------------
Checking external dependencies...
PASA: 2.4.1
CodingQuarry: 2.0
Trinity: 2.8.5
augustus: 3.3.3
bamtools: bamtools 2.5.1
bedtools: bedtools v2.30.0
blat: BLAT v36
diamond: 2.0.14
emapper.py: 2.1.2
ete3: 3.1.2
exonerate: exonerate 2.4.0
fasta: no way to determine
glimmerhmm: 3.0.4
gmap: 2017-11-15
hisat2: 2.2.1
hmmscan: HMMER 3.3.2 (Nov 2020)
hmmsearch: HMMER 3.3.2 (Nov 2020)
java: 11.0.13
kallisto: 0.46.1
mafft: v7.505 (2022/Apr/10)
makeblastdb: makeblastdb 2.2.31+
minimap2: 2.24-r1122
pigz: pigz 2.6
proteinortho: 6.0.34
pslCDnaFilter: no way to determine
salmon: salmon 0.14.1
samtools: samtools 1.12
signalp: 5.0b
snap: 2006-07-28
stringtie: 2.2.1
tRNAscan-SE: 2.0.9 (July 2021)
tantan: tantan 31
tbl2asn: no way to determine, likely 25.X
tblastn: tblastn 2.2.31+
trimal: trimAl v1.4.rev15 build[2013-12-17]
trimmomatic: 0.39
    ERROR: gmes_petap.pl not installed

xvazquezc commented 2 years ago

I've been looking further into this, and basically most annotation sources are ignored and not passed to annotate_misc/all.annotations.txt. From what I can see, the rows missing from annotate_misc/all.annotations.txt are the GO terms, EggNOG, COG, SMCOG, InterPro, SECRETED, name and product. The individual annotate_misc/annotations.*.txt files are generated without problems.
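A quick way to see whether the per-source tables can be merged at all is to compare their IDs with the IDs of the gene models. The following is a minimal sketch, assuming each annotations.*.txt file is tab-separated with the feature ID in the first column. The helper names and the sample IDs are hypothetical.

```python
def ids_from_lines(lines):
    """Collect first-column IDs from tab-separated annotation lines."""
    ids = set()
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment lines.
        if line and not line.startswith("#"):
            ids.add(line.split("\t", 1)[0])
    return ids


def unmatched_ids(annotation_ids, model_ids):
    """Return annotation IDs that have no counterpart among the gene models."""
    return sorted(set(annotation_ids) - set(model_ids))


# Usage (hypothetical file name and IDs):
# with open("annotate/annotate_misc/annotations.eggnog.txt") as handle:
#     missing = unmatched_ids(ids_from_lines(handle), {"TAG_00001-T1"})
#     print(len(missing), "annotation IDs do not match any gene model")
```

If nearly every ID is reported as unmatched, the tables were probably generated from sequences named differently from the gene models, rather than being lost by a parsing error.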

kellystyles commented 1 year ago

Hi there @xvazquezc. Did you end up figuring out a workaround for this? I'm getting the same bug when trying to use Funannotate on GenBank genomes that have gene predictions but lack functional annotations. I don't really want to rerun the prediction steps, as that may reduce accuracy. It's a little frustrating that all the information is just sitting there but not being parsed.

xvazquezc commented 1 year ago

@kellystyles kinda my situation too. Unfortunately I didn't follow through...

nextgenusfs commented 1 year ago

funannotate compare is only going to output data properly if all genomes you are comparing have had functional annotation added with funannotate annotate. So if you have a public genome that's fine, add functional annotation to it with funannotate annotate and use that resulting annotated GBK file for compare.

xvazquezc commented 1 year ago

@nextgenusfs the problem is that funannotate annotate doesn't add the annotations if your genome was gene-called externally. All the annotation sources are there, but they are not passed to annotate_misc/all.annotations.txt, as I mentioned above, and so they are not incorporated into the gbk files.

nextgenusfs commented 1 year ago

It must be something specific to the GenBank file you are using; sometimes old GenBank files have locus tags that are problematic.

Here is an example:

$ funannotate annotate --genbank GCF_000149615.1_ASM14961v1_genomic.gbff -o aterreus --cpus 7
-------------------------------------------------------
[May 04 06:18 PM]: OS: MacOSX 10.16, 8 cores, ~ 17 GB RAM. Python: 3.7.12
[May 04 06:18 PM]: Running 1.8.15
[May 04 06:18 PM]: No NCBI SBT file given, will use default, however if you plan to submit to NCBI, create one and pass it here '--sbt'
[May 04 06:18 PM]: Checking GenBank file for annotation
Skipped 3 annotations: 3 pseudo genes; 0 no CDS; 0 duplicated features
[May 04 06:18 PM]: Adding Functional Annotation to Aspergillus terreus NIH2624, NCBI accession: WGS:AAJN
[May 04 06:18 PM]: Annotation consists of: 10,551 gene models
[May 04 06:18 PM]: 10,401 protein records loaded
[May 04 06:18 PM]: Running HMMer search of PFAM version 35.0
[May 04 06:24 PM]: 12,937 annotations added
[May 04 06:24 PM]: Running Diamond blastp search of UniProt DB version 2022_04
[May 04 06:26 PM]: 892 valid gene/product annotations from 1,673 total
[May 04 06:26 PM]: Running Eggnog-mapper
[May 04 07:24 PM]: Parsing EggNog Annotations
[May 04 07:24 PM]: EggNog version parsed as 2.1.6
[May 04 07:24 PM]: 20,696  COG and EggNog annotations added
[May 04 07:24 PM]: Combining UniProt/EggNog gene and product names using Gene2Product version 1.84
[May 04 07:24 PM]: 2,681 gene name and product description annotations added
[May 04 07:24 PM]: Running Diamond blastp search of MEROPS version 12.0
[May 04 07:24 PM]: 353 annotations added
[May 04 07:24 PM]: Annotating CAZYmes using HMMer search of dbCAN version 11.0
[May 04 07:25 PM]: 546 annotations added
[May 04 07:25 PM]: Annotating proteins with BUSCO dikarya models
[May 04 07:26 PM]: 1,154 annotations added
[May 04 07:26 PM]: Skipping phobius predictions, try funannotate remote -m phobius
[May 04 07:26 PM]: Predicting secreted proteins with SignalP
[May 04 07:31 PM]: 977 secretome and 0 transmembane annotations added
[May 04 07:31 PM]: InterProScan error, aterreus/annotate_misc/iprscan.xml is empty, or no XML file passed via --iprscan. Functional annotation will be lacking.
[May 04 07:31 PM]: Found 0 duplicated annotations, adding 42,025 valid annotations
[May 04 07:31 PM]: Detected NCBI reannotation, but couldn't locate p2g file, please pass via --p2g
[May 04 07:31 PM]: Converting to final Genbank format, good luck!
[May 04 07:32 PM]: Creating AGP file and corresponding contigs file
[May 04 07:32 PM]: Writing genome annotation table.
[May 04 07:32 PM]: Funannotate annotate has completed successfully!

        We need YOUR help to improve gene names/product descriptions:
           0 gene/products names MUST be fixed, see aterreus/annotate_results/Gene2Products.must-fix.txt
           1 gene/product names need to be curated, see aterreus/annotate_results/Gene2Products.need-curating.txt
           7 gene/product names passed but are not in Database, see aterreus/annotate_results/Gene2Products.new-names-passed.txt

        Please consider contributing a PR at https://github.com/nextgenusfs/gene2product

And then running it through compare to show you that it works:

$ funannotate compare -i aterreus/annotate_results/Aspergillus_terreus_NIH2624_NIH2624.gbk -o aterreus_compare
-------------------------------------------------------
[May 04 07:40 PM]: OS: MacOSX 10.16, 8 cores, ~ 17 GB RAM. Python: 3.7.12
[May 04 07:40 PM]: Running 1.8.15
[May 04 07:40 PM]: Now parsing 1 genomes
[May 04 07:40 PM]: working on Aspergillus terreus NIH2624
[May 04 07:40 PM]: No secondary metabolite annotations found
[May 04 07:40 PM]: Summarizing PFAM domain results
[May 04 07:40 PM]: Summarizing InterProScan results
[May 04 07:40 PM]: Loading InterPro descriptions
[May 04 07:40 PM]: Summarizing MEROPS protease results
[May 04 07:40 PM]: Summarizing CAZyme results
[May 04 07:40 PM]: Summarizing COG results
[May 04 07:40 PM]: Summarizing secreted protein results
[May 04 07:40 PM]: Summarizing fungal transcription factors
[May 04 07:40 PM]: No transcription factor IPR domains found
[May 04 07:40 PM]: Compiling all annotations for each genome
[May 04 07:40 PM]: Skipping RAxML phylogeny as at least 4 taxa are required
[May 04 07:40 PM]: Compressing results to output file: aterreus_compare.tar.gz
[May 04 07:40 PM]: Funannotate compare completed successfully!

And the resulting web output:

[image: screenshot of the funannotate compare web output]

xvazquezc commented 1 year ago

All the genomes I was re-annotating were from ca. 2020. I noticed some of the files were partially annotated. Would that be an issue?

xvazquezc commented 1 year ago

I found the issue. For the genomes I was re-annotating, I took the protein files from NCBI. Funannotate uses the locus_tag to build the IDs of both genes and proteins; NCBI doesn't. So none of the output from the tools I ran externally could be matched to the funannotate IDs. I wasn't aware of the funannotate util commands that help deal with this.

@kellystyles check if you did something like that
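The mismatch described above can be illustrated with a small translation step. This is a rough sketch under stated assumptions, not funannotate's own method. It reads the fixed-column GenBank feature table with regular expressions, pairs each CDS protein_id with its locus_tag, and rewrites external IDs accordingly. The "-T1" suffix is an assumption about how funannotate names the first transcript of a gene, and the function names and sample IDs are hypothetical.

```python
import re

# Feature keys start after 5 spaces; qualifiers start after 21 spaces.
FEATURE_RE = re.compile(r"^ {5}(\S+)\s+\S")
QUAL_RE = re.compile(r'^ {21}/(locus_tag|protein_id)="([^"]+)"')


def protein_to_locus_tag(gbk_lines):
    """Map each CDS protein_id to the locus_tag of the same feature."""
    mapping = {}
    in_cds = False
    quals = {}
    for line in gbk_lines:
        feature = FEATURE_RE.match(line)
        if feature:
            # A new feature begins: forget qualifiers of the previous one.
            in_cds = feature.group(1) == "CDS"
            quals = {}
            continue
        qualifier = QUAL_RE.match(line)
        if in_cds and qualifier:
            quals[qualifier.group(1)] = qualifier.group(2)
            if len(quals) == 2:
                mapping[quals["protein_id"]] = quals["locus_tag"]
    return mapping


def rename_ids(ids, mapping, suffix="-T1"):
    """Translate NCBI protein IDs to locus_tag-based IDs (suffix is assumed)."""
    return [mapping[i] + suffix if i in mapping else i for i in ids]
```

IDs without an entry in the mapping are returned unchanged, so they remain visible for manual inspection. A simpler route is to run the external tools on the protein file that funannotate itself writes, so the IDs match from the start.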