nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License
300 stars 82 forks source link

KeyError: 'FUN_000001' in funannotate annotate #1016

Closed bpeacock44 closed 2 months ago

bpeacock44 commented 3 months ago

Are you using the latest release? Yes, running 1.8.17

Describe the bug When I run funannotate annotate on 2 of my 12 genomes, I immediately get the error: KeyError: 'FUN_000001' (or KeyError: 'FUN_000003' for the other genome.) The other 10 genomes have run successfully.

What command did you issue? funannotate annotate -i "${WDIR}/genomes/04_${F}_fun" --cpus 19 --iprscan "${WDIR}/genomes/04_${F}_fun/annotate_misc/iprscan.xml" --antismash "${DIR}/${filen}.ANTISMASH.gbk"

Logfiles I have not been able to find a funannotate log file - I don't think it was generated yet. But here is the output:

-------------------------------------------------------
[Mar 28 07:32 AM]: OS: Rocky Linux 8.8, 256 cores, ~ 1057 GB RAM. Python: 3.8.15
[Mar 28 07:32 AM]: Running 1.8.17
[Mar 28 07:32 AM]: No NCBI SBT file given, will use default, however if you plan to submit to NCBI, create one and pass it here '--sbt'
[Mar 28 07:32 AM]: Found existing output directory /rhome/bpeacock/bigdata/PN97_fungal_genomes/genomes/04_CIM1_fun. Warning, will re-use any intermediate files found.
[Mar 28 07:32 AM]: Parsing input files
[Mar 28 07:32 AM]: Existing tbl found: /rhome/bpeacock/bigdata/PN97_fungal_genomes/genomes/04_CIM1_fun/update_results/Arthrobotrys_flagrans_CIM1.tbl
Traceback (most recent call last):
  File "/opt/linux/rocky/8.x/x86_64/pkgs/funannotate/1.8.x/bin/funannotate", line 8, in <module>
    sys.exit(main())
  File "/opt/linux/rocky/8.x/x86_64/pkgs/funannotate/1.8.x/lib/python3.8/site-packages/funannotate/funannotate.py", line 717, in main
    mod.main(arguments)
  File "/opt/linux/rocky/8.x/x86_64/pkgs/funannotate/1.8.x/lib/python3.8/site-packages/funannotate/annotate.py", line 816, in main
    GeneCounts = lib.gb2nucleotides(
  File "/opt/linux/rocky/8.x/x86_64/pkgs/funannotate/1.8.x/lib/python3.8/site-packages/funannotate/library.py", line 3978, in gb2nucleotides
    gb_feature_add2dict(f, record, genes)
  File "/opt/linux/rocky/8.x/x86_64/pkgs/funannotate/1.8.x/lib/python3.8/site-packages/funannotate/library.py", line 4350, in gb_feature_add2dict
    if len(genes[locusTag]["mRNA"]) == 0:
KeyError: 'FUN_000001'

OS/Install Information For reference, I am using funannotate as installed on the UCR HPCC

-------------------------------------------------------
Checking dependencies for 1.8.17
-------------------------------------------------------
You are running Python v 3.8.15. Now checking python packages...
biopython: 1.79
goatools: 1.2.3
matplotlib: 3.4.3
natsort: 8.4.0
numpy: 1.24.4
pandas: 1.5.3
psutil: 5.9.8
requests: 2.31.0
scikit-learn: 1.3.2
scipy: 1.10.1
seaborn: 0.13.2
All 11 python packages installed

You are running Perl v b'5.032001'. Now checking perl modules...
Carp: 1.38
Clone: 0.46
DBD::SQLite: 1.72
DBD::mysql: 4.046
DBI: 1.643
DB_File: 1.858
Data::Dumper: 2.183
File::Basename: 2.85
File::Which: 1.24
Getopt::Long: 2.54
Hash::Merge: 0.302
JSON: 4.10
LWP::UserAgent: 6.67
Logger::Simple: 2.0
POSIX: 1.94
Parallel::ForkManager: 2.02
Pod::Usage: 1.69
Scalar::Util::Numeric: 0.40
Storable: 3.15
Text::Soundex: 3.05
Thread::Queue: 3.14
Tie::File: 1.06
URI::Escape: 5.12
YAML: 1.30
local::lib: 2.000029
threads: 2.25
threads::shared: 1.61
All 27 Perl modules installed

Checking Environmental Variables...
$FUNANNOTATE_DB=/srv/projects/db/funannotate/1.8
$PASAHOME=/rhome/bpeacock/.pasa
$TRINITYHOME=/opt/linux/rocky/8.x/x86_64/pkgs/funannotate/1.8.x/opt/trinity-2.8.5
$EVM_HOME=/bigdata/operations/pkgadmin/opt/linux/centos/8.x/x86_64/pkgs/funannotate/1.8.x/opt/evidencemodeler-1.1.1
$AUGUSTUS_CONFIG_PATH=/bigdata/operations/pkgadmin/opt/linux/centos/8.x/x86_64/pkgs/funannotate/1.8.x/config/
$GENEMARK_PATH=/opt/linux/rocky/8.x/x86_64/pkgs/genemarkESET/4.71_lic
All 6 environmental variables are set
-------------------------------------------------------
Checking external dependencies...
        ERROR: ete3 found but error running ete3
        ERROR: tRNAscan-SE found but error running tRNAscan-SE
PASA: 2.5.2
CodingQuarry: 2.0
Trinity: 2.8.5
augustus: 3.5.0
bamtools: bamtools 2.5.1
bedtools: bedtools v2.31.1
blat: BLAT v37x1
diamond: 2.1.8
emapper.py: 2.1.12
exonerate: exonerate 2.4.0
fasta: 36.3.8g
glimmerhmm: 3.0.4
gmap: 2024-02-22
gmes_petap.pl: 4.71_lic
hisat2: 2.2.1
hmmscan: HMMER 3.4 (Aug 2023)
hmmsearch: HMMER 3.4 (Aug 2023)
java: 17.0.3-internal
kallisto: 0.46.1
mafft: v7.520 (2023/Mar/22)
makeblastdb: makeblastdb 2.14.1+
minimap2: 2.26-r1175
pigz: 2.8
proteinortho: 6.3.1
pslCDnaFilter: no way to determine
salmon: salmon 0.14.1
samtools: samtools 1.18
signalp: 6.0
snap: 2006-07-28
stringtie: 2.2.1
tantan: tantan 49
tbl2asn: 25.8
tblastn: tblastn 2.14.1+
trimal: trimAl v1.4.rev15 build[2013-12-17]
trimmomatic: 0.39
        ERROR: ete3 not installed
        ERROR: tRNAscan-SE not installed
bpeacock44 commented 3 months ago

Hi @nextgenusfs, as an update, I have four genomes that got this error. In 3 of them, the locus tag throwing the error is the first one listed in the genbank file. In once case, it's the third one - which I didn't expect. Is there something wrong with those specific locus tags? I'm not sure what to check.

nextgenusfs commented 2 months ago

Can you show the first few gene records from here: /rhome/bpeacock/bigdata/PN97_fungal_genomes/genomes/04_CIM1_fun/update_results/Arthrobotrys_flagrans_CIM1.tbl

bpeacock44 commented 2 months ago

Yes:

Feature scaffold_1 1 5622016 REFERENCE CFMR 12345 27454 28144 gene locus_tag FUN_000001 27454 27497 mRNA 27877 28144 product hypothetical protein transcript_id gnl|ncbi|FUN_000001-T1_mrna protein_id gnl|ncbi|FUN_000001-T1 27454 27497 CDS 27877 28144 codon_start 1 product hypothetical protein transcript_id gnl|ncbi|FUN_000001-T1_mrna protein_id gnl|ncbi|FUN_000001-T1 <30619 31119 gene locus_tag FUN_000002 <30619 31119 mRNA product hypothetical protein transcript_id gnl|ncbi|FUN_000002-T1_mrna protein_id gnl|ncbi|FUN_000002-T1 <30619 31119 CDS codon_start 1 product hypothetical protein transcript_id gnl|ncbi|FUN_000002-T1_mrna protein_id gnl|ncbi|FUN_000002-T1 43405 42323 gene locus_tag FUN_000003 43405 43168 mRNA 43097 42323 product hypothetical protein transcript_id gnl|ncbi|FUN_000003-T1_mrna protein_id gnl|ncbi|FUN_000003-T1 43022 42402 CDS codon_start 1 product hypothetical protein transcript_id gnl|ncbi|FUN_000003-T1_mrna protein_id gnl|ncbi|FUN_000003-T1

Here is from another genome that didn't throw the error:

Feature scaffold_1 1 8615222 REFERENCE CFMR 12345 12064 11594 gene locus_tag FUN_000001 12064 12038 mRNA 11890 11594 product hypothetical protein transcript_id gnl|ncbi|FUN_000001-T1_mrna protein_id gnl|ncbi|FUN_000001-T1 12064 12038 CDS 11890 11594 codon_start 1 product hypothetical protein transcript_id gnl|ncbi|FUN_000001-T1_mrna protein_id gnl|ncbi|FUN_000001-T1 20093 19278 gene locus_tag FUN_000002 20093 19278 mRNA product hypothetical protein transcript_id gnl|ncbi|FUN_000002-T1_mrna protein_id gnl|ncbi|FUN_000002-T1 20093 19278 CDS codon_start 1 product hypothetical protein transcript_id gnl|ncbi|FUN_000002-T1_mrna protein_id gnl|ncbi|FUN_000002-T1 23010 23347 gene locus_tag FUN_000003 23010 23139 mRNA 23193 23347 product hypothetical protein transcript_id gnl|ncbi|FUN_000003-T1_mrna protein_id gnl|ncbi|FUN_000003-T1 23010 23139 CDS 23193 23347 codon_start 1 product hypothetical protein transcript_id gnl|ncbi|FUN_000003-T1_mrna protein_id gnl|ncbi|FUN_000003-T1

nextgenusfs commented 2 months ago

Okay so the tbl file looks fine but the error log you posted is helpful it is saying that this function is failing:

'lib.gb2nucleotides'

That is a function that is parsing the genbank file (also in the predict_results output directory), so I'm wondering if tbl2asn has created a malformed genbank file.

bpeacock44 commented 2 months ago

Just fyi - I ran the entire pipeline from scratch on this one (CIM1) and annotate started to run successfully if I ran it before generating antismash or interproscan annotation. On some of the genomes I had issues with antismash - (e.g. #ERROR 27/03 13:27:56 Multiple CDS features have the same location: join{[82259:82303](+), [82448:82494](+), [82616:82646](+), [82766:82950](+), [83002:83142](+)}) and I was able to work around it using the suggestion here but this happened in some genomes that finished annotate successfully and also some unsuccessful ones. It also didn't occur on any of the genes that are throwing the error.

bpeacock44 commented 2 months ago

Alright, it does seem to be the antismash gbk I generated. If I remove the file and don't include --antismash in the funannotate annotate command it proceeds perfectly. (I tried to just not use --antismash before but it seems like if the file is present it still tried to use it for some reason?)

The workaround (removing any additional transcripts of locus tags from the gff and then using that gff to run antismash) seems to only have caused problems with some of the samples, since 4 of them get the KeyError during annotate but the others that had similar errors with antismash (multiple CDS features have the same location) and which I also processed using the workaround ran annotate just fine.

@nextgenusfs is there any particular recommendation you have for the antismash error regarding multiple CDS with the same location? I would like to annotate with antismash for all the genomes if possible. I found these errors were thrown when there were two identical locus tags with the same associated sequence (e.g. FUN_007543-T1, FUN_007543-T2) - I assume these are different transcripts?

These "removed" locus tags were not the ones that are throwing the KeyError in the annotate step so something overall must have been thrown off.

Many thanks

nextgenusfs commented 2 months ago

Okay, thanks for troubleshooting and posting. This is most likely due to the fact that genbank format is not capable of properly representing alternate transcripts (ie -T1, -T2, -T3, etc) -- that is genbank has no inherent way to link the transcript (mRNA) feature to the corresponding (CDS) feature, ie they are just linked by the locus_tag which then is probably causing problems with the genbank parser in antiSMASH. I've also had issues with their GFF parser, which is why I typically recommend using the genbank file. Perhaps one work around would be to generate a genbank file for antiSMASH that only has -T1 transcripts, ie no alt transcripts. Funannotate will always designate -T1 transcripts that were the most abundant based on the RNA-seq data, so ultimately for antiSMASH prediction this should make no difference. Let me see if I can come up with an easy button for this, I'll likely package in my GFFtk toolkit (which is now a separate package and is a dependency of funannotate2). But effectively we would want a GFF parser that would have an option like --strip-alt-transcripts to write only the primary transcript for each locus. Or a more robust solution would be something like --filter -T1 to be a more robust method of filtering/selecting gene models from annotation. Will probably take me several days to find the time to look into this. Apologies in advance.

bpeacock44 commented 2 months ago

No problem - thank you for your help!

bpeacock44 commented 2 months ago

Hi again @nextgenusfs, I wrote some code to strip alternate transcripts directly from the gbk before using it with antismash. Antismash finished without errors and when I used the output with annotate, it didn't give me the KeyError! I'm sure a more elegant solution would be beneficial though. Again I appreciate your help so I could figure out what needed to be fixed.