nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

Final Genbank conversion failure #419

Closed isgilman closed 3 years ago

isgilman commented 4 years ago

Are you using the latest release?

I am still using v1.6.0-dfd805f; however, it looks like the pieces of code responsible for the error have not changed significantly since then (I checked using compare).

Describe the bug

funannotate annotate either hangs at "Converting to final Genbank format, good luck!" or fails with:

Traceback (most recent call last):
  File "/gpfs/ysm/project/edwards/isg4/conda_envs/super_funannotate/funannotate/bin/funannotate-functional.py", line 1018, in <module>
    BadProducts = lib.getFailedProductNames(discrep, Gene2ProdFinal) #return dict containing tuples of (GeneName, GeneProduct, [reason])
  File "/gpfs/ysm/project/edwards/isg4/conda_envs/super_funannotate/funannotate/lib/library.py", line 5512, in getFailedProductNames
    if 'DiscRep_SUB:SUSPECT_PRODUCT_NAMES::' in block[0]:
IndexError: list index out of range

What command did you issue?

funannotate annotate -i Portulaca-amilis.v0-FA1.6.0/ --eggnog /gpfs/ysm/scratch60/edwards/isg4/Pamilis_funannotate/ANNOTATE/Portulaca-amilis.v0-FA1.6.0/eggNOG/emapper.annotations --iprscan /gpfs/ysm/scratch60/edwards/isg4/Pamilis_funannotate/ANNOTATE/Portulaca-amilis.v0-FA1.6.0/InterProScan/5.42-78.0/Portulaca_amilis.proteins.fa.xml --busco_db embryophyta --cpus 20 --species "Portulaca amilis" --sbt Portulaca-amilis.v0-FA1.6.0/update_results/MIGS.eu.5.0.tsv

Logfiles

[11:07 AM]: OS: linux2, 36 cores, ~ 196 GB RAM. Python: 2.7.15
[11:07 AM]: Running funannotate v1.6.0-dfd805f
[11:07 AM]: Output directory Portulaca-amilis.v0-FA1.6.0 already exists, will use any existing data.  If this is not what you want, exit, and provide a unique name for output folder
[11:07 AM]: Parsing input files
[11:07 AM]: Existing tbl found: Portulaca-amilis.v0-FA1.6.0/update_results/Portulaca_amilis.tbl
[11:07 AM]: Adding Functional Annotation to Portulaca amilis, NCBI accession: None
[11:07 AM]: Annotation consists of: 53,007 gene models
[11:07 AM]: 58,571 protein records loaded
[11:07 AM]: Existing Pfam-A results found: Portulaca-amilis.v0-FA1.6.0/annotate_misc/annotations.pfam.txt
[11:07 AM]: 49,237 annotations added
[11:07 AM]: Running Diamond blastp search of UniProt DB version 2020_01
[11:07 AM]: 7,288 valid gene/product annotations from 10,848 total
[11:07 AM]: Existing Eggnog-mapper results found: Portulaca-amilis.v0-FA1.6.0/annotate_misc/eggnog.emapper.annotations
[11:07 AM]: Parsing EggNog Annotations
[11:07 AM]: 18,835 COG and EggNog annotations added
[11:07 AM]: Combining UniProt/EggNog gene and product names using Gene2Product version 1.55
[11:07 AM]: 7,288 gene name and product description annotations added
[11:07 AM]: Existing MEROPS results found: Portulaca-amilis.v0-FA1.6.0/annotate_misc/annotations.merops.txt
[11:07 AM]: 1,266 annotations added
[11:07 AM]: Existing CAZYme results found: Portulaca-amilis.v0-FA1.6.0/annotate_misc/annotations.dbCAN.txt
[11:07 AM]: 2,073 annotations added
[11:07 AM]: Existing BUSCO2 results found: Portulaca-amilis.v0-FA1.6.0/annotate_misc/annotations.busco.txt
[11:07 AM]: 1,893 annotations added
[11:07 AM]: Skipping phobius predictions, try funannotate remote -m phobius
[11:07 AM]: Skipping secretome: neither SignalP nor Phobius searches were run
[11:07 AM]: 0 secretome and 0 transmembane annotations added
[11:07 AM]: Parsing InterProScan5 XML file
[11:08 AM]: Found 0 duplicated annotations, adding 247,018 valid annotations
[11:08 AM]: Converting to final Genbank format, good luck!
Traceback (most recent call last):
  File "/gpfs/ysm/project/edwards/isg4/conda_envs/super_funannotate/funannotate/bin/funannotate-functional.py", line 1018, in <module>
    BadProducts = lib.getFailedProductNames(discrep, Gene2ProdFinal) #return dict containing tuples of (GeneName, GeneProduct, [reason])
  File "/gpfs/ysm/project/edwards/isg4/conda_envs/super_funannotate/funannotate/lib/library.py", line 5512, in getFailedProductNames
    if 'DiscRep_SUB:SUSPECT_PRODUCT_NAMES::' in block[0]:
IndexError: list index out of range

OS/Install Information

-------------------------------------------------------
Checking dependencies for funannotate v1.6.0-dfd805f
-------------------------------------------------------
You are running Python v 2.7.15. Now checking python packages...
biopython: 1.73
goatools: 0.8.12
matplotlib: 2.2.3
natsort: 6.0.0
numpy: 1.16.3
pandas: 0.24.2
psutil: 5.6.2
requests: 2.21.0
scikit-learn: 0.20.3
scipy: 1.2.1
seaborn: 0.9.0
All 11 python packages installed

You are running Perl v 5.026002. Now checking perl modules...
Bio::Perl: 1.007002
Carp: 1.38
Clone: 0.41
DBD::SQLite: 1.60
DBD::mysql: 4.046
DBI: 1.642
DB_File: 1.852
Data::Dumper: 2.173
File::Basename: 2.85
File::Which: 1.23
Getopt::Long: 2.5
Hash::Merge: 0.300
JSON: 4.00
LWP::UserAgent: 6.36
Logger::Simple: 2.0
POSIX: 1.76
Parallel::ForkManager: 2.02
Pod::Usage: 1.69
Scalar::Util::Numeric: 0.40
Storable: 3.11
Text::Soundex: 3.05
Thread::Queue: 3.13
Tie::File: 1.02
URI::Escape: 3.31
YAML: 1.27
threads: 2.21
threads::shared: 1.59
All 27 Perl modules installed

Checking external dependencies...
CodingQuarry: 2.0
RepeatMasker: RepeatMasker 4.0.8
RepeatModeler: RepeatModeler version DEV
Trinity: 2.5.1
augustus: 3.2.3
bamtools: bamtools 2.4.1
bedtools: bedtools v2.28.0
blat: BLAT v36
diamond: diamond 0.9.24
emapper.py: emapper-0.12.7
ete3: 3.1.1
exonerate: exonerate 2.4.0
fasta: no way to determine
gmap: 2018-07-04
gmes_petap.pl: 4.38
hisat2: 2.1.0
hmmscan: HMMER 3.2.1 (June 2018)
hmmsearch: HMMER 3.2.1 (June 2018)
java: 11.0.1
kallisto: 0.45.1
mafft: v7.407 (2018/Jul/23)
makeblastdb: makeblastdb 2.6.0+
minimap2: 2.16-r922
nucmer: 3.1
pslCDnaFilter: no way to determine
rmblastn: rmblastn 2.6.0+
samtools: samtools 1.9
stringtie: 1.3.6
tRNAscan-SE: 2.0 (December 2017)
tbl2asn: unknown, likely 25.3
tblastn: tblastn 2.6.0+
trimal: trimAl v1.4.rev15 build[2013-12-17]
All 32 external dependencies are installed

Checking Environmental Variables...
$FUNANNOTATE_DB=/gpfs/ysm/scratch60/isg4/Pamilis_FUNannotate/TRAIN/funannotate_database
$PASAHOME=/gpfs/ysm/project/isg4/conda_envs/super_funannotate/funannotate_deps/PASApipeline
$TRINITYHOME=/gpfs/ysm/project/isg4/conda_envs/super_funannotate/opt/trinity-2.5.1
$EVM_HOME=/gpfs/ysm/project/isg4/conda_envs/super_funannotate/funannotate_deps/evidencemodeler
$AUGUSTUS_CONFIG_PATH=/gpfs/ysm/project/isg4/conda_envs/super_funannotate/config/
$GENEMARK_PATH=/gpfs/ysm/project/isg4/conda_envs/super_funannotate/funannotate_deps/gmes_petap
$BAMTOOLS_PATH=/gpfs/ysm/project/isg4/conda_envs/super_funannotate/bin
All 7 environmental variables are set

I know this version is out of date but we're so close to the final annotation! Thank you for all the help!

Ian

nextgenusfs commented 4 years ago

Looks like maybe tbl2asn didn’t finish or crashed? There should be more info in the logfile I think? Is this the terminal stdout or the logfile?

isgilman commented 4 years ago

This was terminal stdout, and you're right that the last thing running was tbl2asn. The final command in funannotate-annotate.log was

tbl2asn -y "Annotated using funannotate v1.6.0-dfd805f" -N 1 -t Portulaca-amilis.v0-FA1.6.0/update_results/MIGS.eu.5.0.tsv -M n -j "[organism=Portulaca amilis]" -V b -c fx -T -a r10u -l paired-ends -Z Portulaca-amilis.v0-FA1.6.0/annotate_misc/tbl2asn/1/discrepency.report.txt -p Portulaca-amilis.v0-FA1.6.0/annotate_misc/tbl2asn/1

I've tried running this as an interactive job in slurm (with srun) and submitting it as a batch file (with sbatch). When run interactively, funannotate annotate never finishes, even after 12 hours. When run as a batch job, it finishes quickly but throws the IndexError above, which prints to stderr.

I tried running the command again (but forgot to add my emapper results) and got this log:

[05/05/20 09:50:19]: /gpfs/ysm/project/edwards/isg4/conda_envs/super_funannotate/funannotate/bin/funannotate-functional.py -i Portulaca-amilis.v0-FA1.6.0/ --iprscan /gpfs/ysm/scratch60/edwards/isg4/Pamilis_funannotate/ANNOTATE/Portulaca-amilis.v0-FA1.6.0/InterProScan/5.42-78.0/Portulaca_amilis.proteins.fa.xml --busco_db embryophyta --cpus 20 --species Portulaca amilis --sbt Portulaca-amilis.v0-FA1.6.0/update_results/MIGS.eu.5.0.tsv

[05/05/20 09:50:19]: OS: linux2, 20 cores, ~ 131 GB RAM. Python: 2.7.15
[05/05/20 09:50:19]: Running funannotate v1.6.0-dfd805f
[05/05/20 09:50:19]: Output directory Portulaca-amilis.v0-FA1.6.0 already exists, will use any existing data.  If this is not what you want, exit, and provide a unique name for output folder
[05/05/20 09:50:19]: Parsing input files
[05/05/20 09:50:19]: Existing tbl found: Portulaca-amilis.v0-FA1.6.0/update_results/Portulaca_amilis.tbl
[05/05/20 09:51:20]: Adding Functional Annotation to Portulaca amilis, NCBI accession: None
[05/05/20 09:51:20]: Annotation consists of: 53,007 gene models
[05/05/20 09:51:20]: 58,571 protein records loaded
[05/05/20 09:51:21]: Existing Pfam-A results found: Portulaca-amilis.v0-FA1.6.0/annotate_misc/annotations.pfam.txt
[05/05/20 09:51:21]: 49,237 annotations added
[05/05/20 09:51:21]: Running Diamond blastp search of UniProt DB version 2020_01
[05/05/20 09:51:33]: 7,288 valid gene/product annotations from 10,848 total
[05/05/20 09:51:34]: Running Eggnog-mapper
[05/05/20 09:51:34]: emapper.py -m diamond -i /gpfs/ysm/scratch60/edwards/isg4/Pamilis_funannotate/TRAIN/Portulaca-amilis.v0-FA1.6.0/annotate_misc/genome.proteins.fasta -o eggnog --cpu 20
[05/05/20 09:51:34]: Annotation database data/eggnog.db not present. Use download_eggnog_database.py to fetch it

[05/05/20 09:51:34]: No Eggnog-mapper results found.
[05/05/20 09:51:34]: Combining UniProt/EggNog gene and product names using Gene2Product version 1.55
[05/05/20 09:51:35]: 7,288 gene name and product description annotations added
[05/05/20 09:51:35]: Existing MEROPS results found: Portulaca-amilis.v0-FA1.6.0/annotate_misc/annotations.merops.txt
[05/05/20 09:51:35]: 1,266 annotations added
[05/05/20 09:51:35]: Existing CAZYme results found: Portulaca-amilis.v0-FA1.6.0/annotate_misc/annotations.dbCAN.txt
[05/05/20 09:51:35]: 2,073 annotations added
[05/05/20 09:51:35]: Existing BUSCO2 results found: Portulaca-amilis.v0-FA1.6.0/annotate_misc/annotations.busco.txt
[05/05/20 09:51:35]: 1,893 annotations added
[05/05/20 09:51:35]: Skipping phobius predictions, try funannotate remote -m phobius
[05/05/20 09:51:35]: Skipping secretome: neither SignalP nor Phobius searches were run
[05/05/20 09:51:35]: 0 secretome and 0 transmembane annotations added
[05/05/20 09:51:36]: Parsing InterProScan5 XML file
[05/05/20 09:51:36]: /gpfs/ysm/project/isg4/conda_envs/super_funannotate/bin/python /gpfs/ysm/project/edwards/isg4/conda_envs/super_funannotate/funannotate/util/iprscan2annotations.py Portulaca-amilis.v0-FA1.6.0/annotate_misc/iprscan.xml Portulaca-amilis.v0-FA1.6.0/annotate_misc/annotations.iprscan.txt
[05/05/20 09:52:10]: Found 0 duplicated annotations, adding 228,183 valid annotations
[05/05/20 09:52:11]: Parsing tbl file: /gpfs/ysm/scratch60/edwards/isg4/Pamilis_funannotate/TRAIN/Portulaca-amilis.v0-FA1.6.0/annotate_misc/genome.tbl
[05/05/20 09:52:13]: Converting to final Genbank format, good luck!
[05/05/20 09:52:22]: tbl2asn -y "Annotated using funannotate v1.6.0-dfd805f" -N 1 -t Portulaca-amilis.v0-FA1.6.0/update_results/MIGS.eu.5.0.tsv -M n -j "[organism=Portulaca amilis]" -V b -c fx -T -a r10u -l paired-ends -Z Portulaca-amilis.v0-FA1.6.0/annotate_misc/tbl2asn/3/discrepency.report.txt -p Portulaca-amilis.v0-FA1.6.0/annotate_misc/tbl2asn/3
[05/05/20 09:52:22]: tbl2asn -y "Annotated using funannotate v1.6.0-dfd805f" -N 1 -t Portulaca-amilis.v0-FA1.6.0/update_results/MIGS.eu.5.0.tsv -M n -j "[organism=Portulaca amilis]" -V b -c fx -T -a r10u -l paired-ends -Z Portulaca-amilis.v0-FA1.6.0/annotate_misc/tbl2asn/5/discrepency.report.txt -p Portulaca-amilis.v0-FA1.6.0/annotate_misc/tbl2asn/5
[05/05/20 09:52:22]: tbl2asn -y "Annotated using funannotate v1.6.0-dfd805f" -N 1 -t Portulaca-amilis.v0-FA1.6.0/update_results/MIGS.eu.5.0.tsv -M n -j "[organism=Portulaca amilis]" -V b -c fx -T -a r10u -l paired-ends -Z Portulaca-amilis.v0-FA1.6.0/annotate_misc/tbl2asn/2/discrepency.report.txt -p Portulaca-amilis.v0-FA1.6.0/annotate_misc/tbl2asn/2
[05/05/20 09:52:22]: tbl2asn -y "Annotated using funannotate v1.6.0-dfd805f" -N 1 -t Portulaca-amilis.v0-FA1.6.0/update_results/MIGS.eu.5.0.tsv -M n -j "[organism=Portulaca amilis]" -V b -c fx -T -a r10u -l paired-ends -Z Portulaca-amilis.v0-FA1.6.0/annotate_misc/tbl2asn/4/discrepency.report.txt -p Portulaca-amilis.v0-FA1.6.0/annotate_misc/tbl2asn/4
[05/05/20 09:52:22]: tbl2asn -y "Annotated using funannotate v1.6.0-dfd805f" -N 1 -t Portulaca-amilis.v0-FA1.6.0/update_results/MIGS.eu.5.0.tsv -M n -j "[organism=Portulaca amilis]" -V b -c fx -T -a r10u -l paired-ends -Z Portulaca-amilis.v0-FA1.6.0/annotate_misc/tbl2asn/1/discrepency.report.txt -p Portulaca-amilis.v0-FA1.6.0/annotate_misc/tbl2asn/1

nextgenusfs commented 4 years ago

Things to try: upgrade tbl2asn and see if that fixes the behavior. Because you have a large genome, funannotate is trying to split the input and run tbl2asn in parallel. I think this code was updated in the most recent version, so it's possible that updating funannotate could fix it. Alternatively, you can try running the tbl2asn command manually to generate the GenBank output and subsequent submission files. If you can run it interactively, it is typically easier to spot errors that the program is outputting. I seem to recall tbl2asn printing a warning to update once it is over 1 year old; that could be what is causing it to die silently, I suppose.
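
A minimal sketch of the "run it manually and look at what it prints" suggestion, written for Python 3 (the environment in this thread is Python 2.7, so treat it as illustrative only). The helper names `run_and_report`, `describe`, and `report` are hypothetical and not part of funannotate. They run an arbitrary command and surface its exit code and stderr, so a tool that dies silently becomes visible.

```python
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd (a list of arguments) and return (exit_code, stdout, stderr).

    Hypothetical helper, not part of funannotate. A missing executable is
    reported as exit code 127 instead of raising an exception.
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
    except FileNotFoundError as err:
        return 127, "", str(err)
    return proc.returncode, proc.stdout, proc.stderr


def describe(result):
    """Format a run_and_report() result as a one-line summary."""
    code, out, err = result
    status = "ok" if code == 0 else "FAILED"
    return "exit={} ({}), {} stdout chars, stderr: {}".format(
        code, status, len(out), err.strip() or "<empty>"
    )


def report(result, stream=sys.stderr):
    """Write the summary to a stream (stderr by default)."""
    stream.write(describe(result) + "\n")
```

Calling `report(run_and_report(cmd))` with `cmd` set to the tbl2asn argument list copied from funannotate-annotate.log should show whether tbl2asn exits non-zero or prints a warning before stopping.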

isgilman commented 4 years ago

Thanks again for the help. I've made a little progress on this issue but haven't gotten annotate to complete. On your advice I checked tbl2asn, and it was giving me the "over 1 year old" error. I'd already installed the latest version of funannotate (1.7.4), but got the same issue because conda's version of tbl2asn is 25.7, while the most recent version from NCBI is 25.8.

I copied the new version into conda_envs/funannotate/bin/, which works, but annotate still failed:

[05/11/20 11:42:48]: ERROR: GBK file conversion failed, tbl2asn parallel script has died

So I followed up on running the tbl2asn command independently and this resulted in a problem with tbl2asn_parallel.py:

(/gpfs/ysm/project/edwards/isg4/conda_envs/funannotate) [isg4@c14n02 TRAIN]$  /gpfs/ysm/project/edwards/isg4/conda_envs/funannotate/bin/python /gpfs/ysm/project/edwards/isg4/conda_envs/funannotate/lib/python2.7/site-packages/funannotate/aux_scripts/tbl2asn_parallel.py -i Portulaca-amilis.v0-FA1.6.0/annotate_misc/tbl2asn/genome.tbl -f Portulaca-amilis.v0-FA1.6.0/annotate_misc/tbl2asn/genome.fsa -o Portulaca-amilis.v0-FA1.6.0/annotate_misc/tbl2asn --sbt Portulaca-amilis.v0-FA1.6.0/update_results/MIGS.eu.5.0.tsv -d discrepency.report.txt -s Portulaca_amilis -t -l paired-ends -v 1 -c 20
usage: tbl2asn_parallel.py [-h] -i INPUT -f FASTA -s SPECIES -o OUT --sbt SBT
                           [--isolate ISOLATE] [--strain STRAIN] [-c CPUS] -d
                           DISCREP [-t TBL2ASN] [-v VERSION]
tbl2asn_parallel.py: error: argument -t/--tbl2asn: expected one argument
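
One possible reading of the usage error above, offered as a sketch and not a diagnosis: the logged command is printed as a space-joined list, so a value such as `-l paired-ends` that was originally a single list element becomes two separate tokens when the line is pasted into a shell without quotes. The parser below is a hypothetical stand-in with only two of the options from the usage text, not the real tbl2asn_parallel.py, and the default value for `-t` is an assumption. It is written for Python 3 and relies on how CPython's argparse treats an unknown dash-prefixed token that contains a space, as I understand it.

```python
import argparse
import shlex


def build_parser():
    # Hypothetical stand-in parser: only -t and -c, modeled on the usage
    # text above. The default for -t is assumed for illustration.
    parser = argparse.ArgumentParser(prog="tbl2asn_parallel_sketch")
    parser.add_argument("-t", "--tbl2asn", default="-l paired-ends")
    parser.add_argument("-c", "--cpus", type=int, default=1)
    return parser


def try_parse(argv):
    """Return the parsed -t value, or None if argparse rejects argv."""
    try:
        return build_parser().parse_args(argv).tbl2asn
    except SystemExit:
        # argparse prints the usage error to stderr and exits with code 2.
        return None


# Pasted into a shell unquoted: "-l" and "paired-ends" are separate tokens,
# so -t is immediately followed by something that looks like another flag.
unquoted = shlex.split("-t -l paired-ends -c 20")

# Quoted in the shell (or passed as one element of a subprocess argument
# list): the value reaches argparse as the single string "-l paired-ends".
quoted = shlex.split('-t "-l paired-ends" -c 20')
```

Here `try_parse(unquoted)` fails with the same "expected one argument" message, while `try_parse(quoted)` is accepted. If that reading applies, quoting the value when re-running the logged command by hand would be an alternative to editing cmd.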

I thought this might have to do with the way annotate.py builds the subprocess command for tbl2asn_parallel.py, particularly the -t -l paired-ends part. It looked like -t was receiving no argument and that an unknown flag (-l) was receiving paired-ends. Since I'm not modifying the behavior of tbl2asn, and tbl2asn_parallel.py creates a command with -l paired-ends by default, I edited cmd (lines 1078-1083 in annotate.py) to remove the respective arguments, changing it from

cmd = [sys.executable, os.path.join(parentdir, 'aux_scripts', 'tbl2asn_parallel.py'),
           '-i', TBLOUT, '-f', os.path.join(outputdir,
                                            'annotate_misc', 'tbl2asn', 'genome.fsa'),
           '-o', os.path.join(outputdir, 'annotate_misc',
                              'tbl2asn'), '--sbt', SBT, '-d', discrep,
           '-s', organism, '-t', args.tbl2asn, '-v', str(annot_version), '-c', str(args.cpus)]

to

cmd = [sys.executable, os.path.join(parentdir, 'aux_scripts', 'tbl2asn_parallel.py'),
           '-i', TBLOUT, '-f', os.path.join(outputdir,
                                            'annotate_misc', 'tbl2asn', 'genome.fsa'),
           '-o', os.path.join(outputdir, 'annotate_misc',
                              'tbl2asn'), '--sbt', SBT, '-d', discrep,
           '-s', organism, '-v', str(annot_version), '-c', str(args.cpus)]

This ended up working; however, it now looks like the results from tbl2asn are not being combined correctly. The files errorsummary.val, genome.gbf, genome.val, and discrepency.report.txt are all empty.

[isg4@c14n02 tbl2asn]$ ls -lha
drwxr-xr-x 2 isg4 edwards 4.0K May 11 12:06 1
drwxr-xr-x 2 isg4 edwards 4.0K May 11 12:06 2
drwxr-xr-x 2 isg4 edwards 4.0K May 11 12:06 3
drwxr-xr-x 2 isg4 edwards 4.0K May 11 12:06 4
drwxr-xr-x 2 isg4 edwards 4.0K May 11 12:06 5
-rw-r--r-- 1 isg4 edwards    0 May 11 12:06 errorsummary.val
-rw-r--r-- 1 isg4 edwards 7.2M May 11 12:06 genome1.tbl
-rw-r--r-- 1 isg4 edwards 7.3M May 11 12:06 genome2.tbl
-rw-r--r-- 1 isg4 edwards 7.1M May 11 12:06 genome3.tbl
-rw-r--r-- 1 isg4 edwards 8.0M May 11 12:06 genome4.tbl
-rw-r--r-- 1 isg4 edwards 3.5M May 11 12:06 genome5.tbl
-rw-r--r-- 1 isg4 edwards 391M May 11 12:06 genome.fsa
-rw-r--r-- 1 isg4 edwards    0 May 11 12:06 genome.gbf
-rw-r--r-- 1 isg4 edwards  33M May 11 12:06 genome.tbl
-rw-r--r-- 1 isg4 edwards    0 May 11 12:06 genome.val
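
To narrow down whether the combined files are empty because individual chunks produced nothing, a small sketch like the following could list what each numbered chunk directory contains. The function names are hypothetical and not part of funannotate, and nothing is assumed about the file names inside the chunk directories. The sketch is written for Python 3.

```python
import os


def chunk_output_sizes(tbl2asn_dir):
    """Map each numbered chunk directory to {filename: size_in_bytes}.

    Only sub-directories whose names are integers (1, 2, 3, ...) are
    inspected, matching the layout in the listing above.
    """
    sizes = {}
    for name in sorted(os.listdir(tbl2asn_dir)):
        path = os.path.join(tbl2asn_dir, name)
        if name.isdigit() and os.path.isdir(path):
            sizes[name] = {
                f: os.path.getsize(os.path.join(path, f))
                for f in sorted(os.listdir(path))
                if os.path.isfile(os.path.join(path, f))
            }
    return sizes


def empty_or_missing(sizes, suffix):
    """Return chunk names that have no non-empty file ending in suffix."""
    return [
        chunk
        for chunk, files in sorted(sizes.items())
        if not any(f.endswith(suffix) and n > 0 for f, n in files.items())
    ]
```

For instance, `empty_or_missing(chunk_output_sizes("Portulaca-amilis.v0-FA1.6.0/annotate_misc/tbl2asn"), ".gbf")` would name the chunks that produced no GenBank output, which are the ones worth re-running by hand.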

Any thoughts or suggestions would be appreciated.

Ian