nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

funannotate test - Not enough gene models to train Augustus #552

Closed by rpetit3 3 years ago

rpetit3 commented 3 years ago

Are you using the latest release? Yep (I think)! Installed through Conda

funannotate --version
funannotate v1.8.3

Describe the bug I ran the test set, and I got the following error:

[06:02 PM]: Not enough gene models 175 to train Augustus (200 required), exiting

This might be related to https://github.com/nextgenusfs/funannotate/issues/418 (tblastn exiting early), but there isn't a way to set --min_training_models when running the test. I'm trying a rerun with 1 CPU to rule out multithreaded tblastn causing BUSCO to exit early. I'll let you know whether it succeeds.
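For anyone comparing runs, here is a minimal sketch (not part of funannotate) that pulls the two counts out of the error line, so the shortfall can be checked after a rerun. The regular expression assumes the exact message wording quoted above, which may differ in other versions; `training_shortfall` is a hypothetical helper name.

```python
import re

# Assumption: the message wording matches the log line quoted in this issue.
SHORTFALL = re.compile(
    r"Not enough gene models (\d+) to train Augustus \((\d+) required\)"
)


def training_shortfall(log_text):
    """Return (found, required) from the first matching log line, or None."""
    match = SHORTFALL.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


line = ("[06:02 PM]: Not enough gene models 175 to train Augustus "
        "(200 required), exiting")
result = training_shortfall(line)
if result is not None:
    found, required = result
    print(f"{found} models found, {required} required, short by {required - found}")
```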

What command did you issue?

funannotate test -t all --cpus 8         

Logfiles

funannotate test -t all --cpus 8
#########################################################
Running `funannotate clean` unit testing: minimap2 mediated assembly duplications
Downloading: https://osf.io/8pjbe/download?version=1 Bytes: 252076
CMD: funannotate clean -i test.clean.fa -o test.exhaustive.fa --exhaustive
#########################################################
-----------------------------------------------
6 input contigs, 6 larger than 500 bp, N50 is 427,039 bp
Checking duplication of 6 contigs
-----------------------------------------------
scaffold_73 appears duplicated: 100% identity over 100% of the contig. contig length: 15153
scaffold_91 appears duplicated: 100% identity over 100% of the contig. contig length: 8858
scaffold_27 appears duplicated: 100% identity over 100% of the contig. contig length: 427039
-----------------------------------------------
6 input contigs; 6 larger than 500 bp; 3 duplicated; 3 written to file
#########################################################
SUCCESS: `funannotate clean` test complete.
#########################################################

#########################################################
Running `funannotate mask` unit testing: RepeatModeler --> RepeatMasker
Downloading: https://osf.io/hbryz/download?version=1 Bytes: 375687
CMD: funannotate mask -i test.fa -o test.masked.fa --cpus 8
#########################################################
-------------------------------------------------------
[05:07 PM]: OS: Debian GNU/Linux 10, 24 cores, ~ 264 GB RAM. Python: 3.8.6
[05:07 PM]: Running funanotate v1.8.3
[05:07 PM]: Soft-masking simple repeats with tantan
[05:07 PM]: Repeat soft-masking finished:
Masked genome: /home/rpetit3/test-genemark/test-mask_3801/test.masked.fa
num scaffolds: 2
assembly size: 1,216,048 bp
masked repeats: 50,965 bp (4.19%)
-------------------------------------------------------
#########################################################
SUCCESS: `funannotate mask` test complete.
#########################################################

#########################################################
Running `funannotate predict` unit testing
Downloading: https://osf.io/te2pf/download?version=1 Bytes: 1489808
CMD: funannotate predict -i test.softmasked.fa --protein_evidence protein.evidence.fasta -o annotate --augustus_species saccharomyces --cpus 8 --species Awesome testicus
#########################################################
-------------------------------------------------------
[05:07 PM]: OS: Debian GNU/Linux 10, 24 cores, ~ 264 GB RAM. Python: 3.8.6
[05:07 PM]: Running funannotate v1.8.3
[05:07 PM]: Skipping CodingQuarry as no --rna_bam passed
[05:07 PM]: Parsed training data, run ab-initio gene predictors as follows:
  Program      Training-Method
  augustus     pretrained
  genemark     selftraining
  glimmerhmm   busco
  snap         busco
[05:07 PM]: Loading genome assembly and parsing soft-masked repetitive sequences
[05:07 PM]: Genome loaded: 6 scaffolds; 3,776,588 bp; 19.75% repeats masked
[05:07 PM]: Mapping 1,065 proteins to genome using diamond and exonerate
[05:07 PM]: Found 1,784 preliminary alignments --> aligning with exonerate
[05:09 PM]: Exonerate finished: found 1,433 alignments
[05:09 PM]: Running GeneMark-ES on assembly
[05:09 PM]: GeneMark-ES failed: annotate/predict_misc/genemark/output/gmhmm.mod file missing, please check logfiles.
[05:09 PM]: GeneMark predictions failed. If you can run GeneMark outside of funannotate, then pass the results to --genemark_gtf.
[05:09 PM]: Running BUSCO to find conserved gene models for training ab-initio predictors
[05:27 PM]: 175 valid BUSCO predictions found, validating protein sequences
[05:28 PM]: 175 BUSCO predictions validated
[05:28 PM]: Running Augustus gene prediction using saccharomyces parameters
[05:33 PM]: 1,492 predictions from Augustus
[05:33 PM]: Pulling out high quality Augustus predictions
[05:33 PM]: Found 372 high quality predictions from Augustus (>90% exon evidence)
[05:33 PM]: Running SNAP gene prediction, using training data: annotate/predict_misc/busco.final.gff3
[05:34 PM]: 0 predictions from SNAP
[05:34 PM]: SNAP prediction failed, moving on without result
[05:34 PM]: Running GlimmerHMM gene prediction, using training data: annotate/predict_misc/busco.final.gff3
[05:38 PM]: 1,597 predictions from GlimmerHMM
[05:38 PM]: Summary of gene models passed to EVM (weights):
  Source         Weight   Count
  Augustus       1        1331
  Augustus HiQ   2        373
  GlimmerHMM     1        1597
  Total          -        3301
[05:38 PM]: EVM: partitioning input to ~ 35 genes per partition
[05:46 PM]: Converting to GFF3 and collecting all EVM results
[05:46 PM]: 1,727 total gene models from EVM
[05:46 PM]: Generating protein fasta files from 1,727 EVM models
[05:46 PM]: now filtering out bad gene models (< 50 aa in length, transposable elements, etc).
[05:46 PM]: Found 154 gene models to remove: 0 too short; 0 span gaps; 154 transposable elements
[05:46 PM]: 1,573 gene models remaining
[05:46 PM]: Predicting tRNAs
[05:47 PM]: 112 tRNAscan models are valid (non-overlapping)
[05:47 PM]: Generating GenBank tbl annotation file
[05:48 PM]: Converting to final Genbank format
[05:48 PM]: Collecting final annotation files for 1,685 total gene models
[05:48 PM]: Funannotate predict is finished, output files are in the annotate/predict_results folder
[05:48 PM]: Your next step might be functional annotation, suggested commands:
-------------------------------------------------------
Run InterProScan (Docker required):
funannotate iprscan -i annotate -m docker -c 8

Run antiSMASH:
funannotate remote -i annotate -m antismash -e youremail@server.edu

Annotate Genome:
funannotate annotate -i annotate --cpus 8 --sbt yourSBTfile.txt
-------------------------------------------------------

[05:48 PM]: Training parameters file saved: annotate/predict_results/saccharomyces.parameters.json
[05:48 PM]: Add species parameters to database:

  funannotate species -s saccharomyces -a annotate/predict_results/saccharomyces.parameters.json

#########################################################
SUCCESS: `funannotate predict` test complete.
#########################################################

#########################################################
Running `funannotate predict` BUSCO-mediated training unit testing
CMD: funannotate predict -i test.softmasked.fa --protein_evidence protein.evidence.fasta -o annotate --cpus 8 --species Awesome busco
#########################################################
-------------------------------------------------------
[05:48 PM]: OS: Debian GNU/Linux 10, 24 cores, ~ 264 GB RAM. Python: 3.8.6
[05:48 PM]: Running funannotate v1.8.3
[05:48 PM]: Skipping CodingQuarry as no --rna_bam passed
[05:48 PM]: Parsed training data, run ab-initio gene predictors as follows:
  Program      Training-Method
  augustus     busco
  genemark     selftraining
  glimmerhmm   busco
  snap         busco
[05:48 PM]: Loading genome assembly and parsing soft-masked repetitive sequences
[05:48 PM]: Genome loaded: 6 scaffolds; 3,776,588 bp; 19.75% repeats masked
[05:48 PM]: Mapping 1,065 proteins to genome using diamond and exonerate
[05:48 PM]: Found 1,784 preliminary alignments --> aligning with exonerate
[05:49 PM]: Exonerate finished: found 1,435 alignments
[05:49 PM]: Running GeneMark-ES on assembly
[05:49 PM]: GeneMark-ES failed: annotate/predict_misc/genemark/output/gmhmm.mod file missing, please check logfiles.
[05:49 PM]: GeneMark predictions failed. If you can run GeneMark outside of funannotate, then pass the results to --genemark_gtf.
[05:49 PM]: Running BUSCO to find conserved gene models for training ab-initio predictors
[06:02 PM]: 175 valid BUSCO predictions found, validating protein sequences
[06:02 PM]: 175 BUSCO predictions validated
[06:02 PM]: Not enough gene models 175 to train Augustus (200 required), exiting
#########################################################
Traceback (most recent call last):
  File "/home/rpetit3/miniconda3/envs/funannotate/bin/funannotate", line 713, in <module>
    main()
  File "/home/rpetit3/miniconda3/envs/funannotate/bin/funannotate", line 703, in main
    mod.main(arguments)
  File "/home/rpetit3/miniconda3/envs/funannotate/lib/python3.8/site-packages/funannotate/test.py", line 406, in main
    runBuscoTest(args)
  File "/home/rpetit3/miniconda3/envs/funannotate/lib/python3.8/site-packages/funannotate/test.py", line 199, in runBuscoTest
    assert 1500 <= countGFFgenes(os.path.join(
  File "/home/rpetit3/miniconda3/envs/funannotate/lib/python3.8/site-packages/funannotate/test.py", line 44, in countGFFgenes
    with open(input, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'test-busco_3801/annotate/predict_results/Awesome_busco.gff3'

funannotate check output

OS/Install Information OS info:

uname -a
Linux loma 4.19.0-9-amd64 #1 SMP Debian 4.19.118-2+deb10u1 (2020-06-07) x86_64 GNU/Linux

Versions info

funannotate check --show-versions
-------------------------------------------------------
Checking dependencies for 1.8.3
-------------------------------------------------------
You are running Python v 3.8.6. Now checking python packages...
biopython: 1.78
goatools: 1.0.15
matplotlib: 3.3.4
natsort: 7.1.1
numpy: 1.20.1
pandas: 1.2.2
psutil: 5.8.0
requests: 2.25.1
scikit-learn: 0.24.1
scipy: 1.6.0
seaborn: 0.11.1
All 11 python packages installed

You are running Perl v b'5.026002'. Now checking perl modules...
Bio::Perl: 1.007002
Carp: 1.38
Clone: 0.42
DBD::SQLite: 1.64
DBD::mysql: 4.046
DBI: 1.642
DB_File: 1.855
Data::Dumper: 2.173
File::Basename: 2.85
File::Which: 1.23
Getopt::Long: 2.5
Hash::Merge: 0.300
JSON: 4.02
LWP::UserAgent: 6.39
Logger::Simple: 2.0
POSIX: 1.76
Parallel::ForkManager: 2.02
Pod::Usage: 1.69
Scalar::Util::Numeric: 0.40
Storable: 3.15
Text::Soundex: 3.05
Thread::Queue: 3.12
Tie::File: 1.02
URI::Escape: 3.31
YAML: 1.29
threads: 2.15
threads::shared: 1.56
All 27 Perl modules installed

Checking Environmental Variables...
$FUNANNOTATE_DB=/home/rpetit3/funannotate_db
$PASAHOME=/home/rpetit3/miniconda3/envs/funannotate/opt/pasa-2.4.1
$TRINITY_HOME=/home/rpetit3/miniconda3/envs/funannotate/opt/trinity-2.8.5
$EVM_HOME=/home/rpetit3/miniconda3/envs/funannotate/opt/evidencemodeler-1.1.1
$AUGUSTUS_CONFIG_PATH=/home/rpetit3/miniconda3/envs/funannotate/config/
$GENEMARK_PATH=/home/rpetit3/miniconda3/envs/funannotate/share/genemark
All 6 environmental variables are set
-------------------------------------------------------
Checking external dependencies...
PASA: 2.4.1
CodingQuarry: 2.0
Trinity: 2.8.5
augustus: 3.4.0
bamtools: bamtools 2.5.1
bedtools: bedtools v2.30.0
blat: BLAT v36
diamond: 2.0.7
ete3: 3.1.2
exonerate: exonerate 2.4.0
fasta: no way to determine
glimmerhmm: 3.0.4
gmap: 2017-11-15
hisat2: 2.2.1
hmmscan: HMMER 3.3.2 (Nov 2020)
hmmsearch: HMMER 3.3.2 (Nov 2020)
java: 11.0.8-internal
kallisto: 0.46.1
mafft: v7.475 (2020/Nov/23)
makeblastdb: makeblastdb 2.2.31+
minimap2: 2.17-r941
proteinortho: 6.0.28
pslCDnaFilter: no way to determine
salmon: salmon 0.14.1
samtools: samtools 1.10
snap: 2006-07-28
stringtie: 2.1.4
tRNAscan-SE: 2.0.7 (Oct 2020)
tantan: tantan 26
tbl2asn: no way to determine, likely 25.X
tblastn: tblastn 2.2.31+
trimal: trimAl v1.4.rev15 build[2013-12-17]
trimmomatic: 0.39
        ERROR: emapper.py not installed
        ERROR: gmes_petap.pl not installed
        ERROR: signalp not installed

nextgenusfs commented 3 years ago

Downgrade Augustus. In the newest version they changed something, and its output is now failing in BUSCO, so anything < 3.4 should work.
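A rough sketch of how one might check an installed version against that cutoff, assuming the version is reported in the form shown by `funannotate check --show-versions` above (for example `augustus: 3.4.0`). The helper names are hypothetical, and the 3.4 cutoff is taken only from the advice in this thread.

```python
import re


def parse_version(text):
    """Extract the first dotted number, e.g. 'augustus: 3.4.0' -> (3, 4, 0)."""
    match = re.search(r"(\d+(?:\.\d+)*)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def augustus_needs_downgrade(version_line):
    """True if the reported Augustus version is 3.4 or newer."""
    # Compare only major.minor, since the advice is "anything < 3.4".
    return parse_version(version_line)[:2] >= (3, 4)


print(augustus_needs_downgrade("augustus: 3.4.0"))
print(augustus_needs_downgrade("augustus: 3.3.3"))
```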

rpetit3 commented 3 years ago

Great! I'll give it a go. Want me to update the Conda recipe to pin Augustus to be <3.4?

rpetit3 commented 3 years ago

The Augustus downgrade worked. I got another error, but I think it's related to GeneMark. I'm getting a segmentation fault from GeneMark, and I've reached out to the GaTech folks to hopefully get a working binary.

Does this look like an error that a broken GeneMark binary would cause?

#########################################################
SUCCESS: `funannotate predict` BUSCO-mediated training test complete.
#########################################################
Now running predict using all pre-trained ab-initio predictors
CMD: funannotate predict -i test.softmasked.fa --protein_evidence protein.evidence.fasta -o annotate2 --cpus 8 --species Awesome busco -p annotate/predict_results/awesome_busco.parameters.json
#########################################################
-------------------------------------------------------
[01:22 AM]: OS: Debian GNU/Linux 10, 24 cores, ~ 264 GB RAM. Python: 3.8.6
[01:22 AM]: Running funannotate v1.8.3
[01:22 AM]: Ab initio training parameters file passed: annotate/predict_results/awesome_busco.parameters.json
Traceback (most recent call last):
  File "/home/rpetit3/miniconda3/envs/funannotate/bin/funannotate", line 713, in <module>
    main()
  File "/home/rpetit3/miniconda3/envs/funannotate/bin/funannotate", line 703, in main
    mod.main(arguments)
  File "/home/rpetit3/miniconda3/envs/funannotate/lib/python3.8/site-packages/funannotate/predict.py", line 499, in main
    shutil.copyfile(trainingData['genemark'][0]['path'], os.path.join(
  File "/home/rpetit3/miniconda3/envs/funannotate/lib/python3.8/shutil.py", line 261, in copyfile
    with open(src, 'rb') as fsrc, open(dst, 'wb') as fdst:
FileNotFoundError: [Errno 2] No such file or directory: '/home/rpetit3/test-genemark/test-busco_4169/annotate/predict_misc/ab_initio_parameters/awesome_busco.genemark.mod'
#########################################################
Traceback (most recent call last):
  File "/home/rpetit3/miniconda3/envs/funannotate/bin/funannotate", line 713, in <module>
    main()
  File "/home/rpetit3/miniconda3/envs/funannotate/bin/funannotate", line 703, in main
    mod.main(arguments)
  File "/home/rpetit3/miniconda3/envs/funannotate/lib/python3.8/site-packages/funannotate/test.py", line 406, in main
    runBuscoTest(args)
  File "/home/rpetit3/miniconda3/envs/funannotate/lib/python3.8/site-packages/funannotate/test.py", line 217, in runBuscoTest
    assert 1500 <= countGFFgenes(os.path.join(
  File "/home/rpetit3/miniconda3/envs/funannotate/lib/python3.8/site-packages/funannotate/test.py", line 44, in countGFFgenes
    with open(input, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'test-busco_4169/annotate2/predict_results/Awesome_busco.gff3'

nextgenusfs commented 3 years ago

If you could update the conda recipe that would be awesome.

The second error is possible if GeneMark failed in the first run: the JSON parameter file could then be pointing to a None or empty file.
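One way to confirm this before rerunning is to inspect the parameters file directly. The sketch below is not funannotate code. It assumes the JSON maps each predictor name to a list of entries with a `path` key, a layout inferred only from the `trainingData['genemark'][0]['path']` line in the traceback above. `missing_training_files` is a hypothetical helper, and relative paths are resolved against the current working directory.

```python
import json
import os


def missing_training_files(params_json):
    """Return (predictor, path) pairs whose file is None, absent, or empty."""
    with open(params_json) as handle:
        params = json.load(handle)
    problems = []
    for predictor, entries in params.items():
        # Assumed layout: {"genemark": [{"path": ...}], ...}; skip anything else.
        if not isinstance(entries, list):
            continue
        for entry in entries:
            if not isinstance(entry, dict) or "path" not in entry:
                continue
            path = entry["path"]
            if not path or not os.path.isfile(path) or os.path.getsize(path) == 0:
                problems.append((predictor, path))
    return problems
```

For the run above, this could be called as `missing_training_files("annotate/predict_results/awesome_busco.parameters.json")`. A `genemark` entry in the result would be consistent with GeneMark having failed during training.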

rpetit3 commented 3 years ago

I'll update the recipe.

And yeah, I agree it's probably GeneMark.

I'll close this for now, thanks for the help Jon!