empty training set: train/predict_misc/final_training_models.gff3

Lordhooze commented 4 years ago

Hello, I have a small issue. I want to predict genes for a duck genome, but the "final_training_models.gff3" file seems to be missing: SNAP and GlimmerHMM cannot find it. Can you help me?

The following is the output of the dependency check:

$ funannotate check --show-versions

Checking dependencies for 1.7.4

You are running Python v 2.7.15. Now checking python packages...
biopython: 1.76
goatools: 1.0.3
matplotlib: 2.2.4
natsort: 6.2.0
numpy: 1.10.4
pandas: 0.24.2
psutil: 5.7.0
requests: 2.23.0
scikit-learn: 0.20.3
scipy: 1.2.1
seaborn: 0.9.0
All 11 python packages installed

You are running Perl v 5.026002. Now checking perl modules...
Bio::Perl: 1.007002
Carp: 1.38
Clone: 0.42
DBD::SQLite: 1.64
DBD::mysql: 4.046
DBI: 1.642
DB_File: 1.852
Data::Dumper: 2.173
File::Basename: 2.85
File::Which: 1.23
Getopt::Long: 2.5
Hash::Merge: 0.300
JSON: 4.02
LWP::UserAgent: 6.39
Logger::Simple: 2.0
POSIX: 1.76
Parallel::ForkManager: 2.02
Pod::Usage: 1.69
Scalar::Util::Numeric: 0.40
Storable: 3.15
Text::Soundex: 3.05
Thread::Queue: 3.12
Tie::File: 1.02
URI::Escape: 3.31
YAML: 1.29
threads: 2.15
threads::shared: 1.56
All 27 Perl modules installed

Checking Environmental Variables...
$FUNANNOTATE_DB=/home/hujiaxiang/miniconda_install/envs/funannotate/funannotate_db
$PASAHOME=/home/hujiaxiang/miniconda_install/envs/funannotate/opt/pasa-2.4.1
$TRINITY_HOME=/home/hujiaxiang/miniconda_install/envs/funannotate/opt/trinity-2.8.5
$EVM_HOME=/home/hujiaxiang/miniconda_install/envs/funannotate/opt/evidencemodeler-1.1.1
$AUGUSTUS_CONFIG_PATH=/home/hujiaxiang/miniconda_install/envs/funannotate/config/
$GENEMARK_PATH=/home/hujiaxiang/miniconda_install/envs/funannotate/external_app/gmes_linux_64
All 6 environmental variables are set

Checking external dependencies...
PASA: 2.4.1
CodingQuarry: 2.0
Trinity: 2.8.5
augustus: 3.3.3
bamtools: bamtools 2.5.1
bedtools: bedtools v2.29.2
blat: BLAT v36
diamond: 0.9.21
emapper.py: 2.0.1
ete3: 3.1.1
exonerate: exonerate 2.4.0
fasta: no way to determine
glimmerhmm: 3.0.4
gmap: 2017-11-15
gmes_petap.pl: 4.57_lic
hisat2: 2.2.0
hmmscan: HMMER 3.3 (Nov 2019)
hmmsearch: HMMER 3.3 (Nov 2019)
java: 11.0.1-internal
kallisto: 0.46.2
mafft: v7.464 (2020/Apr/21)
makeblastdb: makeblastdb 2.2.31+
minimap2: 2.17-r941
proteinortho: 6.0.15
pslCDnaFilter: no way to determine
salmon: salmon 0.14.1
samtools: samtools 1.9
signalp: 4.1
snap: 2006-07-28
stringtie: 2.1.1
tRNAscan-SE: 2.0.5 (October 2019)
tantan: tantan 13
tbl2asn: no way to determine, likely 25.X
tblastn: tblastn 2.2.31+
trimal: trimAl v1.4.rev15 build[2013-12-17]
trimmomatic: 0.39
All 36 external dependencies are installed

The following is the command used for prediction:

$ funannotate predict -i mask.fa -o train -s "chicken" --protein_evidence ../uniprot.fa --transcript_evidence ../trinity.fa --cpus 50 --max_intronlen 4000 --organism other --busco_db aves

[07:17 PM]: OS: linux2, 96 cores, ~ 528 GB RAM. Python: 2.7.15
[07:17 PM]: Running funannotate v1.7.4
[07:17 PM]: Found training files, will re-use these files:
  --rna_bam train/training/funannotate_train.coordSorted.bam
  --pasa_gff train/training/funannotate_train.pasa.gff3
  --transcript_alignments train/training/funannotate_train.transcripts.gff3
[07:17 PM]: Parsed training data, run ab-initio gene predictors as follows:
  Program       Training-Method
  augustus      pretrained
  codingquarry  rna-bam
  genemark      selftraining
  glimmerhmm    pasa
  snap          pasa
[07:17 PM]: Loading genome assembly and parsing soft-masked repetitive sequences
[07:17 PM]: Genome loaded: 2 scaffolds; 21,696,352 bp; 3.64% repeats masked
[07:17 PM]: Parsed 3,424 transcript alignments from: train/training/funannotate_train.transcripts.gff3
[07:17 PM]: Aligning 71,097 unique transcripts [not found in exising alignments] with minimap2
[07:17 PM]: Mapped 0 of these transcripts to the genome
[07:17 PM]: Creating transcript EVM alignments and Augustus transcripts hintsfile
[07:17 PM]: Existing RNA-seq BAM hints found: train/predict_misc/hints.BAM.gff
[07:17 PM]: Existing protein alignments found: train/predict_misc/protein_alignments.gff3
[07:17 PM]: Existing GeneMark annotation found: train/predict_misc/genemark.gff
[07:17 PM]: 1,044 predictions from GeneMark
[07:17 PM]: Existing Augustus annotations found: train/predict_misc/augustus.gff3
[07:17 PM]: Pulling out high quality Augustus predictions
[07:17 PM]: Found 323 high quality predictions from Augustus (>90% exon evidence)
[07:17 PM]: Using existing CodingQuarry results: train/predict_misc/coding_quarry.gff3
[07:17 PM]: 1,149 predictions from CodingQuarry
[07:17 PM]: Snap training failed, empty training set: train/predict_misc/final_training_models.gff3
[07:17 PM]: GlimmerHMM training failed, empty training set: train/predict_misc/final_training_models.gff3
[07:17 PM]: Summary of gene models passed to EVM (weights):
  Source        Weight  Count
  Augustus      1       446
  Augustus HiQ  2       323
  CodingQuarry  2       1149
  GeneMark      1       1044
  pasa          6       671
  Total         -       3633
[07:18 PM]: Running EVM commands with 49 CPUs
[07:19 PM]: Converting to GFF3 and collecting all EVM results
[07:19 PM]: 1,419 total gene models from EVM
[07:19 PM]: Generating protein fasta files from 1,419 EVM models
[07:19 PM]: now filtering out bad gene models (< 50 aa in length, transposable elements, etc).
[07:19 PM]: Found 56 gene models to remove: 3 too short; 1 span gaps; 68 transposable elements
[07:19 PM]: 1,363 gene models remaining
[07:19 PM]: Predicting tRNAs
[07:19 PM]: 9 tRNAscan models are valid (non-overlapping)
[07:19 PM]: Generating GenBank tbl annotation file
[07:19 PM]: Converting to final Genbank format
[07:20 PM]: Collecting final annotation files for 1,372 total gene models
[07:20 PM]: Funannotate predict is finished, output files are in the train/predict_results folder
[07:20 PM]: Your next step to capture UTRs and update annotation using PASA:

funannotate update -i train --cpus 50

[07:20 PM]: Training parameters file saved: train/predict_results/chicken.parameters.json
[07:20 PM]: Add species parameters to database:

funannotate species -s chicken -a train/predict_results/chicken.parameters.json

Thanks in advance.

qihualiang commented 4 years ago

I came across the same issue when I added the --pasa_gff parameter to compare annotation results from different inputs. SNAP and GlimmerHMM could not be trained to predict any genes when --pasa_gff was provided, even though EVM still ran and obtained gene models from the other predictors. With exactly the same script, just without --pasa_gff, both SNAP and GlimmerHMM worked and contributed to EVM. So I assume this is not an installation or dependency issue.

Log file:
[06/11/20 04:07:28]: Snap training failed, empty training set: results/predict_misc/final_training_models.gff3
[06/11/20 04:07:28]: snap failed removing from training parameters
[06/11/20 04:07:28]: GlimmerHMM training failed, empty training set: results/predict_misc/final_training_models.gff3
[06/11/20 04:07:28]: GlimmerHMM failed, removing from training parameters
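For anyone who wants to confirm what "empty training set" means on disk, here is a minimal sketch that counts feature rows in a GFF3 file. It is a standalone diagnostic, not part of funannotate; the function name `count_gff3_features` is made up for illustration, and it only assumes the standard 9-column, tab-separated GFF3 layout.

```python
import os


def count_gff3_features(path, feature_type="gene"):
    """Count rows of one feature type in a GFF3 file; return 0 if the file is missing."""
    if not os.path.isfile(path):
        return 0
    count = 0
    with open(path) as handle:
        for line in handle:
            # An embedded ##FASTA section ends the annotation rows.
            if line.startswith("##FASTA"):
                break
            # Skip comments, directives and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            cols = line.rstrip("\n").split("\t")
            # Column 3 (index 2) holds the feature type.
            if len(cols) >= 9 and cols[2] == feature_type:
                count += 1
    return count


if __name__ == "__main__":
    # Path taken from the log above; adjust the output folder to your own run.
    path = "results/predict_misc/final_training_models.gff3"
    print(path, "gene rows:", count_gff3_features(path))
```

A count of 0 (or a missing file) would match the "empty training set" messages in the log.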

I traced these error messages in predict.py, and here is what I found: FinalTrainingModels is generated from BUSCO. But with --pasa_gff, the RunModes for SNAP/GlimmerHMM are not BUSCO, so final_training_models.gff3 is never produced and SNAP/GlimmerHMM have nothing to train on.
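The gap described above can be illustrated with a small sketch. This is not the code in predict.py: the function names and the dictionary of run modes are hypothetical, and the "fixed" variant is only an assumption about what a correction might look like, not the actual patch. The run modes are the ones printed in the first log in this thread.

```python
# Tools that train from the shared final_training_models.gff3 file.
TRAINED_FROM_MODELS = ("snap", "glimmerhmm")


def writes_training_models_reported(run_modes):
    """Reported behavior: the file is only produced when a tool's mode is 'busco'."""
    return any(run_modes.get(tool) == "busco" for tool in TRAINED_FROM_MODELS)


def writes_training_models_assumed_fix(run_modes):
    """Assumed fix: also produce the file when the models come from PASA."""
    return any(
        run_modes.get(tool) in ("busco", "pasa") for tool in TRAINED_FROM_MODELS
    )


if __name__ == "__main__":
    # Run modes as printed in the log when --pasa_gff is in effect.
    run_modes = {
        "augustus": "pretrained",
        "codingquarry": "rna-bam",
        "genemark": "selftraining",
        "glimmerhmm": "pasa",
        "snap": "pasa",
    }
    print("reported logic writes file:", writes_training_models_reported(run_modes))
    print("assumed fix writes file:", writes_training_models_assumed_fix(run_modes))
```

With the PASA run modes, the reported logic never writes the training file, which would explain why both trainers see an empty set while the other predictors still feed EVM.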

@nextgenusfs Hi Jon, would you check to see how I should troubleshoot this? Thanks

nextgenusfs commented 4 years ago

Thanks for reporting -- seems like a logic bug somewhere with PASA/BUSCO for training models.

nextgenusfs commented 4 years ago

I think this is fixed. You can try the update by running python -m pip install git+https://github.com/nextgenusfs/funannotate.git in your conda environment.
