nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

Genemark_gtf and SNAP query #680

Open Imogen-D opened 2 years ago

Imogen-D commented 2 years ago

Hi, I'm passing a genemark_gtf to funannotate predict because using genemark_path isn't working. After several rounds of trial and error I'm still not sure why: gmes_petap.pl works as a standalone script, and I've read through the previous issues about it. How do I know whether this genemark_gtf file has been used?

In addition, SNAP shows multiple errors (exon out_of_bounds). From other documentation, this appears to happen when there are multiple different input files, but that is not the case here. SNAP found 3,996 genes and 450 of them have the error; is this a major concern?

Here is my output. I have removed most of the SNAP errors for clarity.

[12/13/21 17:10:00]: /venv/bin/funannotate predict --cpus 20 -i /media/bigvol/idumville/05funannotate/01softmaskgenomes/1111finalwomito.fasta.masked -o /media/bigvol/idumville/05funannotate/03funannotatewgm/1111_funannotate_imogen_01_gm -s Pseudogymnoascus destructans --genemark_gtf genemark.gtf --force

[12/13/21 17:10:00]: OS: Debian GNU/Linux 10, 96 cores, ~ 1057 GB RAM. Python: 3.8.12
[12/13/21 17:10:00]: Running funannotate v1.8.10
[12/13/21 17:10:00]: GeneMark not found and $GENEMARK_PATH environmental variable missing. Will skip GeneMark ab-initio prediction.
[12/13/21 17:10:00]: {'augustus': 1, 'hiq': 2, 'genemark': 0, 'pasa': 6, 'codingquarry': 0, 'snap': 1, 'glimmerhmm': 1, 'proteins': 1, 'transcripts': 1}
[12/13/21 17:10:00]: Skipping CodingQuarry as no --rna_bam passed
[12/13/21 17:10:00]: {'augustus': 'busco', 'snap': 'busco', 'glimmerhmm': 'busco'}
[12/13/21 17:10:00]: Parsed training data, run ab-initio gene predictors as follows:
[12/13/21 17:10:01]: {'augustus': 1, 'hiq': 2, 'genemark': 0, 'pasa': 6, 'codingquarry': 0, 'snap': 1, 'glimmerhmm': 1, 'proteins': 1, 'transcripts': 1}
[12/13/21 17:10:08]: Loading genome assembly and parsing soft-masked repetitive sequences
[12/13/21 17:10:09]: Genome loaded: 29 scaffolds; 34,598,246 bp; 39.64% repeats masked
[12/13/21 18:14:42]: join_mult_hints.pl
[12/13/21 18:14:42]: /venv/lib/python3.8/site-packages/funannotate/aux_scripts/genemark_gtf2gff3.pl genemark.gtf
[12/13/21 18:14:43]: perl /venv/opt/evidencemodeler-1.1.1/EvmUtils/misc/augustus_GFF3_to_EVM_GFF3.pl /media/bigvol/idumville/05funannotate/03funannotatewgm/1111_funannotate_imogen_01_gm/predict_misc/genemark.gff
[12/13/21 18:14:44]: Running BUSCO to find conserved gene models for training ab-initio predictors
[12/13/21 18:14:44]: /venv/bin/python /venv/lib/python3.8/site-packages/funannotate/aux_scripts/funannotate-BUSCO2.py -i /media/bigvol/idumville/05funannotate/03funannotatewgm/1111_funannotate_imogen_01_gm/predict_misc/genome.softmasked.fa -m genome --lineage /opt/databases/dikarya -o pseudogymnoascus_destructans -c 20 --species anidulans -f --local_augustus /media/bigvol/idumville/05funannotate/03funannotatewgm/1111_funannotate_imogen_01_gm/predict_misc/ab_initio_parameters/augustus
[12/13/21 18:20:15]: 1,281 valid BUSCO predictions found, validating protein sequences
[12/13/21 18:21:11]: 1,279 BUSCO predictions validated
[12/13/21 18:21:11]: Training Augustus using BUSCO gene models
[12/13/21 18:21:11]: gff2gbSmallDNA.pl /media/bigvol/idumville/05funannotate/03funannotatewgm/1111_funannotate_imogen_01_gm/predict_misc/busco.final.gff3 /media/bigvol/idumville/05funannotate/03funannotatewgm/1111_funannotate_imogen_01_gm/predict_misc/genome.softmasked.fa 600 /media/bigvol/idumville/05funannotate/03funannotatewgm/1111_funannotate_imogen_01_gm/predict_misc/augustus.training.busco.gb
[12/13/21 18:21:26]: Augustus initial training results:
[12/13/21 18:21:26]: Running Augustus gene prediction using pseudogymnoascus_destructans parameters
[12/13/21 18:23:20]: perl /venv/opt/evidencemodeler-1.1.1/EvmUtils/misc/augustus_GFF3_to_EVM_GFF3.pl /media/bigvol/idumville/05funannotate/03funannotatewgm/1111_funannotate_imogen_01_gm/predict_misc/augustus.gff3
[12/13/21 18:23:20]: Pulling out high quality Augustus predictions
[12/13/21 18:23:21]: Found 43 high quality predictions from Augustus (>90% exon evidence)
[12/13/21 18:23:21]: Running SNAP gene prediction, using training data: /media/bigvol/idumville/05funannotate/03funannotatewgm/1111_funannotate_imogen_01_gm/predict_misc/busco.final.gff3
[12/13/21 18:23:21]: 1279 gene models to train snap on 22 scaffolds
[12/13/21 18:23:22]: fathom /media/bigvol/idumville/05funannotate/03funannotatewgm/1111_funannotate_imogen_01_gm/predict_misc/snap.training.zff /media/bigvol/idumville/05funannotate/03funannotatewgm/1111_funannotate_imogen_01_gm/predict_misc/snap-training.scaffolds.fasta -categorize 1000 -min-intron 10 -max-intron 3000
[12/13/21 18:23:23]: gene589.t1 1 1 3 - errors(3): exon-3:out_of_bounds exon-2:out_of_bounds exon-1:out_of_bounds
gene718.t1 1 1 4 - errors(4): exon-4:out_of_bounds exon-3:out_of_bounds exon-2:out_of_bounds exon-1:out_of_bounds
gene1142.t1 1 1 1 + errors(1): exon-1:out_of_bounds

...

gene1192.t1 1 1 5 + errors(5): exon-1:out_of_bounds exon-2:out_of_bounds exon-3:out_of_bounds exon-4:out_of_bounds exon-5:out_of_bounds
gene868.t1 1 1 3 + errors(3): exon-1:out_of_bounds exon-2:out_of_bounds exon-3:out_of_bounds

[12/13/21 18:23:23]: fathom uni.ann uni.dna -export 1000 -plus
[12/13/21 18:23:23]: forge export.ann export.dna
[12/13/21 18:23:24]: perl /usr/bin/hmm-assembler.pl snap-trained /media/bigvol/idumville/05funannotate/03funannotatewgm/1111_funannotate_imogen_01_gm/predict_misc/snaptrain
[12/13/21 18:23:24]: snap /media/bigvol/idumville/05funannotate/03funannotatewgm/1111_funannotate_imogen_01_gm/predict_misc/snap-trained.hmm /media/bigvol/idumville/05funannotate/03funannotatewgm/1111_funannotate_imogen_01_gm/predict_misc/genome.softmasked.fa
[12/13/21 18:25:21]: 3,996 predictions from SNAP
[12/13/21 18:25:21]: Running GlimmerHMM gene prediction, using training data: /media/bigvol/idumville/05funannotate/03funannotatewgm/1111_funannotate_imogen_01_gm/predict_misc/busco.final.gff3
[12/13/21 18:25:22]: trainGlimmerHMM /media/bigvol/idumville/05funannotate/03funannotatewgm/1111_funannotate_imogen_01_gm/predict_misc/genome.softmasked.fa /media/bigvol/idumville/05funannotate/03funannotatewgm/1111_funannotate_imogen_01_gm/predict_misc/glimmer.exons -d /media/bigvol/idumville/05funannotate/03funannotatewgm/1111_funannotate_imogen_01_gm/predict_misc/glimmerhmm
[12/13/21 18:28:03]: perl /venv/bin/glimmhmm.pl /venv/bin/glimmerhmm /media/bigvol/idumville/05funannotate/03funannotatewgm/1111_funannotate_imogen_01_gm/predict_misc/genome.softmasked.fa /media/bigvol/idumville/05funannotate/03funannotatewgm/1111_funannotate_imogen_01_gm/predict_misc/glimmerhmm -g
[12/13/21 18:31:12]: 12,111 predictions from GlimmerHMM
[12/13/21 18:31:12]: Prediction sources: ['Augustus', 'HiQ', 'GeneMark', 'GlimmerHMM', 'snap']
[12/13/21 18:31:13]: Summary of gene models: {'total': 30708, 'Augustus': 6351, 'HiQ': 43, 'GeneMark': 8207, 'GlimmerHMM': 12111, 'snap': 3996}
[12/13/21 18:31:13]: EVM Weights: {'Augustus': 1, 'HiQ': 2, 'GeneMark': 0, 'GlimmerHMM': 1, 'snap': 1, 'proteins': 1}
[12/13/21 18:31:13]: Summary of gene models passed to EVM (weights):
[12/13/21 18:31:13]: Launching EVM via funannotate-runEVM.py
[12/13/21 18:31:13]: /venv/bin/python /venv/lib/python3.8/site-packages/funannotate/aux_scripts/funannotate-runEVM.py -w /media/bigvol/idumville/05funannotate/03funannotatewgm/1111_funannotate_imogen_01_gm/predict_misc/weights.evm.txt -c 20 -g /media/bigvol/idumville/05funannotate/03funannotatewgm/1111_funannotate_imogen_01_gm/predict_misc/gene_predictions.gff3 -d /media/bigvol/idumville/05funannotate/03funannotatewgm/1111_funannotate_imogen_01_gm/predict_misc/EVM -f /media/bigvol/idumville/05funannotate/03funannotatewgm/1111_funannotate_imogen_01_gm/predict_misc/genome.softmasked.fa -l /media/bigvol/idumville/05funannotate/03funannotatewgm/1111_funannotate_imogen_01_gm/logfiles/funannotate-EVM.log -m 10 -i 1500 -o /media/bigvol/idumville/05funannotate/03funannotatewgm/1111_funannotate_imogen_01_gm/predict_misc/evm.round1.gff3 --EVM_HOME /venv/opt/evidencemodeler-1.1.1 -p /media/bigvol/idumville/05funannotate/03funannotatewgm/1111_funannotate_imogen_01_gm/predict_misc/protein_alignments.gff3
[12/13/21 18:33:46]: 7,331 total gene models from EVM
[12/13/21 18:33:46]: Generating protein fasta files from 7,331 EVM models
[12/13/21 18:33:49]: now filtering out bad gene models (< 50 aa in length, transposable elements, etc).
[12/13/21 18:33:49]: diamond blastp --sensitive --query /media/bigvol/idumville/05funannotate/03funannotatewgm/1111_funannotate_imogen_01_gm/predict_misc/evm.round1.proteins.fa --threads 20 --out /media/bigvol/idumville/05funannotate/03funannotatewgm/1111_funannotate_imogen_01_gm/predict_misc/repeats.xml --db /opt/databases/repeats.dmnd --evalue 1e-10 --max-target-seqs 1 --outfmt 5
[12/13/21 18:33:52]: bedtools intersect -sorted -f 0.9 -a /media/bigvol/idumville/05funannotate/03funannotatewgm/1111_funannotate_imogen_01_gm/predict_misc/evm.round1.gff3.sorted.gff -b /media/bigvol/idumville/05funannotate/03funannotatewgm/1111_funannotate_imogen_01_gm/predict_misc/repeatmasker.bed.sorted.bed
[12/13/21 18:33:52]: Found 328 gene models to remove: 0 too short; 0 span gaps; 328 transposable elements
[12/13/21 18:33:52]: 7,003 gene models remaining
[12/13/21 18:33:52]: Predicting tRNAs
[12/13/21 18:33:52]: tRNAscan-SE -o /media/bigvol/idumville/05funannotate/03funannotatewgm/1111_funannotate_imogen_01_gm/predict_misc/tRNAscan.out --thread 20 /media/bigvol/idumville/05funannotate/03funannotatewgm/1111_funannotate_imogen_01_gm/predict_misc/genome.softmasked.fa
[12/13/21 18:34:35]: Status: Phase I: Searching for tRNAs with HMM-enabled Infernal
Status: Phase II: Infernal verification of candidate tRNAs detected with first-pass scan

[12/13/21 18:34:35]: 
tRNAscan-SE v.2.0.9 (July 2021) - scan sequences for transfer RNAs
Copyright (C) 2020 Patricia Chan and Todd Lowe
                   University of California Santa Cruz
Freely distributed under the GNU General Public License (GPLv3)

------------------------------------------------------------
Sequence file(s) to search:        /media/bigvol/idumville/05funannotate/03funannotatewgm/1111_funannotate_imogen_01_gm/predict_misc/genome.softmasked.fa
Search Mode:                       Eukaryotic
Results written to:                /media/bigvol/idumville/05funannotate/03funannotatewgm/1111_funannotate_imogen_01_gm/predict_misc/tRNAscan.out
Output format:                     Tabular
Searching with:                    Infernal First Pass->Infernal
Isotype-specific model scan:       Yes
Covariance model:                  /venv/lib/tRNAscan-SE/models/TRNAinf-euk.cm
                                   /venv/lib/tRNAscan-SE/models/TRNAinf-euk-SeC.cm
Infernal first pass cutoff score:  10

Temporary directory:               /tmp
------------------------------------------------------------

[12/13/21 18:34:35]: bedtools intersect -sorted -v -a /media/bigvol/idumville/05funannotate/03funannotatewgm/1111_funannotate_imogen_01_gm/predict_misc/trnascan.gff3.sorted.gff3 -b /media/bigvol/idumville/05funannotate/03funannotatewgm/1111_funannotate_imogen_01_gm/predict_misc/evm.cleaned.gff3.sorted.gff3 /media/bigvol/idumville/05funannotate/03funannotatewgm/1111_funannotate_imogen_01_gm/predict_misc/assembly-gaps.bed.sorted.gff3
[12/13/21 18:34:36]: 120 tRNAscan models are valid (non-overlapping)
[12/13/21 18:34:36]: Generating GenBank tbl annotation file
[12/13/21 18:34:49]: Collecting final annotation files for 7,123 total gene models
[12/13/21 18:34:49]: Converting to final Genbank format
[12/13/21 18:34:49]: /venv/bin/python /venv/lib/python3.8/site-packages/funannotate/aux_scripts/tbl2asn_parallel.py -i /media/bigvol/idumville/05funannotate/03funannotatewgm/1111_funannotate_imogen_01_gm/predict_misc/tbl2asn/genome.tbl -f /media/bigvol/idumville/05funannotate/03funannotatewgm/1111_funannotate_imogen_01_gm/predict_misc/genome.softmasked.fa -o /media/bigvol/idumville/05funannotate/03funannotatewgm/1111_funannotate_imogen_01_gm/predict_misc/tbl2asn --sbt /venv/lib/python3.8/site-packages/funannotate/config/test.sbt -d /media/bigvol/idumville/05funannotate/03funannotatewgm/1111_funannotate_imogen_01_gm/predict_results/Pseudogymnoascus_destructans.discrepency.report.txt -s Pseudogymnoascus destructans -t -l paired-ends -v 1 -c 20
[12/13/21 18:35:11]: Funannotate predict is finished, output files are in the /media/bigvol/idumville/05funannotate/03funannotatewgm/1111_funannotate_imogen_01_gm/predict_results folder
[12/13/21 18:35:11]: Your next step might be functional annotation, suggested commands:
-------------------------------------------------------
Run InterProScan (manual install): 
funannotate iprscan -i /media/bigvol/idumville/05funannotate/03funannotatewgm/1111_funannotate_imogen_01_gm -c 20

Run antiSMASH (optional): 
funannotate remote -i /media/bigvol/idumville/05funannotate/03funannotatewgm/1111_funannotate_imogen_01_gm -m antismash -e youremail@server.edu

Annotate Genome: 
funannotate annotate -i /media/bigvol/idumville/05funannotate/03funannotatewgm/1111_funannotate_imogen_01_gm --cpus 20 --sbt yourSBTfile.txt
-------------------------------------------------------

[12/13/21 18:35:11]: Training parameters file saved: /media/bigvol/idumville/05funannotate/03funannotatewgm/1111_funannotate_imogen_01_gm/predict_results/pseudogymnoascus_destructans.parameters.json
[12/13/21 18:35:11]: Add species parameters to database:

  funannotate species -s pseudogymnoascus_destructans -a /media/bigvol/idumville/05funannotate/03funannotatewgm/1111_funannotate_imogen_01_gm/predict_results/pseudogymnoascus_destructans.parameters.json
nextgenusfs commented 2 years ago

So it did parse the GeneMark models:

[12/13/21 18:31:13]: Summary of gene models: {'total': 30708, 'Augustus': 6351, 'HiQ': 43, 'GeneMark': 8207, 'GlimmerHMM': 12111, 'snap': 3996}

However, because GeneMark is not in your PATH and GENEMARK_PATH is not set, funannotate set the GeneMark weight to 0, so those models did not get passed to EVM.

[12/13/21 18:31:13]: EVM Weights: {'Augustus': 1, 'HiQ': 2, 'GeneMark': 0, 'GlimmerHMM': 1, 'snap': 1, 'proteins': 1}
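
The same check can be done on a saved log rather than by eye. This is a minimal sketch, assuming the log contains an "EVM Weights" line in the format quoted above; the helper name `genemark_weight` is hypothetical and not part of funannotate.

```shell
# Hypothetical helper (not part of funannotate): read log text on stdin and
# print the GeneMark weight from the last "EVM Weights" line.
# Assumes the line format shown in the log above.
genemark_weight() {
  grep 'EVM Weights:' | tail -n 1 \
    | sed -n "s/.*'GeneMark': \([0-9][0-9]*\).*/\1/p"
}

# Example using the line quoted above.
printf '%s\n' "[12/13/21 18:31:13]: EVM Weights: {'Augustus': 1, 'HiQ': 2, 'GeneMark': 0, 'GlimmerHMM': 1, 'snap': 1, 'proteins': 1}" \
  | genemark_weight

# With a real run, feed it your own log file instead (path is up to you):
# genemark_weight < path/to/your/predict.log
```

A weight of 0 here, next to a non-zero GeneMark count in the "Summary of gene models" line, matches the situation described in this comment.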

So you just need to have gmes_petap.pl in your PATH, or you can export the environment variable GENEMARK_PATH, i.e.

export GENEMARK_PATH=/the/actual/path/to/dir/containing-genemark
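
Before rerunning, it may help to confirm that the shell launching funannotate can actually see GeneMark. The sketch below mirrors the two options in this comment (PATH first, then GENEMARK_PATH). It is not funannotate's own detection code, and the function name `find_genemark` is hypothetical.

```shell
# Hypothetical helper: report where gmes_petap.pl would be found.
# Assumption: GENEMARK_PATH, when set, is the directory that contains
# gmes_petap.pl, as in the export example above.
find_genemark() {
  if command -v gmes_petap.pl >/dev/null 2>&1; then
    echo "PATH: $(command -v gmes_petap.pl)"
  elif [ -n "${GENEMARK_PATH:-}" ] && [ -f "$GENEMARK_PATH/gmes_petap.pl" ]; then
    echo "GENEMARK_PATH: $GENEMARK_PATH/gmes_petap.pl"
  else
    echo "not found"
    return 1
  fi
}

find_genemark || echo "GeneMark is not visible from this shell"
```

Run it in the same environment (same venv or conda env, same user) that runs funannotate predict, since an export made in a different shell session will not carry over.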
nextgenusfs commented 2 years ago

Also, it seems like snap might be broken on your install. If it was installed through conda on a Debian system, this is a known bug. You can delete the conda copy, i.e. conda uninstall -n your_env snap --force, and then install it with the apt-get package manager: sudo apt-get install snap.

Imogen-D commented 2 years ago

Thanks! I didn't realise I still needed to export GENEMARK_PATH. I also fixed the SNAP issue: I had originally used a file with different FASTA headers for the GeneMark input.
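
For anyone hitting the same header mismatch, here is a rough way to spot it before running predict. It is a sketch only, assuming a tab-delimited GTF with the sequence ID in column 1 and FASTA IDs taken as the first word after '>'; the function name `missing_seqids` is hypothetical.

```shell
# Hypothetical helper: print sequence IDs that appear in column 1 of a GTF
# but not among the FASTA header IDs (first word after '>').
missing_seqids() {
  gtf=$1
  fasta=$2
  ids_gtf=$(mktemp)
  ids_fa=$(mktemp)
  # Column 1 of every non-comment, non-empty GTF line.
  awk -F'\t' '!/^#/ && NF { print $1 }' "$gtf" | sort -u > "$ids_gtf"
  # Header lines, with '>' and anything after the first whitespace removed.
  grep '^>' "$fasta" | sed 's/^>//; s/[[:space:]].*//' | sort -u > "$ids_fa"
  # comm -23 keeps lines that occur only in the first sorted file.
  comm -23 "$ids_gtf" "$ids_fa"
  rm -f "$ids_gtf" "$ids_fa"
}

# Example call (file names are placeholders for your own inputs):
# missing_seqids genemark.gtf genome.softmasked.fa
```

Empty output means every sequence ID in the GTF was found in the FASTA; any ID printed points to a header mismatch between the two files.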