Species is recognized as "anidulans"?

Are you using the latest release? v1.8.15

Describe the bug

I am new to this and am having a hard time understanding the differences between --species, --busco_seed_species, and --augustus_species. I am annotating Botrytis cinerea, so I ran the code below but it appears that augustus is still running with the flag "--species=anidulans". I would like to use the pre-trained set (botrytis_cinerea). TIA!

What command did you issue?


#!/bin/bash --login
########### Define Resources Needed with SBATCH Lines ##########

#SBATCH --time=06:00:00             # limit of wall clock time - how long the job will run (same as -t)
#SBATCH --ntasks=1                  # number of tasks - how many tasks (nodes) that you require (same as -n)
#SBATCH --cpus-per-task=8           # number of CPUs (or cores) per task (same as -c)
#SBATCH --mem=120G                    # memory required per node - amount of memory (in bytes)
#SBATCH --job-name Predict_loop      # you can give your job a name for easier identification (same as -J)
#SBATCH -o Predict_loop3_slurm

########## Command Lines to Run ##########

module purge
conda activate funannotate

cd /mnt/research/Hausbeck_group/Lukasko/BotrytisDNASeq/CCR7/SPAdes_assemblies

for infile in *_assembly

do

base=$(basename ${infile} _assembly)

funannotate predict -i  ${base}_assembly/${base}_CSM.fasta \
-o ../Predict_Annotate/${base}_fun \
--species "Botrytis cinerea" \
--isolate ${base} \
#--name QNM03 --SeqAccession SAMN35162124 \
--busco_seed_species botrytis_cinerea \
--augustus_species botrytis_cinerea \
--cpus 32

done

Logfiles

From slurm: /var/lib/slurmd/job16193554/slurm_script: line 30: --busco_seed_species: command not found

First isolate in loop was successful? AF13_funannotate-predict.log

From B5 funannotate-predict.log B5_funannotate-predict.log

[05/30/23 22:09:01]: /mnt/home/lukaskon/anaconda3/envs/funannotate/bin/funannotate predict -i B5_assembly/B5_CSM.fasta -o ../Predict_Annotate/B5_fun --species Botrytis cinerea --isolate B5

[05/30/23 22:09:01]: OS: CentOS Linux 7, 128 cores, ~ 528 GB RAM. Python: 3.8.16
[05/30/23 22:09:01]: Running funannotate v1.8.15
[05/30/23 22:09:01]: GeneMark not found and $GENEMARK_PATH environmental variable missing. Will skip GeneMark ab-initio prediction.
05/30/23 22:09:02: exonerate version=exonerate 2.4.0 path=/mnt/home/lukaskon/anaconda3/envs/funannotate/bin/exonerate
05/30/23 22:09:02: diamond version=2.1.6 path=/mnt/home/lukaskon/anaconda3/envs/funannotate/bin/diamond
05/30/23 22:09:02: tbl2asn version=25.8 path=/mnt/home/lukaskon/anaconda3/envs/funannotate/bin/tbl2asn
05/30/23 22:09:02: bedtools version=bedtools v2.31.0 path=/mnt/home/lukaskon/anaconda3/envs/funannotate/bin/bedtools
05/30/23 22:09:02: augustus version=3.5.0 path=/mnt/home/lukaskon/anaconda3/envs/funannotate/bin/augustus
05/30/23 22:09:02: etraining version=NA path=/mnt/home/lukaskon/anaconda3/envs/funannotate/bin/etraining
05/30/23 22:09:02: tRNAscan-SE version=2.0.11 (Oct 2022) path=/mnt/home/lukaskon/anaconda3/envs/funannotate/bin/tRNAscan-SE
05/30/23 22:09:02: bam2hints version=NA path=/mnt/home/lukaskon/anaconda3/envs/funannotate/bin/bam2hints
05/30/23 22:09:02: minimap2 version=2.26-r1175 path=/mnt/home/lukaskon/anaconda3/envs/funannotate/bin/minimap2

05/30/23 22:09:02: {'augustus': 1, 'hiq': 2, 'genemark': 0, 'pasa': 6, 'codingquarry': 0, 'snap': 1, 'glimmerhmm': 1, 'proteins': 1, 'transcripts': 1}
05/30/23 22:09:02: Skipping CodingQuarry as no --rna_bam passed
05/30/23 22:09:02: {'augustus': 'busco', 'snap': 'busco', 'glimmerhmm': 'busco'}
05/30/23 22:09:02: Parsed training data, run ab-initio gene predictors as follows:
05/30/23 22:09:02: augustus --species=anidulans --proteinprofile=/mnt/home/lukaskon/anaconda3/envs/funannotate/lib/python3.8/site-packages/funannotate/config/EOG092C0B3U.prfl /mnt/home/lukaskon/anaconda3/envs/funannotate/lib/python3.8/site-packages/funannotate/config/busco_test.fa
[05/30/23 22:09:03]: {'augustus': 1, 'hiq': 2, 'genemark': 0, 'pasa': 6, 'codingquarry': 0, 'snap': 1, 'glimmerhmm': 1, 'proteins': 1, 'transcripts': 1}
[05/30/23 22:09:11]: Loading genome assembly and parsing soft-masked repetitive sequences
[05/30/23 22:09:15]: Genome loaded: 410 scaffolds; 42,498,250 bp; 6.36% repeats masked

OS/Install Information

Checking dependencies for 1.8.15

You are running Python v 3.8.16. Now checking python packages... biopython: 1.81 goatools: 1.2.3 matplotlib: 3.4.3 natsort: 8.3.1 numpy: 1.24.3 pandas: 1.5.3 psutil: 5.9.5 requests: 2.31.0 scikit-learn: 1.2.2 scipy: 1.10.1 seaborn: 0.12.2 All 11 python packages installed

You are running Perl v b'5.032001'. Now checking perl modules... Carp: 1.50 Clone: 0.46 DBD::SQLite: 1.72 DBD::mysql: 4.046 DBI: 1.643 DB_File: 1.858 Data::Dumper: 2.183 File::Basename: 2.85 File::Which: 1.24 Getopt::Long: 2.54 Hash::Merge: 0.302 JSON: 4.10 LWP::UserAgent: 6.67 Logger::Simple: 2.0 POSIX: 1.94 Parallel::ForkManager: 2.02 Pod::Usage: 1.69 Scalar::Util::Numeric: 0.40 Storable: 3.15 Text::Soundex: 3.05 Thread::Queue: 3.14 Tie::File: 1.06 URI::Escape: 5.17 YAML: 1.30 local::lib: 2.000029 threads: 2.25 threads::shared: 1.61 All 27 Perl modules installed

Checking Environmental Variables...
$FUNANNOTATE_DB=/mnt/home/lukaskon/funannotate_db
$PASAHOME=/mnt/home/lukaskon/anaconda3/envs/funannotate/opt/pasa-2.5.2
$TRINITY_HOME=/mnt/home/lukaskon/anaconda3/envs/funannotate/opt/trinity-2.8.5
$EVM_HOME=/mnt/home/lukaskon/anaconda3/envs/funannotate/opt/evidencemodeler-1.1.1
$AUGUSTUS_CONFIG_PATH=/mnt/home/lukaskon/anaconda3/envs/funannotate/config/
ERROR: GENEMARK_PATH not set. export GENEMARK_PATH=/path/to/dir

Checking external dependencies... samtools: /mnt/ufs18/home-054/lukaskon/anaconda3/envs/funannotate/bin/../lib/libtinfow.so.6: no version information available (required by samtools) samtools: /mnt/ufs18/home-054/lukaskon/anaconda3/envs/funannotate/bin/../lib/libncursesw.so.6: no version information available (required by samtools) samtools: /mnt/ufs18/home-054/lukaskon/anaconda3/envs/funannotate/bin/../lib/libncursesw.so.6: no version information available (required by samtools) PASA: 2.5.2 CodingQuarry: 2.0 Trinity: 2.8.5 augustus: 3.5.0 bamtools: bamtools 2.5.1 bedtools: bedtools v2.31.0 blat: BLAT v37x1 diamond: 2.1.6 ete3: 3.1.2 exonerate: exonerate 2.4.0 fasta: 36.3.8g glimmerhmm: 3.0.4 gmap: 2023-04-28 hisat2: 2.2.1 hmmscan: HMMER 3.3.2 (Nov 2020) hmmsearch: HMMER 3.3.2 (Nov 2020) java: 17.0.3-internal kallisto: 0.46.1 mafft: v7.520 (2023/Mar/22) makeblastdb: makeblastdb 2.14.0+ minimap2: 2.26-r1175 pigz: 2.6 proteinortho: 6.2.3 pslCDnaFilter: no way to determine salmon: salmon 0.14.1 samtools: samtools 1.16.1 snap: 2006-07-28 stringtie: 2.2.1 tRNAscan-SE: 2.0.11 (Oct 2022) tantan: tantan 40 tbl2asn: 25.8 tblastn: tblastn 2.14.0+ trimal: trimAl v1.4.rev15 build[2013-12-17] trimmomatic: 0.39 ERROR: emapper.py not installed ERROR: gmes_petap.pl not installed ERROR: signalp not installed

I am having the same issue when I go to annotate:

module purge
conda activate funannotate

cd /mnt/research/Hausbeck_group/Lukasko/BotrytisDNASeq/CCR7/Predict_Annotate

for infile in AI*

do

base=$(basename ${infile} _fun)

cd ${base}_fun

funannotate annotate -i predict_results \
--species "Botrytis cinerea" \
--iprscan predict_results/${base}_ipr.xml \
--antismash predict_results/${base}_smash/${base}_smash.gbk \
--signalp predict_results/signalp/prediction_results.txt \
--busco_db helotiales_odb10 \
--isolate BM16 \
--cpus 24 \
--force

cd ../

done

conda deactivate

BUSCO version is: 2.0
The lineage dataset is: helotiales_odb10 (Creation date: 2020-08-05, number of species: 14, number of BUSCOs: 5177)
To reproduce this run: python /mnt/home/lukaskon/anaconda3/envs/funannotate/lib/python3.8/site-packages/funannotate/aux_scripts/funannotate-BUSCO2.py -i /mnt/ufs18/rs-008/Hausbeck_group/Lukasko/BotrytisDNASeq/CCR7/Predict_Annotate/AI7_fun/annotate_misc/genome.proteins.fasta -o busco -l /mnt/home/lukaskon/funannotate_db/helotiales_odb10/ -m proteins -c 24 -sp aspergillus_nidulans

I don't think the species flag is actually doing anything here but regardless, why is it not recognizing the correct species?

I think these have been asked and answered before in other issues, such as #645 and #251, but:

--species : the name of the species you are annotating; the output file names, taxonomy string, etc. are populated from it.
--busco_seed_species : during training, BUSCO is used to identify conserved genes, and this flag sets which Augustus parameter set it starts from when running prediction. Ideally you have a pre-trained parameter set for your species. funannotate generates one in the process of running, and you can re-run with those trained parameters should you choose. They are saved in predict_misc/ab_initio_parameters/species_strain, where species and strain/isolate are the names you gave above (funannotate also takes --isolate and --strain options).
--augustus_species : if you have a pre-trained Augustus parameter set in $AUGUSTUS_CONFIG_PATH or in $FUNANNOTATE_DB/trained_species, you can provide its name here. funannotate then won't attempt to train Augustus and will use the pre-trained parameters instead, which can save time. This is useful if you are annotating several strains of the same species and want to reuse prediction parameters across them.
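
Before passing a name to --augustus_species, it can help to confirm that a parameter set with that name is actually present. The sketch below only checks the two locations mentioned above. The helper name has_trained_species is made up for illustration, and the assumption that each parameter set is a directory named after the species (species/NAME under $AUGUSTUS_CONFIG_PATH, trained_species/NAME under $FUNANNOTATE_DB) should be verified against your own install.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of funannotate): succeed if a parameter
# directory called NAME exists in either location named in the thread.
# Assumed layout: $AUGUSTUS_CONFIG_PATH/species/NAME and
#                 $FUNANNOTATE_DB/trained_species/NAME
has_trained_species() {
  local name=$1
  [ -d "${AUGUSTUS_CONFIG_PATH:-/nonexistent}/species/$name" ] ||
    [ -d "${FUNANNOTATE_DB:-/nonexistent}/trained_species/$name" ]
}

# Report whether the name used in this issue can be found.
if has_trained_species botrytis_cinerea; then
  echo "botrytis_cinerea: parameter set found"
else
  echo "botrytis_cinerea: no parameter set found in either location"
fi
```

If the name is not found, the thread's advice applies: let funannotate train the predictors rather than pointing --augustus_species at a set that is not there.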

I don't understand what your error is here. Can you be more specific? The species flag is set at the predict step and carried forward; funannotate doesn't assume you would change the species between the predict and annotate steps.

As I thought I indicated, the species flag sets the name for the output files and the description of this genome. It has nothing to do with the BUSCO training other than what the resulting trained files will be called.

The error in your first bug report seems to be because your multi-line command was thrown off by the comment line. You can't interrupt a multi-line command with a comment line: the command ends there, so everything after it is not part of the funannotate command. That is why Slurm reports "--busco_seed_species: command not found" and why the logged command stops at --isolate.
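
The comment-line problem can be reproduced without funannotate. This is a minimal sketch in which count_args is a made-up stand-in that prints how many arguments it received; the file name and flag values are placeholders. The second function shows one common bash workaround, collecting the options in an array, where comment lines between elements are allowed.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the real program: prints its argument count.
count_args() { echo "$#"; }

# Broken pattern: the backslash on the comment line is part of the comment,
# so the command ends there and "--cpus 8" is run as a separate command.
demo_broken() {
  count_args -i genome.fasta \
    --isolate B5 \
    #--name QNM03 \
    --cpus 8
}

# One workaround: build the options in an array. Comments may sit between
# array elements, and no trailing backslashes are needed.
demo_fixed() {
  local args=(
    -i genome.fasta
    --isolate B5
    # --name QNM03
    --cpus 8
  )
  count_args "${args[@]}"
}

demo_broken 2>/dev/null || true   # prints 4; the "--cpus: command not found" error is discarded
demo_fixed                        # prints 6
```

The simpler fix is to move the commented-out options above or below the command. The array form is mainly useful when options are toggled often.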

Jason Stajich - @.***

Well, in the first comment I think she is referring to the line containing "augustus --species=anidulans" in the output. It happens to me and everyone else in our lab too. Is this relevant, or is it just an output error that does not change anything? Thanks a lot.

Hi - the species flag doesn't matter for protein annotation. The annotate step only matches proteins; it isn't running Augustus anymore. That only happens in the predict step, on an unannotated genome.

I'll check whether @nextgenusfs thinks we can remove the -sp flag entirely when running BUSCO in this mode, but -m proteins in BUSCO means it is only doing a protein comparison and not invoking Augustus at all.

To answer the first question, about predict:

--busco_seed_species botrytis_cinerea -- this is the Augustus starting species to use for the BUSCO runs. BUSCO can start with a default parameter set and, if the --long option is added, will refine the Augustus training to achieve better gene models for the targeted set of BUSCO loci. How well this performs depends on the species and on how far away the starting parameter set is. It isn't always clear that the extra computational time produces more accurate gene models, as evidence (protein and RNAseq) often has a much higher impact.
--augustus_species botrytis_cinerea -- this is the pre-trained Augustus species parameter set you want to use. For many projects it is better to let funannotate train the gene predictors using the BUSCO protein alignments (or, better yet, the RNAseq transcripts), but you can set a pre-trained model for the predict step in Augustus. funannotate will still need to develop a training set for SNAP predictions if you want to include those in the composite gene model prediction. As for whether additional re-training and optimization has a major impact: the --optimize-augustus option can be invoked to add further levels of optimization, but its relative impact varies a lot and is not necessarily worth the extra time.
--species -- this defines the species name. It is what is included in the GenBank record and determines how the output files are named. You should pass --species "Botrytis cinerea" when you run it; underscores are added automatically when naming the output files that go in predict_results. The filenames will be SPECIES_STRAIN, with any spaces converted to underscores.
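
As a small illustration of the naming rule just described (SPECIES_STRAIN with spaces turned into underscores), the sketch below mirrors it in shell. The function name output_basename is hypothetical; funannotate does this internally, and file extensions and any other details of its naming are not shown here.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the rule stated in the thread:
# join species and strain/isolate with "_" and replace spaces with "_".
output_basename() {
  printf '%s_%s\n' "$1" "$2" | tr ' ' '_'
}

output_basename "Botrytis cinerea" "B5"   # prints Botrytis_cinerea_B5
```

This is only meant to help predict which file names to expect in predict_results when scripting later steps.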

nextgenusfs / funannotate

Species is recognized as "anidulans"? #920