nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

Species is recognized as "anidulans"? #920

Open lukaskon opened 1 year ago

lukaskon commented 1 year ago

Are you using the latest release? v1.8.15

Describe the bug

I am new to this and am having a hard time understanding the differences between --species, --busco_seed_species, and --augustus_species. I am annotating Botrytis cinerea, so I ran the code below but it appears that augustus is still running with the flag "--species=anidulans". I would like to use the pre-trained set (botrytis_cinerea). TIA!

What command did you issue?


#!/bin/bash --login
########### Define Resources Needed with SBATCH Lines ##########

#SBATCH --time=06:00:00             # limit of wall clock time - how long the job will run (same as -t)
#SBATCH --ntasks=1                  # number of tasks - how many tasks (nodes) that you require (same as -n)
#SBATCH --cpus-per-task=8           # number of CPUs (or cores) per task (same as -c)
#SBATCH --mem=120G                    # memory required per node - amount of memory (in bytes)
#SBATCH --job-name Predict_loop      # you can give your job a name for easier identification (same as -J)
#SBATCH -o Predict_loop3_slurm

########## Command Lines to Run ##########

module purge
conda activate funannotate

cd /mnt/research/Hausbeck_group/Lukasko/BotrytisDNASeq/CCR7/SPAdes_assemblies

for infile in *_assembly

do

base=$(basename ${infile} _assembly)

funannotate predict -i  ${base}_assembly/${base}_CSM.fasta \
-o ../Predict_Annotate/${base}_fun \
--species "Botrytis cinerea" \
--isolate ${base} \
#--name QNM03 --SeqAccession SAMN35162124 \
--busco_seed_species botrytis_cinerea \
--augustus_species botrytis_cinerea \
--cpus 32

done

Logfiles

From slurm: /var/lib/slurmd/job16193554/slurm_script: line 30: --busco_seed_species: command not found

First isolate in loop was successful? AF13_funannotate-predict.log

From B5 funannotate-predict.log B5_funannotate-predict.log

[05/30/23 22:09:01]: /mnt/home/lukaskon/anaconda3/envs/funannotate/bin/funannotate predict -i B5_assembly/B5_CSM.fasta -o ../Predict_Annotate/B5_fun --species Botrytis cinerea --isolate B5

[05/30/23 22:09:01]: OS: CentOS Linux 7, 128 cores, ~ 528 GB RAM. Python: 3.8.16
[05/30/23 22:09:01]: Running funannotate v1.8.15
[05/30/23 22:09:01]: GeneMark not found and $GENEMARK_PATH environmental variable missing. Will skip GeneMark ab-initio prediction.
05/30/23 22:09:02: exonerate version=exonerate 2.4.0 path=/mnt/home/lukaskon/anaconda3/envs/funannotate/bin/exonerate
05/30/23 22:09:02: diamond version=2.1.6 path=/mnt/home/lukaskon/anaconda3/envs/funannotate/bin/diamond
05/30/23 22:09:02: tbl2asn version=25.8 path=/mnt/home/lukaskon/anaconda3/envs/funannotate/bin/tbl2asn
05/30/23 22:09:02: bedtools version=bedtools v2.31.0 path=/mnt/home/lukaskon/anaconda3/envs/funannotate/bin/bedtools
05/30/23 22:09:02: augustus version=3.5.0 path=/mnt/home/lukaskon/anaconda3/envs/funannotate/bin/augustus
05/30/23 22:09:02: etraining version=NA path=/mnt/home/lukaskon/anaconda3/envs/funannotate/bin/etraining
05/30/23 22:09:02: tRNAscan-SE version=2.0.11 (Oct 2022) path=/mnt/home/lukaskon/anaconda3/envs/funannotate/bin/tRNAscan-SE
05/30/23 22:09:02: bam2hints version=NA path=/mnt/home/lukaskon/anaconda3/envs/funannotate/bin/bam2hints
05/30/23 22:09:02: minimap2 version=2.26-r1175 path=/mnt/home/lukaskon/anaconda3/envs/funannotate/bin/minimap2
05/30/23 22:09:02: {'augustus': 1, 'hiq': 2, 'genemark': 0, 'pasa': 6, 'codingquarry': 0, 'snap': 1, 'glimmerhmm': 1, 'proteins': 1, 'transcripts': 1}
05/30/23 22:09:02: Skipping CodingQuarry as no --rna_bam passed
05/30/23 22:09:02: {'augustus': 'busco', 'snap': 'busco', 'glimmerhmm': 'busco'}
05/30/23 22:09:02: Parsed training data, run ab-initio gene predictors as follows:
05/30/23 22:09:02: augustus --species=anidulans --proteinprofile=/mnt/home/lukaskon/anaconda3/envs/funannotate/lib/python3.8/site-packages/funannotate/config/EOG092C0B3U.prfl /mnt/home/lukaskon/anaconda3/envs/funannotate/lib/python3.8/site-packages/funannotate/config/busco_test.fa
[05/30/23 22:09:03]: {'augustus': 1, 'hiq': 2, 'genemark': 0, 'pasa': 6, 'codingquarry': 0, 'snap': 1, 'glimmerhmm': 1, 'proteins': 1, 'transcripts': 1}
[05/30/23 22:09:11]: Loading genome assembly and parsing soft-masked repetitive sequences
[05/30/23 22:09:15]: Genome loaded: 410 scaffolds; 42,498,250 bp; 6.36% repeats masked

OS/Install Information

Checking dependencies for 1.8.15

You are running Python v 3.8.16. Now checking python packages...
biopython: 1.81
goatools: 1.2.3
matplotlib: 3.4.3
natsort: 8.3.1
numpy: 1.24.3
pandas: 1.5.3
psutil: 5.9.5
requests: 2.31.0
scikit-learn: 1.2.2
scipy: 1.10.1
seaborn: 0.12.2
All 11 python packages installed

You are running Perl v b'5.032001'. Now checking perl modules...
Carp: 1.50
Clone: 0.46
DBD::SQLite: 1.72
DBD::mysql: 4.046
DBI: 1.643
DB_File: 1.858
Data::Dumper: 2.183
File::Basename: 2.85
File::Which: 1.24
Getopt::Long: 2.54
Hash::Merge: 0.302
JSON: 4.10
LWP::UserAgent: 6.67
Logger::Simple: 2.0
POSIX: 1.94
Parallel::ForkManager: 2.02
Pod::Usage: 1.69
Scalar::Util::Numeric: 0.40
Storable: 3.15
Text::Soundex: 3.05
Thread::Queue: 3.14
Tie::File: 1.06
URI::Escape: 5.17
YAML: 1.30
local::lib: 2.000029
threads: 2.25
threads::shared: 1.61
All 27 Perl modules installed

Checking Environmental Variables...
$FUNANNOTATE_DB=/mnt/home/lukaskon/funannotate_db
$PASAHOME=/mnt/home/lukaskon/anaconda3/envs/funannotate/opt/pasa-2.5.2
$TRINITY_HOME=/mnt/home/lukaskon/anaconda3/envs/funannotate/opt/trinity-2.8.5
$EVM_HOME=/mnt/home/lukaskon/anaconda3/envs/funannotate/opt/evidencemodeler-1.1.1
$AUGUSTUS_CONFIG_PATH=/mnt/home/lukaskon/anaconda3/envs/funannotate/config/
ERROR: GENEMARK_PATH not set. export GENEMARK_PATH=/path/to/dir

Checking external dependencies...
samtools: /mnt/ufs18/home-054/lukaskon/anaconda3/envs/funannotate/bin/../lib/libtinfow.so.6: no version information available (required by samtools)
samtools: /mnt/ufs18/home-054/lukaskon/anaconda3/envs/funannotate/bin/../lib/libncursesw.so.6: no version information available (required by samtools)
samtools: /mnt/ufs18/home-054/lukaskon/anaconda3/envs/funannotate/bin/../lib/libncursesw.so.6: no version information available (required by samtools)
PASA: 2.5.2
CodingQuarry: 2.0
Trinity: 2.8.5
augustus: 3.5.0
bamtools: bamtools 2.5.1
bedtools: bedtools v2.31.0
blat: BLAT v37x1
diamond: 2.1.6
ete3: 3.1.2
exonerate: exonerate 2.4.0
fasta: 36.3.8g
glimmerhmm: 3.0.4
gmap: 2023-04-28
hisat2: 2.2.1
hmmscan: HMMER 3.3.2 (Nov 2020)
hmmsearch: HMMER 3.3.2 (Nov 2020)
java: 17.0.3-internal
kallisto: 0.46.1
mafft: v7.520 (2023/Mar/22)
makeblastdb: makeblastdb 2.14.0+
minimap2: 2.26-r1175
pigz: 2.6
proteinortho: 6.2.3
pslCDnaFilter: no way to determine
salmon: salmon 0.14.1
samtools: samtools 1.16.1
snap: 2006-07-28
stringtie: 2.2.1
tRNAscan-SE: 2.0.11 (Oct 2022)
tantan: tantan 40
tbl2asn: 25.8
tblastn: tblastn 2.14.0+
trimal: trimAl v1.4.rev15 build[2013-12-17]
trimmomatic: 0.39
ERROR: emapper.py not installed
ERROR: gmes_petap.pl not installed
ERROR: signalp not installed

lukaskon commented 1 year ago

I am having the same issue when I go to annotate:

module purge
conda activate funannotate

cd /mnt/research/Hausbeck_group/Lukasko/BotrytisDNASeq/CCR7/Predict_Annotate

for infile in AI*

do

base=$(basename ${infile} _fun)

cd ${base}_fun

funannotate annotate -i predict_results \
--species "Botrytis cinerea" \
--iprscan predict_results/${base}_ipr.xml \
--antismash predict_results/${base}_smash/${base}_smash.gbk \
--signalp predict_results/signalp/prediction_results.txt \
--busco_db helotiales_odb10 \
--isolate BM16 \
--cpus 24 \
--force

cd ../

done

conda deactivate

BUSCO version is: 2.0
The lineage dataset is: helotiales_odb10 (Creation date: 2020-08-05, number of species: 14, number of BUSCOs: 5177)
To reproduce this run: python /mnt/home/lukaskon/anaconda3/envs/funannotate/lib/python3.8/site-packages/funannotate/aux_scripts/funannotate-BUSCO2.py -i /mnt/ufs18/rs-008/Hausbeck_group/Lukasko/BotrytisDNASeq/CCR7/Predict_Annotate/AI7_fun/annotate_misc/genome.proteins.fasta -o busco -l /mnt/home/lukaskon/funannotate_db/helotiales_odb10/ -m proteins -c 24 -sp aspergillus_nidulans

I don't think the species flag is actually doing anything here but regardless, why is it not recognizing the correct species?

hyphaltip commented 1 year ago

I think these have been asked and answered before in issues such as #645 and #251, among other queries, but:

hyphaltip commented 1 year ago

I don't understand what your error is here. Can you be more specific? The species flag is set at the predict step and carried forward; the pipeline doesn't assume you would change the species between the predict and annotate steps.

As I thought I indicated, the species flag sets the name used for the output files and the description of this genome. It has nothing to do with the BUSCO training other than what the resulting trained files will be called.

The error in your first bug report seems to be because your multi-line command was thrown off by the comment line. You can't interrupt a backslash-continued command with a comment line, so everything after it is not part of the funannotate command. That is why the log shows the command ending at --isolate B5, and why slurm reports "--busco_seed_species: command not found".
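A minimal sketch of one way to avoid this, assuming bash: collect the arguments in an array, where comment lines are allowed, and expand the array on a single command line. The isolate name B5 and the fallback of 8 CPUs are placeholders taken from this thread, and the script only prints the command (a dry run) instead of running funannotate.

```bash
#!/bin/bash
# Sketch: keep the funannotate arguments in a bash array. Comment lines are
# allowed inside an array literal, so an optional flag can be disabled
# without cutting the command short.
base="B5"   # placeholder isolate name (from the log in this thread)

args=(
  -i "${base}_assembly/${base}_CSM.fasta"
  -o "../Predict_Annotate/${base}_fun"
  --species "Botrytis cinerea"
  --isolate "${base}"
  # --name QNM03 --SeqAccession SAMN35162124
  --busco_seed_species botrytis_cinerea
  --augustus_species botrytis_cinerea
  # Match the Slurm allocation; falls back to 8 outside of Slurm (assumption).
  --cpus "${SLURM_CPUS_PER_TASK:-8}"
)

# Dry run: build a shell-quoted preview of the command and print it.
cmd_preview=$(printf '%q ' funannotate predict "${args[@]}")
echo "${cmd_preview}"

# To run for real, replace the dry run with:
# funannotate predict "${args[@]}"
```

The original script also passed --cpus 32 while requesting only 8 CPUs per task from Slurm (--cpus-per-task=8); reading SLURM_CPUS_PER_TASK keeps the two in step.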


JoseLopezArcondo commented 1 month ago

Well, in the first comment I think she is referring to the line containing "augustus --species=anidulans" in the output. It happens to me and everyone else in our lab too. Is this relevant, or does it change nothing and is it just an output error? Thanks a lot

hyphaltip commented 1 month ago

Hi, the species flag doesn't matter for protein annotation. The annotate step only matches proteins; it isn't running augustus anymore. That only happens in the predict step, on an unannotated genome.

I'll check whether @nextgenusfs thinks we can remove the -sp flag entirely when running BUSCO in this mode, but -m proteins in BUSCO means it is only doing a protein comparison and not invoking augustus at all.

hyphaltip commented 1 month ago

To answer the first question, for predict:

  1. --busco_seed_species botrytis_cinerea: this sets which augustus species parameters to start from during the BUSCO runs. BUSCO can start with a default parameter set and, if the --long option is added, will refine the augustus training to achieve better gene models for the targeted set of BUSCO loci. How well this performs depends on the species and on how far away the starting model set is. It isn't always clear that the extra computational time produces more accurate gene models, as evidence (protein and RNAseq) often has a much higher impact.
  2. --augustus_species botrytis_cinerea: this sets which pre-trained augustus species parameters to use. For many projects it is better to let funannotate train the gene predictors using the BUSCO protein alignments (or better yet the RNAseq transcripts); however, you can set a pre-trained model for the predict step in augustus. It will still need to develop a training set for SNAP predictions if you want to include those in the composite gene model prediction. As for whether additional re-training and optimization has a major impact: the --optimize_augustus option can be invoked to add further levels of optimization, but its relative impact varies a lot and is not necessarily worth the extra time.
  3. --species: this defines the species name. It is what will be included in the GenBank record, and it determines how the output files are named. You should pass --species "Botrytis cinerea" when you run it; it will automatically add underscores when naming the output files that go in predict_results. The filenames will be SPECIES_STRAIN, with any spaces converted to underscores.
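To make points 2 and 3 concrete, here is a small bash sketch. It is an illustration only, not funannotate's own code: the prefix line mimics the SPECIES_STRAIN naming rule described above, and has_augustus_species is a hypothetical helper that assumes the standard Augustus layout, in which each pre-trained parameter set is a directory under $AUGUSTUS_CONFIG_PATH/species.

```bash
#!/bin/bash
# Illustration only; the names below are placeholders from this thread.
species="Botrytis cinerea"
isolate="B5"

# Naming rule from point 3: spaces become underscores, then SPECIES_STRAIN.
prefix="${species// /_}_${isolate}"
echo "output prefix: ${prefix}"

# Hypothetical helper: succeeds if a parameter directory with this name
# exists under $AUGUSTUS_CONFIG_PATH/species (standard Augustus layout).
has_augustus_species() {
  local name="$1"
  local cfg="${AUGUSTUS_CONFIG_PATH:-}"
  [ -d "${cfg%/}/species/${name}" ]
}

if has_augustus_species botrytis_cinerea; then
  echo "pre-trained set found: botrytis_cinerea"
else
  echo "no botrytis_cinerea directory under AUGUSTUS_CONFIG_PATH" >&2
fi
```

Running a check like this before a long predict job is a cheap way to catch a misspelled --augustus_species value or an unset AUGUSTUS_CONFIG_PATH.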