nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

the value of 'augustus' is set as 'pasa' in the log of predict #922

Open sqwwww opened 1 year ago

sqwwww commented 1 year ago

Hi, I'm confused about something in the funannotate predict log. In my log the value of 'augustus' is set to 'pasa', while it is set to 'busco' in many other people's logs. I checked several tutorials and didn't find an explanation; could you please explain it?

There is this line in my log:

[05/31/23 11:32:53]: {'augustus': 'pasa', 'genemark': 'selftraining', 'snap': 'pasa', 'glimmerhmm': 'pasa'}

whereas many other people (in other issues here) show:

{'augustus': 'busco', 'snap': 'busco', 'glimmerhmm': 'busco'}

The whole log I got is:

[05/31/23 11:32:42]:funannotate predict -i MyAssembly.fa -o fun --species Dun dac --cpus 16 --max_intronlen 500000 --busco_db actinopterygii --organism other --protein_alignments aln_Danio_rerio.gff --repeats2evm --force

[05/31/23 11:32:42]: OS: CentOS Linux 7, 88 cores, ~ 6241 GB RAM. Python: 3.8.15
[05/31/23 11:32:42]: Running funannotate v1.8.15
[05/31/23 11:32:42]: GeneMark path: /home/tools2/gmes_linux_64_4
[05/31/23 11:32:47]: Full path to gmes_petap.pl: /home/tools2/gmes_linux_64_4/gmes_petap.pl
[05/31/23 11:32:47]: GeneMark appears to be functional? True
[05/31/23 11:32:51]: exonerate version=exonerate 2.4.0 path=/home/miniconda3/envs/mamba/envs/fun/bin/exonerate
[05/31/23 11:32:51]: diamond version=2.1.6 path=/home/miniconda3/envs/mamba/envs/fun/bin/diamond
[05/31/23 11:32:51]: tbl2asn version=25.8 path=/home/miniconda3/envs/mamba/envs/fun/bin/tbl2asn
[05/31/23 11:32:51]: bedtools version=bedtools v2.31.0 path=/home/miniconda3/envs/mamba/envs/fun/bin/bedtools
[05/31/23 11:32:51]: augustus version=3.5.0 path=/home/miniconda3/envs/mamba/envs/fun/bin/augustus
[05/31/23 11:32:51]: etraining version=NA path=/home/miniconda3/envs/mamba/envs/fun/bin/etraining
[05/31/23 11:32:51]: tRNAscan-SE version=2.0.11 (Oct 2022) path=/home/miniconda3/envs/mamba/envs/fun/bin/tRNAscan-SE
[05/31/23 11:32:51]: bam2hints version=NA path=/home/miniconda3/envs/mamba/envs/fun/bin/bam2hints
[05/31/23 11:32:51]: minimap2 version=2.26-r1175 path=/home/miniconda3/envs/mamba/envs/fun/bin/minimap2
[05/31/23 11:32:51]: Found training files, will re-use these files:
--rna_bam fun/training/funannotate_train.coordSorted.bam
--pasa_gff fun/training/funannotate_train.pasa.gff3
--stringtie fun/training/funannotate_train.stringtie.gtf
--transcript_alignments fun/training/funannotate_train.transcripts.gff3
[05/31/23 11:32:53]: {'augustus': 1, 'hiq': 2, 'genemark': 1, 'pasa': 6, 'codingquarry': 0, 'snap': 1, 'glimmerhmm': 1, 'proteins': 1, 'transcripts': 1}
[05/31/23 11:32:53]: Skipping CodingQuarry as --organism=other. Pass a weight larger than 0 to run CQ, ie --weights codingquarry:1
[05/31/23 11:32:53]: {'augustus': 'pasa', 'genemark': 'selftraining', 'snap': 'pasa', 'glimmerhmm': 'pasa'}
[05/31/23 11:32:53]: Parsed training data, run ab-initio gene predictors as follows:
[05/31/23 11:32:53]: augustus --species=anidulans --proteinprofile=/home/miniconda3/envs/mamba/envs/fun/lib/python3.8/site-packages/funannotate/config/EOG092C0B3U.prfl /home/miniconda3/envs/mamba/envs/fun/lib/python3.8/site-packages/funannotate/config/busco_test.fa
[05/31/23 11:32:56]: perl /home/miniconda3/envs/mamba/envs/fun/opt/evidencemodeler-1.1.1/EvmUtils/gff3_gene_prediction_file_validator.pl /home/06.genomeAnno/02.fun/fun/predict_misc/pasa_predictions.gff3
[05/31/23 11:33:04]: {'augustus': 1, 'hiq': 2, 'genemark': 1, 'pasa': 6, 'codingquarry': 0, 'snap': 1, 'glimmerhmm': 1, 'proteins': 1, 'transcripts': 1}
[05/31/23 11:35:12]: Loading genome assembly and parsing soft-masked repetitive sequences
[05/31/23 11:35:27]: Genome loaded: 161 scaffolds; 670,790,457 bp; 42.37% repeats masked
[05/31/23 11:35:34]: Parsed 360,939 transcript alignments from: fun/training/funannotate_train.transcripts.gff3
[05/31/23 11:35:34]: Creating transcript EVM alignments and Augustus transcripts hintsfile
[05/31/23 11:35:44]: Existing RNA-seq BAM hints found: fun/predict_misc/hints.BAM.gff
[05/31/23 11:35:44]: Loading protein alignments /home/06.genomeAnno/05.miniprot/aln_Danio_rerio.gff
[05/31/23 11:36:39]: Running GeneMark-ES on assembly
[05/31/23 11:36:39]: /home/tools2/gmes_linux_64_4/gmes_petap.pl --ES --max_intron 500000 --soft_mask 2000 --cores 16 --sequence /home/06.genomeAnno/02.fun/fun/predict_misc/genome.softmasked.fa

hyphaltip commented 1 year ago

This indicates where the training set for each gene predictor comes from.
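As a rough illustration, the training-source line in funannotate-predict.log can be read as a Python dict literal. The sketch below assumes the log line looks exactly like the one quoted in this issue; the helper name training_sources is made up for illustration and is not part of funannotate.

```python
import ast
import re

# Hypothetical helper (not part of funannotate): pull the dict literal out of
# a log line like the one quoted in this issue.
def training_sources(log_line):
    match = re.search(r"\{.*\}", log_line)
    if match is None:
        return {}
    parsed = ast.literal_eval(match.group(0))
    # Keep only string values so the EVM weights dict (integers) is ignored.
    return {key: value for key, value in parsed.items() if isinstance(value, str)}

line = (
    "[05/31/23 11:32:53]: {'augustus': 'pasa', 'genemark': 'selftraining', "
    "'snap': 'pasa', 'glimmerhmm': 'pasa'}"
)
for predictor, source in training_sources(line).items():
    print(f"{predictor}\t{source}")
```

Each value names the source of the training models for that predictor, so 'pasa' and 'busco' are simply two different training sources.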

It is set to pasa because you ran a train step before running predict, as this part of your log shows:

[05/31/23 11:32:51]: Found training files, will re-use these files:
--rna_bam fun/training/funannotate_train.coordSorted.bam
--pasa_gff fun/training/funannotate_train.pasa.gff3
--stringtie fun/training/funannotate_train.stringtie.gtf

sqwwww commented 1 year ago

Hi sir, thanks for your quick answer, it helps me a lot! I have another problem, similar to issue #920, and I may have more details here. There is a line in my funannotate-predict.log that sets --species=anidulans for augustus, which is not the species I set.

augustus --species=anidulans --proteinprofile=/home/miniconda3/envs/mamba/envs/fun/lib/python3.8/site-packages/funannotate/config/EOG092C0B3U.prfl /home/miniconda3/envs/mamba/envs/fun/lib/python3.8/site-packages/funannotate/config/busco_test.fa

Then I checked ./fun/logfiles/augustus_training.log and found that it had started training my species of interest in augustus; I can find my species under ./fun/predict_misc/ab_initio_parameters/augustus/species/. Unfortunately, the training failed because of a small error (a missing libgsl shared library). I fixed the environment with conda install -c anaconda gsl. The augustus_training.log:

Will create parameters for a EUKARYOTIC species!
creating directory /home/06.genomeAnno/02.fun/fun/predict_misc/ab_initio_parameters/augustus/species/dun_dac/ ...
creating /home/06.genomeAnno/02.fun/fun/predict_misc/ab_initio_parameters/augustus/species/dun_dac/dun_dac_parameters.cfg ...
creating /home/06.genomeAnno/02.fun/fun/predict_misc/ab_initio_parameters/augustus/species/dun_dac/dun_dac_weightmatrix.txt ...
creating /home/06.genomeAnno/02.fun/fun/predict_misc/ab_initio_parameters/augustus/species/dun_dac/dun_dac_metapars.cfg ...
The necessary files for training dun_dac have been created.
Now, either run etraining or optimize_parameters.pl with --species=dun_dac.
etraining quickly estimates the parameters from a file with training genes.
optimize_augustus.pl alternates running etraining and augustus to find optimal metaparameters.

etraining: error while loading shared libraries: libgsl.so.25: cannot open shared object file: No such file or directory
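A line like this can be picked out of a log automatically. This is a minimal sketch that assumes the dynamic loader's usual message format, as seen in the line above; missing_shared_libs is a hypothetical helper name.

```python
import re

# Message format of the dynamic loader error quoted above.
_LOADER_ERROR = re.compile(r"^(\S+): error while loading shared libraries: (\S+):")

def missing_shared_libs(log_text):
    """Return (program, library) pairs for every loader error in log_text."""
    found = []
    for line in log_text.splitlines():
        match = _LOADER_ERROR.match(line.strip())
        if match:
            found.append((match.group(1), match.group(2)))
    return found

log = (
    "etraining: error while loading shared libraries: libgsl.so.25: "
    "cannot open shared object file: No such file or directory"
)
print(missing_shared_libs(log))
```

Running it over augustus_training.log would list which program failed and which library it could not load.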

The contents of fun/predict_misc/ab_initio_parameters/augustus/species/:

ls ./fun/predict_misc/ab_initio_parameters/augustus/species/
anidulans  dun_dac  generic

Then I reran funannotate in another directory (fun2; I soft-linked the training directory into fun2, as I didn't want to stop the gmes run in the original fun directory). This time augustus only trained anidulans, there was no training for my species of interest, and there was no ./fun2/logfiles/augustus_training.log. The contents of ./fun2/predict_misc/ab_initio_parameters/augustus/species:

ls ./fun2/predict_misc/ab_initio_parameters/augustus/species/
anidulans  generic
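A quick way to compare the two runs is to check whether the expected species directory exists. The sketch below assumes the Augustus species folder is named from --species by lowercasing and replacing spaces with underscores, which matches "Dun dac" becoming dun_dac in the log above but may not cover every case; the function names are illustrative only.

```python
from pathlib import Path

def species_dirname(species):
    # Assumption: lowercase + underscores, as in "Dun dac" -> "dun_dac" above.
    return species.strip().lower().replace(" ", "_")

def trained_species(outdir):
    """List species folders under the Augustus parameter directory of a run."""
    species_root = (
        Path(outdir) / "predict_misc" / "ab_initio_parameters" / "augustus" / "species"
    )
    if not species_root.is_dir():
        return []
    return sorted(p.name for p in species_root.iterdir() if p.is_dir())

def has_species(outdir, species):
    return species_dirname(species) in trained_species(outdir)

# Compare the two output directories mentioned in this thread.
for outdir in ("fun", "fun2"):
    print(outdir, trained_species(outdir), has_species(outdir, "Dun dac"))
```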

My command:

funannotate predict -i MyAssembly.fa -o fun2 \
    --species "Dun dac" \
    --cpus 8 \
    --max_intronlen 500000 --busco_db actinopterygii --organism other \
    --protein_alignments /home/06.genomeAnno/05.miniprot/aln_Danio_rerio.gff \
    --repeats2evm --force

I think I could try running augustus outside of funannotate and manually set the species to my target. Could you please give me some suggestions about this situation?

sqwwww commented 1 year ago

An update: it seems that funannotate resolves the problem itself. After gmes finished, it continued with augustus from the point where the error had occurred. Now everything is going smoothly.

It is worth mentioning that gmes was very slow: it took 7 days for a ~600 Mb genome.
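The GeneMark run time can be read off the log timestamps. This sketch uses the "Running GeneMark-ES on assembly" and "predictions from GeneMark" timestamps from the predict log in this thread, and assumes the year 2023 because that log format omits it.

```python
from datetime import datetime

LOG_TIME_FORMAT = "%Y %b %d %I:%M %p"

def parse_log_time(stamp, year=2023):
    # The "[May 31 11:36 AM]" style has no year; 2023 is an assumption.
    return datetime.strptime(f"{year} {stamp}", LOG_TIME_FORMAT)

start = parse_log_time("May 31 11:36 AM")  # Running GeneMark-ES on assembly
end = parse_log_time("Jun 07 01:12 PM")    # 57,909 predictions from GeneMark
elapsed = end - start
print(elapsed)  # 7 days, 1:36:00
```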

sqwwww commented 1 year ago

Hi sir, I hit another error while EVM was running: "IndexError: list index out of range". However, funannotate test -t predict runs successfully. Here is my error log:


[May 31 11:32 AM]: OS: CentOS Linux 7, 88 cores, ~ 6241 GB RAM. Python: 3.8.15
[May 31 11:32 AM]: Running funannotate v1.8.15
[May 31 11:32 AM]: Found training files, will re-use these files:
--rna_bam fun/training/funannotate_train.coordSorted.bam
--pasa_gff fun/training/funannotate_train.pasa.gff3
--stringtie fun/training/funannotate_train.stringtie.gtf
--transcript_alignments fun/training/funannotate_train.transcripts.gff3
[May 31 11:32 AM]: Skipping CodingQuarry as --organism=other. Pass a weight larger than 0 to run CQ, ie --weights codingquarry:1
[May 31 11:32 AM]: Parsed training data, run ab-initio gene predictors as follows:
Program      Training-Method
augustus     pasa
genemark     selftraining
glimmerhmm   pasa
snap         pasa
[May 31 11:35 AM]: Loading genome assembly and parsing soft-masked repetitive sequences
[May 31 11:35 AM]: Genome loaded: 161 scaffolds; 670,790,457 bp; 42.37% repeats masked
[May 31 11:35 AM]: Parsed 360,939 transcript alignments from: fun/training/funannotate_train.transcripts.gff3
[May 31 11:35 AM]: Creating transcript EVM alignments and Augustus transcripts hintsfile
[May 31 11:35 AM]: Existing RNA-seq BAM hints found: fun/predict_misc/hints.BAM.gff
[May 31 11:35 AM]: Loading protein alignments /home/06.genomeAnno/05.miniprot/aln_Danio_rerio.gff
[May 31 11:36 AM]: Running GeneMark-ES on assembly
[Jun 07 01:12 PM]: 57,909 predictions from GeneMark
[Jun 07 01:12 PM]: Filtering PASA data for suitable training set
[Jun 07 01:14 PM]: 5,779 of 44,806 models pass training parameters
[Jun 07 01:14 PM]: Training Augustus using PASA gene models
[Jun 07 01:16 PM]: Augustus initial training results:
Feature       Specificity   Sensitivity
nucleotides   91.6%         90.3%
exons         77.0%         76.2%
genes         15.9%         15.2%
[Jun 07 01:16 PM]: Accuracy seems low, you can try to improve by passing the --optimize_augustus option.
[Jun 07 01:16 PM]: Running Augustus gene prediction using Dun_dac parameters
[Jun 07 02:28 PM]: 36,128 predictions from Augustus
[Jun 07 02:28 PM]: Pulling out high quality Augustus predictions
[Jun 07 02:28 PM]: Found 10,415 high quality predictions from Augustus (>90% exon evidence)
[Jun 07 02:28 PM]: Running SNAP gene prediction, using training data: fun/predict_misc/final_training_models.gff3
[Jun 07 03:04 PM]: 101,855 predictions from SNAP
[Jun 07 03:04 PM]: Running GlimmerHMM gene prediction, using training data: fun/predict_misc/final_training_models.gff3
[Jun 07 04:25 PM]: 111,590 predictions from GlimmerHMM
[Jun 07 04:26 PM]: Summary of gene models passed to EVM (weights):
[Jun 07 04:26 PM]: EVM: partitioning input to ~ 35 genes per partition using min 1500 bp interval
Traceback (most recent call last):
  File "/home/miniconda3/envs/mamba/envs/fun/lib/python3.8/site-packages/funannotate/aux_scripts/funannotate-runEVM.py", line 480, in
    cmdinfo = create_partitions(args.fasta, args.genes, partitions,
  File "/home/miniconda3/envs/mamba/envs/fun/lib/python3.8/site-packages/funannotate/aux_scripts/funannotate-runEVM.py", line 138, in create_partitions
    interProteins = exonerate_blocks_to_interlap(proteins)
  File "/home/miniconda3/envs/mamba/envs/fun/lib/python3.8/site-packages/funannotate/aux_scripts/funannotate-runEVM.py", line 46, in exonerate_blocks_to_interlap
    coords.append(int(cols[3]))
IndexError: list index out of range
Source         Weight   Count
Augustus       1        25713
Augustus HiQ   2        10415
GeneMark       1        57909
GlimmerHMM     1        111590
pasa           6        44806
snap           1        101855
Total          -        352288
[Jun 07 04:27 PM]: Evidence modeler has failed, exiting
Traceback (most recent call last):
  File "/home/miniconda3/envs/mamba/envs/fun/bin/funannotate", line 10, in
    sys.exit(main())
  File "/home/miniconda3/envs/mamba/envs/fun/lib/python3.8/site-packages/funannotate/funannotate.py", line 716, in main
    mod.main(arguments)
  File "/home/miniconda3/envs/mamba/envs/fun/lib/python3.8/site-packages/funannotate/predict.py", line 2624, in main
    os.remove(EVM_out)
FileNotFoundError: [Errno 2] No such file or directory: '/home/06.genomeAnno/02.fun/fun/predict_misc/evm.round1.gff3'
/var/spool/slurm/d/job1350207/slurm_script: line 46: 112365 Bus error (core dumped) funannotate predict -i MyAssembly.fa -o fun --species "Dun dac" --cpus 16 --max_intronlen 500000 --busco_db actinopterygii --organism other --protein_alignments /home/06.genomeAnno/05.miniprot/aln_Danio_rerio.gff --repeats2evm --force

Here is my test log:

[Jun 06 10:14 PM]: OS: CentOS Linux 7, 128 cores, ~ 528 GB RAM. Python: 3.8.15
[Jun 06 10:14 PM]: Running funannotate v1.8.15
[Jun 06 10:14 PM]: Skipping CodingQuarry as no --rna_bam passed
[Jun 06 10:14 PM]: Parsed training data, run ab-initio gene predictors as follows:
Program      Training-Method
augustus     pretrained
genemark     selftraining
glimmerhmm   busco
snap         busco
[Jun 06 10:14 PM]: Loading genome assembly and parsing soft-masked repetitive sequences
[Jun 06 10:14 PM]: Genome loaded: 6 scaffolds; 3,776,588 bp; 19.75% repeats masked
[Jun 06 10:15 PM]: Mapping 1,065 proteins to genome using diamond and exonerate
[Jun 06 10:15 PM]: Found 1,505 preliminary alignments with diamond in 0:00:01 --> generated FASTA files for exonerate in 0:00:00
[Jun 06 10:15 PM]: Exonerate finished in 0:00:12: found 1,270 alignments
[Jun 06 10:15 PM]: Running GeneMark-ES on assembly
[Jun 06 10:16 PM]: 1,564 predictions from GeneMark
[Jun 06 10:16 PM]: Running BUSCO to find conserved gene models for training ab-initio predictors
[Jun 06 10:20 PM]: 370 valid BUSCO predictions found, validating protein sequences
[Jun 06 10:20 PM]: 367 BUSCO predictions validated
[Jun 06 10:20 PM]: Running Augustus gene prediction using saccharomyces parameters
[Jun 06 10:21 PM]: 1,485 predictions from Augustus
[Jun 06 10:22 PM]: Pulling out high quality Augustus predictions
[Jun 06 10:22 PM]: Found 371 high quality predictions from Augustus (>90% exon evidence)
[Jun 06 10:22 PM]: Running SNAP gene prediction, using training data: annotate/predict_misc/busco.final.gff3
[Jun 06 10:22 PM]: 1,511 predictions from SNAP
[Jun 06 10:22 PM]: Running GlimmerHMM gene prediction, using training data: annotate/predict_misc/busco.final.gff3
[Jun 06 10:25 PM]: 1,779 predictions from GlimmerHMM
[Jun 06 10:25 PM]: Summary of gene models passed to EVM (weights):
[Jun 06 10:25 PM]: EVM: partitioning input to ~ 35 genes per partition using min 1500 bp interval
[Jun 06 10:26 PM]: Converting to GFF3 and collecting all EVM results
Source         Weight   Count
Augustus       1        1325
Augustus HiQ   2        372
GeneMark       1        1564
GlimmerHMM     1        1779
snap           1        1511
Total          -        6551
[Jun 06 10:26 PM]: 1,718 total gene models from EVM
[Jun 06 10:26 PM]: Generating protein fasta files from 1,718 EVM models
[Jun 06 10:26 PM]: now filtering out bad gene models (< 50 aa in length, transposable elements, etc).
[Jun 06 10:26 PM]: Found 112 gene models to remove: 0 too short; 0 span gaps; 112 transposable elements
[Jun 06 10:26 PM]: 1,606 gene models remaining
[Jun 06 10:26 PM]: Predicting tRNAs
[Jun 06 10:26 PM]: 112 tRNAscan models are valid (non-overlapping)
[Jun 06 10:26 PM]: Generating GenBank tbl annotation file
[Jun 06 10:26 PM]: Collecting final annotation files for 1,718 total gene models
[Jun 06 10:26 PM]: Converting to final Genbank format
[Jun 06 10:27 PM]: Funannotate predict is finished, output files are in the annotate/predict_results folder
[Jun 06 10:27 PM]: Your next step might be functional annotation, suggested commands:
-------------------------------------------------------1
Run InterProScan (manual install):
funannotate iprscan -i annotate -c 16

Run antiSMASH (optional):
funannotate remote -i annotate -m antismash -e youremail@server.edu

Annotate Genome:
funannotate annotate -i annotate --cpus 16 --sbt yourSBTfile.txt
-------------------------------------------------------1
[Jun 06 10:27 PM]: Training parameters file saved: annotate/predict_results/saccharomyces.parameters.json
[Jun 06 10:27 PM]: Add species parameters to database:

funannotate species -s saccharomyces -a annotate/predict_results/saccharomyces.parameters.json

I have no idea why the error happens here; could you please give me some suggestions?

hyphaltip commented 1 year ago

Potentially an out-of-memory error. Does funannotate test succeed for the predict step on its own?

/var/spool/slurm/d/job1350207/slurm_script: line 46: 112365 Bus error (core dumped)

sqwwww commented 1 year ago

hi hyphaltip,

Yes, funannotate test succeeds for the predict step on its own.

I have figured it out. The problem is related to the format of the protein alignment file. I used the GFF generated by miniprot and passed it to the --protein_alignments parameter, and reading that file failed. When I instead passed the protein FASTA directly to --protein_evidence, it succeeded.
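A file like this can be screened before a long run. The sketch below is not funannotate's parser, and the thread does not confirm which lines actually broke it. It only assumes that a GFF3 feature line should have nine tab-separated columns, and it reports lines that do not, which is the kind of line that would make an index such as cols[3] fail. The helper name short_gff_lines and the example rows are made up for illustration.

```python
def short_gff_lines(lines, min_cols=9):
    """Yield (line_number, n_cols) for non-comment lines with too few columns."""
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if not text.strip() or text.startswith("#"):
            continue  # skip blank lines and comment/directive lines
        n_cols = len(text.split("\t"))
        if n_cols < min_cols:
            yield number, n_cols

# Made-up example rows; the third one uses spaces instead of tabs.
example = [
    "##gff-version 3\n",
    "chr1\tminiprot\tmRNA\t100\t900\t.\t+\t.\tID=MP000001\n",
    "chr1 miniprot CDS 100 300 . + 0 Parent=MP000001\n",
]
print(list(short_gff_lines(example)))
```

For a real file, the same function can be fed an open file handle, for example the aln_Danio_rerio.gff passed to --protein_alignments above.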

I have one more question, about running funannotate update. I used a MySQL database for the PASA training; however, the funannotate update log says "PASA database is SQLite", which confuses me a lot. This matters because multithreading depends on the database type (a MySQL database can be used with multithreading).

I can find a file called fun/training/pasa/__pasa_Dun_dac_pasa_mysql_chkpts and also a file called fun/update_misc/pasa/__pasa_Dun_dac_pasa_mysql_chkpts, which convinces me that the training was done with a MySQL database.

But I found a line in funannotate-update.log saying that the PASA database is SQLite. Maybe it's a misleading message.

[06/09/23 01:34:43]: PASA database is SQLite: Dun_dac_pasa

hyphaltip commented 1 year ago

I think you need to pass this option to force MySQL/MariaDB

--pasa_db mysql
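After rerunning with that option, the backend line in funannotate-update.log can be checked. This sketch assumes the message keeps the "PASA database is ..." wording shown above; the exact text printed for MySQL is not shown in this thread, so the pattern accepts any word. pasa_backends is a hypothetical helper.

```python
import re

def pasa_backends(log_text):
    """Return (backend, database_name) pairs found in the log text."""
    return re.findall(r"PASA database is (\w+): (\S+)", log_text)

log = "[06/09/23 01:34:43]: PASA database is SQLite: Dun_dac_pasa"
print(pasa_backends(log))
```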