Closed estolle closed 5 years ago
Hi Eckart. It seems like it's getting hung up when parsing your Augustus GFF3 file. Funannotate runs Augustus in parallel and then merges the output, so I don't think letting it re-run Augustus will take that much time. To make sure your training parameters are in a format that funannotate can use, try funannotate species and verify that Euglossa_dilemma is in that list. Then go into the funannotate output directory and delete the Augustus-related files in funannotate_out/predict_misc; these should all start with augustus. By default funannotate re-uses any valid files that are found in this folder, so when changing settings you need to manually clear a few files. Then you can try to re-run your command like this (just leave out the Augustus results), which will re-use existing data and then run Augustus:
funannotate predict \
-i $FOLDER/Edil.repeats.softmasked.pb.fa \
-o funannotate.test2 \
-s "Euglossa dilemma" \
--genemark_gtf $FOLDER/genemark.ES.gtf \
--augustus_species Euglossa_dilemma \
--stringtie /scratch/ek/euglossa/euglossa_TX_assembly4_stringtie_genomeguided/eug33Tx_vs_Edil_v1_Tx.only_ref_overlapping.annotated.gtf \
--rna_bam /scratch/ek/euglossa/2018_03_01_hisat2_mapping/merged_bams/euglossa.all.hisat2.bam \
--other_gff $FOLDER/snap.predicted.gff \
--busco_db hymenoptera \
--cpus 100 \
--SeqCenter XXXX \
--repeat_filter overlap blast \
--organism other \
--ploidy 1 \
--protein_evidence \
$FUNANNOTATE_DB/uniprot_sprot.fasta \
/scratch/genomes/genomes_original/Emex/GCF_001483705.1_ASM148370v1_protein.faa \
/scratch/genomes/genomes_original/Mqua/GCA_001276565.1_ASM127656v1_protein.faa \
/scratch/genomes/genomes_original/Bter/GCF_000214255.1_Bter_1.0_protein.faa \
/scratch/genomes/genomes_original/Amel/GCF_003254395.2_Amel_HAv3.1/GCF_003254395.2_Amel_HAv3.1_protein.faa \
/scratch/genomes/genomes_original/Bimp/GCF_000188095.2_BIMP_2.1_protein.faa \
--transcript_evidence \
$FOLDER/dil_vir_merged_transcriptome.fa \
/scratch/ek/euglossa/euglossa_TX_assembly2.4samples/BinPacker.fa \
/scratch/genomes/genomes_original/Edil/Edil_v1.0_transcripts.fa \
/scratch/genomes/genomes_original/Emex/GCF_001483705.1_ASM148370v1_rna.fna \
/scratch/genomes/genomes_original/Mqua/GCA_001276565.1_ASM127656v1_rna_from_genomic.fna \
/scratch/genomes/genomes_original/Bter/GCF_000214255.1_Bter_1.0_rna.fna \
/scratch/genomes/genomes_original/Amel/GCF_003254395.2_Amel_HAv3.1/GCF_003254395.2_Amel_HAv3.1_rna.fna \
/scratch/genomes/genomes_original/Bimp/GCF_000188095.2_BIMP_2.1_rna.fna
Let me know if you still get the same error when funannotate runs Augustus itself.
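The cleanup step described above could be scripted roughly as follows. This is only a sketch: it assumes the cached files sit directly inside the predict_misc folder and that their names start with augustus, as stated above. The function name clear_augustus_cache is made up for illustration and is not part of funannotate.

```shell
#!/usr/bin/env bash
# Sketch: remove cached Augustus outputs so that funannotate predict
# re-runs Augustus instead of re-using old files.
# Assumption: cached files live directly in <outdir>/predict_misc and
# their names start with "augustus".
clear_augustus_cache() {
    local misc_dir="$1/predict_misc"
    if [ ! -d "$misc_dir" ]; then
        echo "no such directory: $misc_dir" >&2
        return 1
    fi
    # Print each matching entry, then delete it; other cached files stay.
    find "$misc_dir" -maxdepth 1 -name 'augustus*' -print -exec rm -rf -- {} +
}

# Example call (output folder name as used above):
# clear_augustus_cache funannotate_out
```

Printing the matches before deleting makes it easy to see what was cleared if a later run behaves unexpectedly.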
Thanks for the quick answer.
In the meantime I tried a few things to find out whether the input data could be causing trouble (some of the protein/transcript evidence comes from related species).
With reduced input it ran through! =) Augustus ran very fast indeed. The Euglossa_dilemma species is the one I added myself after training on the ~4000 BUSCOs (it is in the list and works).
There were 2 issues I could fix (perhaps worth noting in the INSTALL prerequisites). First, I had to specify QUARRY_PATH=/opt/CodingQuarry_v2.0/QuarryFiles so that funannotate could proceed with running CodingQuarry. Second, GeneMark failed to run, but I could run it manually, though only after I reduced the number of cores specified with --cores to 60 (from --cores=100). Looking into the GeneMark script (/opt/gm_et_linux_64/gmes_petap/gmes_petap.pl), I found a check that throws an error saying --cores is out of range when cores > 64. I edited the file and can confirm it works manually (thus it should work within funannotate as well).
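Both workarounds can be written down as a small shell snippet. This is a sketch under assumptions: cap_cores is a made-up helper, the limit of 64 is the value from the check in gmes_petap.pl described above, and the QUARRY_PATH value is the one from this installation and will differ elsewhere.

```shell
#!/usr/bin/env bash
# Sketch: cap the core count handed to GeneMark at 64, the limit that
# the check in gmes_petap.pl enforces according to the post above.
# cap_cores is a hypothetical helper, not part of funannotate or GeneMark.
cap_cores() {
    local requested="$1"
    local limit="${2:-64}"
    if [ "$requested" -gt "$limit" ]; then
        echo "$limit"
    else
        echo "$requested"
    fi
}

# CodingQuarry needs QUARRY_PATH; this path is specific to this machine.
export QUARRY_PATH=/opt/CodingQuarry_v2.0/QuarryFiles

GM_CORES=$(cap_cores 100)
echo "cores to pass to GeneMark via --cores: $GM_CORES"
```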
My reduced command:
funannotate predict \
-i $FOLDER/Edil.repeats.softmasked.pb.fa \
-o funannotate.test3 \
-s "Euglossa dilemma" \
--augustus_species Euglossa_dilemma \
--rna_bam /scratch/ek/euglossa/2018_03_01_hisat2_mapping/merged_bams/euglossa.all.hisat2.bam \
--busco_db hymenoptera \
--cpus 100 \
--SeqCenter XXXX \
--repeat_filter overlap blast \
--organism other \
--ploidy 1 \
--protein_evidence \
$FUNANNOTATE_DB/uniprot_sprot.fasta \
/scratch/genomes/genomes_original/Emex/GCF_001483705.1_ASM148370v1_protein.faa \
/scratch/genomes/genomes_original/Mqua/GCA_001276565.1_ASM127656v1_protein.faa \
/scratch/genomes/genomes_original/Bter/GCF_000214255.1_Bter_1.0_protein.faa \
/scratch/genomes/genomes_original/Amel/GCF_003254395.2_Amel_HAv3.1/GCF_003254395.2_Amel_HAv3.1_protein.faa \
/scratch/genomes/genomes_original/Bimp/GCF_000188095.2_BIMP_2.1_protein.faa \
--transcript_evidence \
$FOLDER/dil_vir_merged_transcriptome.fa \
/scratch/ek/euglossa/euglossa_TX_assembly2.4samples/BinPacker.fa \
/scratch/genomes/genomes_original/Edil/Edil_v1.0_transcripts.fa
CodeQuarry models (10): 27,282
Augustus models (1): 2,437
Genemark models (1): 0
HiQ models (5): 19
Pasa models (1): 0
Total models: 29,738
[01:46 AM]: Generating protein fasta files from 17,905 EVM models
[01:46 AM]: now filtering out bad gene models (< 50 aa in length, transposable elements, etc).
[01:46 AM]: Found 3,043 gene models to remove: 72 too short; 19 span gaps; 3,899 transposable elements
[01:46 AM]: 14,862 gene models remaining
Do these numbers sound OK to you? I am a little puzzled that there are only 19 HQ Augustus predictions (in the augustus GFF I can see that most predictions are supported by maybe around 50% evidence (length covered)). I expected more, particularly since BUSCO found >3500 complete single copy orthologs. Also the total Augustus predictions seem too low given the previous BUSCO numbers. Do you know a way to improve this?
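As a quick sanity check, the numbers in the log above can be re-added in the shell. All values are copied from the log. The per-category removal counts add up to more than the reported total; that would be consistent with some models being flagged in more than one category, but this is an assumption, as the log does not say so.

```shell
#!/usr/bin/env bash
# Numbers copied from the funannotate log above.
codingquarry=27282; augustus=2437; genemark=0; hiq=19; pasa=0
evm_models=17905
removed=3043
too_short=72; span_gaps=19; transposable=3899

# Input models passed to EVM (the log reports 29,738 in total).
echo "input models: $((codingquarry + augustus + genemark + hiq + pasa))"
# Models left after filtering (the log reports 14,862).
echo "remaining:    $((evm_models - removed))"
# Sum of the individual removal categories (larger than 3,043).
echo "category sum: $((too_short + span_gaps + transposable))"
```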
Thanks for the feedback. Just pushed https://github.com/nextgenusfs/funannotate/commit/f3e9f3b75fb4bb8450f24c7d14d2c4d985055d95, which will limit GeneMark to 64 CPUs. Could you retry and see if it runs now?
CodingQuarry may not be helpful for you. I know that it over-predicts in fungi (it also tends to produce short gene models due to the StringTie training set). With smaller genomes the tradeoff of finding a few novel genes might be worth it, but with higher eukaryotes and larger genomes I'm not sure how helpful it is, so I haven't made it a core part of funannotate. If you re-run and let it run GeneMark, then the number of models predicted by GeneMark should tell you how much CodingQuarry may be over-predicting.
I would also like to see what happens if you let funannotate train Augustus. There are a lot of checks and double-checks that funannotate runs to filter the training sets, so this might give you a better result.
You may also think about using the funannotate train module to run PASA. I find that the PASA data are typically better at training Augustus than the BUSCO results.
Thanks a lot for fixing the GeneMark CPU option. It works now.
As you suggested, I ran funannotate again and let it train Augustus:
funannotate predict \
-i $FOLDER/Edil.repeats.softmasked.pb.fa \
-o funannotate.test4 \
-s "Euglossa viridissima" \
--busco_seed_species Euglossa_dilemma \
--rna_bam /scratch/ek/euglossa/2018_03_01_hisat2_mapping/merged_bams/euglossa.all.hisat2.bam \
--busco_db hymenoptera \
--cpus 100 \
--genemark_mode ES \
--SeqCenter XXXX \
--repeat_filter overlap blast \
--organism other \
--ploidy 1 \
--stringtie /scratch/ek/euglossa/euglossa_TX_assembly4_stringtie_genomeguided/eug33Tx_vs_Edil_v1_Tx.only_ref_overlapping.annotated.gtf \
--other_gff $FOLDER/snap.predicted.gff \
--protein_evidence \
$FUNANNOTATE_DB/uniprot_sprot.fasta \
/scratch/genomes/genomes_original/Emex/GCF_001483705.1_ASM148370v1_protein.faa \
/scratch/genomes/genomes_original/Mqua/GCA_001276565.1_ASM127656v1_protein.faa \
/scratch/genomes/genomes_original/Bter/GCF_000214255.1_Bter_1.0_protein.faa \
/scratch/genomes/genomes_original/Amel/GCF_003254395.2_Amel_HAv3.1/GCF_003254395.2_Amel_HAv3.1_protein.faa \
/scratch/genomes/genomes_original/Bimp/GCF_000188095.2_BIMP_2.1_protein.faa \
--transcript_evidence \
$FOLDER/dil_vir_merged_transcriptome.fa \
/scratch/ek/euglossa/euglossa_TX_assembly2.4samples/BinPacker.fa \
/scratch/genomes/genomes_original/Edil/Edil_v1.0_transcripts.fa \
/scratch/genomes/genomes_original/Emex/GCF_001483705.1_ASM148370v1_rna.fna
[04:48 AM]: Setting up EVM partitions
[04:57 AM]: Generating EVM command list
[04:57 AM]: Running EVM commands with 99 CPUs
[05:09 AM]: Combining partitioned EVM outputs
[05:09 AM]: Converting EVM output to GFF3
[05:42 AM]: Collecting all EVM results
[05:42 AM]: 42,756 total gene models from EVM
[05:42 AM]: Generating protein fasta files from 42,756 EVM models
[05:42 AM]: now filtering out bad gene models (< 50 aa in length, transposable elements, etc).
[05:43 AM]: Found 9,311 gene models to remove: 41 too short; 34 span gaps; 10,725 transposable elements
[05:43 AM]: 33,445 gene models remaining
[05:43 AM]: Predicting tRNAs
[11:52 AM]: Found 185 tRNA gene models
[11:52 AM]: 123 tRNAscan models are valid (non-overlapping)
[11:52 AM]: Generating GenBank tbl annotation file
[11:53 AM]: Converting to final Genbank format
[12:06 PM]: Collecting final annotation files for 33,568 total gene models
[12:06 PM]: Funannotate predict is finished, output files are in the funannotate.test4/predict_results folder
[12:06 PM]: Your next step to capture UTRs and update annotation using PASA:
funannotate update -i funannotate.test4 --cpus 100 \
--left /scratch/ek/reads/euglossa/euglossa.R1.fastq.gz \
--right /scratch/ek/reads/euglossa/euglossa.R2.fastq.gz \
--jaccard_clip --memory 200G --stranded no --pasa_db mysql \
--species "Euglossa viridissima2" --max_intronlen 5000 --pasa_alignment_overlap 50.0 --coverage 50 \
--trinity /scratch/ek/euglossa/euglossa.annotation.phil.brand/dil_vir_merged_transcriptome.fa
[05:47 AM]: Manually edit the tbl file funannotate.test4/update_results/Euglossa_viridissima2.tbl, then run:
funannotate fix -i funannotate.test4/update_results/Euglossa_viridissima2.gbk -t funannotate.test4/update_results/Euglossa_viridissima2.tbl
[05:47 AM]: After the problematic gene models are fixed, you can proceed with functional annotation.
[05:47 AM]: Your next step might be functional annotation, suggested commands:
I also tried running it with training (PASA) first:
funannotate train --cpus 100 --species "Euglossa viridissima2" --max_intronlen 5000 --pasa_alignment_overlap 50.0 --pasa_db mysql --coverage 50 --memory 200G --stranded no \
--trinity /scratch/ek/euglossa/euglossa.annotation.phil.brand/dil_vir_merged_transcriptome.fa \
--input /scratch/ek/euglossa/euglossa.annotation.phil.brand/funannotate.test1/predict_misc/genome.softmasked.fa \
--out funannotate.train1 \
--left /scratch/ek/reads/euglossa/euglossa.R1.fastq.gz \
--right /scratch/ek/reads/euglossa/euglossa.R2.fastq.gz
[11:29 PM]: OS: linux2, 112 cores, ~ 528 GB RAM. Python: 2.7.12
[11:29 PM]: Running funannotate v1.5.0
[11:29 PM]: Trimmomatic will be skipped
[11:29 PM]: Read normalization will be skipped
[11:29 PM]: Parsing assembled trinity data : /scratch/ek/euglossa/euglossa.annotation.phil.brand/dil_vir_merged_transcriptome.fa
[11:29 PM]: Converting transcript alignments to GFF3 format
[11:30 PM]: Converting Trinity transcript alignments to GFF3 format
[11:30 PM]: Running PASA alignment step using 118,041 transcripts
[04:30 AM]: PASA assigned 85,152 transcipts to 84,408 loci (genes)
[04:30 AM]: Getting PASA models for training with TransDecoder
[04:36 AM]: PASA finished. PASAweb accessible via: localhost:port/cgi-bin/index.cgi?db=Euglossa_viridissima2
[04:36 AM]: Using Kallisto TPM data to determine which PASA gene models to select at each locus
[04:36 AM]: Building Kallisto index
[04:38 AM]: Mapping reads using pseudoalignment in Kallisto
[05:54 AM]: Parsing expression value results. Keeping best transcript at each locus.
[05:54 AM]: Wrote 9,911 PASA gene models
[05:54 AM]: PASA database name: Euglossa_viridissima2
[05:54 AM]: Trinity/PASA has completed, you are now ready to run funanotate predict, for example:
funannotate predict \
-i /scratch/ek/euglossa/euglossa.annotation.phil.brand/funannotate.test1/predict_misc/genome.softmasked.fa \
-o funannotate.train1 \
-s "Euglossa viridissima2" \
--busco_seed_species Euglossa_dilemma \
--rna_bam /scratch/ek/euglossa/2018_03_01_hisat2_mapping/merged_bams/euglossa.all.hisat2.bam \
--busco_db hymenoptera \
--cpus 100 \
--SeqCenter XXXX \
--repeat_filter overlap blast \
--organism other \
--ploidy 1 \
--stringtie /scratch/ek/euglossa/euglossa_TX_assembly4_stringtie_genomeguided/eug33Tx_vs_Edil_v1_Tx.only_ref_overlapping.annotated.gtf \
--other_gff $FOLDER/snap.predicted.gff \
--protein_evidence \
$FUNANNOTATE_DB/uniprot_sprot.fasta \
/scratch/genomes/genomes_original/Emex/GCF_001483705.1_ASM148370v1_protein.faa \
/scratch/genomes/genomes_original/Mqua/GCA_001276565.1_ASM127656v1_protein.faa \
/scratch/genomes/genomes_original/Bter/GCF_000214255.1_Bter_1.0_protein.faa \
/scratch/genomes/genomes_original/Amel/GCF_003254395.2_Amel_HAv3.1/GCF_003254395.2_Amel_HAv3.1_protein.faa \
/scratch/genomes/genomes_original/Bimp/GCF_000188095.2_BIMP_2.1_protein.faa \
--transcript_evidence \
$FOLDER/dil_vir_merged_transcriptome.fa \
/scratch/ek/euglossa/euglossa_TX_assembly2.4samples/BinPacker.fa \
/scratch/genomes/genomes_original/Edil/Edil_v1.0_transcripts.fa \
/scratch/genomes/genomes_original/Emex/GCF_001483705.1_ASM148370v1_rna.fna
[10:13 PM]: Setting up EVM partitions
[10:22 PM]: Generating EVM command list
[10:22 PM]: Running EVM commands with 99 CPUs
[10:35 PM]: Combining partitioned EVM outputs
[10:35 PM]: Converting EVM output to GFF3
[11:07 PM]: Collecting all EVM results
[11:07 PM]: 44,935 total gene models from EVM
[11:07 PM]: Generating protein fasta files from 44,935 EVM models
[11:07 PM]: now filtering out bad gene models (< 50 aa in length, transposable elements, etc).
[11:08 PM]: Found 9,328 gene models to remove: 40 too short; 37 span gaps; 10,748 transposable elements
[11:08 PM]: 35,607 gene models remaining
[11:08 PM]: Predicting tRNAs
[04:57 AM]: Found 185 tRNA gene models
[04:57 AM]: 128 tRNAscan models are valid (non-overlapping)
[04:57 AM]: Generating GenBank tbl annotation file
[04:57 AM]: Converting to final Genbank format
[05:11 AM]: Collecting final annotation files for 35,735 total gene models
[05:11 AM]: Funannotate predict is finished, output files are in the funannotate.train1/predict_results folder
[05:11 AM]: Your next step to capture UTRs and update annotation using PASA:
funannotate update -i funannotate.train1 --cpus 100 \
--left /scratch/ek/reads/euglossa/euglossa.R1.fastq.gz \
--right /scratch/ek/reads/euglossa/euglossa.R2.fastq.gz \
--jaccard_clip --memory 200G --stranded no --pasa_db mysql \
--species "Euglossa viridissima2" --max_intronlen 5000 --pasa_alignment_overlap 50.0 --coverage 50 \
--trinity /scratch/ek/euglossa/euglossa.annotation.phil.brand/dil_vir_merged_transcriptome.fa
After running this last step (funannotate update) I got an error (funannotate update with basically the same commands/inputs worked before in the first example I gave above):
[05:27 PM]: OS: linux2, 112 cores, ~ 528 GB RAM. Python: 2.7.12
[05:27 PM]: Running funannotate v1.5.0
[05:27 PM]: No NCBI SBT file given, will use default, for NCBI submissions pass one here '--sbt'
[05:27 PM]: Found relevent files in funannotate.train1/training, will re-use them:
Forward reads: funannotate.train1/training/left.fq.gz
Reverse reads: funannotate.train1/training/right.fq.gz
Trinity results: funannotate.train1/training/funannotate_train.trinity-GG.fasta
PASA config file: funannotate.train1/training/pasa/alignAssembly.txt
[05:28 PM]: Reannotating Euglossa viridissima2, NCBI accession: None
[05:28 PM]: Previous annotation consists of: 35,607 protein coding gene models and 128 non-coding gene models
[05:28 PM]: Trimmomatic will be skipped
[05:28 PM]: Read normalization will be skipped
Traceback (most recent call last):
File "/opt/funnotate/funannotate-1.5.1/bin/funannotate-update.py", line 1866, in
Do you know what this means? I am a bit puzzled about the error, since it worked before with the same command and input, although that was on the other run, where I only trained Augustus (no PASA involved).
Another question for me was whether I should run GeneMark ES or ET. I ran both and they find 40k vs 28k or 29k models, so the difference is not big. Do you know if the GeneMark predictions are a lot better when they come from ET (with transcript evidence)? It also seems that GeneMark finds many more gene models than CodingQuarry, almost twice as many. In the end around 35k remain. I have to look up whether this is a reasonable number.
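One way to compare the predictors directly is to count distinct genes in each output file. A minimal sketch, assuming that every feature line in the GeneMark GTF carries a gene_id attribute and that the GFF3 files contain explicit gene features in column 3. The function names are made up for illustration, and the file names in the example calls are placeholders.

```shell
#!/usr/bin/env bash
# Count distinct gene_id values in a GTF file (e.g. GeneMark output).
# Assumption: each feature line has an attribute of the form gene_id "...".
count_gtf_genes() {
    grep -v '^#' "$1" | grep -o 'gene_id "[^"]*"' | sort -u | wc -l | tr -d ' '
}

# Count features whose type (column 3) is "gene" in a GFF3 file.
count_gff3_genes() {
    awk -F'\t' '!/^#/ && $3 == "gene" { n++ } END { print n + 0 }' "$1"
}

# Example calls (placeholder file names):
# count_gtf_genes genemark.ES.gtf
# count_gtf_genes genemark.ET.gtf
# count_gff3_genes predictions.gff3
```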
Sorry the formatting is off; the formatting options in the web browser form are not very clear, and GitHub doesn't let me edit my posts.
Hi there
Thanks for developing this great tool for annotation!
I have been trying for a while to annotate a non-model organism genome (orchid bee, >500 Mb assembly, rather fragmented, i.e. lots of smallish contigs). We have RNA-seq reads from a few dozen individuals and several transcriptome assemblies based on those reads (Trinity, BinPacker, StringTie). I trained Augustus for the species based on ca. 4000 Hymenoptera BUSCOs, and I am supplying additional evidence (transcripts, proteins) from more or less closely related species. I also have GeneMark-ES and Augustus predictions from separate runs.
Repeat masking runs fine, but at the predict step I run into problems with GeneMark and/or Augustus (as far as I can see, the relevant folders/executables/scripts are in the PATH and the funannotate environment variables are set properly). If I do not supply the GeneMark GTF from a separate run, it fails to run GeneMark. The same goes for Augustus. Supplying the GTF/GFF from separate runs of both programs still gives an error (index out of range), plus a message that no high-quality Augustus predictions were found, perhaps because the hints/evidence mostly do not cover the predicted transcripts to 90% or greater (see below). Do you have any idea what the "index out of range" means and how I can fix or circumvent it? Also, how can I tell funannotate to be less strict about the HQ predictions?
Thanks in advance Eckart
System: Ubuntu 16.04, all dependencies/env vars installed (only ete3 has some Python-related warnings).
Commandline output while running funannotate
The predict.log says this:
My command was:
When I run Augustus on its own, it seems to work fine (I use this Augustus installation for BUSCO and have already trained it for my species). Following your suggestion in a previous GitHub issue, I am running Augustus right now like this (it runs fine so far, but the hints/evidence do not seem to cover the transcripts very well):
Evidence for and against this transcript: