estolle commented 5 years ago

Hi there

Thanks for developing this great tool for annotation!

I am trying since a bit to annotate a non-model organism genome (orchid bee, >500Mb assembly, rather fragmented, i.e. lots of small'ish contigs) and we have RNAseq reads of a few dozen individuals, several transcriptome assemblies based on those RNAseq reads (trinity, binpacker, stringtie), I trained Augustus for the species based on ca 4000 Hymenoptera BUSCOs and I am supplying additional evidence (transcripts, proteins) from more or less closely related species. I got genemark-ES predictions as well as Augustus predictions separately.

Repeatmasking runs fine but at the predict step I run into problems with genemark and/or Augustus (as far as I can see the relevant folders/executables/scripts are in the PATH and the funannotate environmental variables are set properly). If I do not supply the genemark gtf from a separate run, it fails to run genemark. THe same for AUgustus. Now supplying the gtf/gff from separate runs of both programs still gives an error (index out of range) and that no high quality Augustus predictions were found, perhaps because the hints/evidence is mostly not covering the predicted transcripts to 90% or greater, see below). Do you have any idea what the "index out of range" means and how I can fix it or circumvent it? Also, how can I tell funannotate to be less strict about the HQ predictions?

Thanks in advance Eckart

System: Ubuntu 16.04 all dependencies/env vars installed (only ete3 has some python related warnings).

Commandline output while running funannotate

[09:49 PM]: OS: linux2, 112 cores, ~ 528 GB RAM. Python: 2.7.12
[09:49 PM]: Running funannotate v1.5.0
[09:49 PM]: AUGUSTUS (3.3) detected, version seems to be compatible with BRAKER and BUSCO
[09:49 PM]: Parsing GFF pass-through: /scratch/ek/euglossa/euglossa.annotation.phil.brand/snap.predicted.gff --> setting source to other_pred1
[09:49 PM]: Loading genome assembly and parsing soft-masked repetitive sequences
[09:50 PM]: Genome loaded: 22,698 scaffolds; 588,199,719 bp; 35.30% repeats masked
[09:50 PM]: Aligning transcript evidence to genome with minimap2
[09:56 PM]: Found 183,554 alignments, wrote GFF3 and Augustus hints to file
[09:56 PM]: Extracting hints from RNA-seq BAM file using bam2hints
[12:12 AM]: Mapping proteins to genome using Diamond blastx/Exonerate
[12:12 AM]: Using 645,086 proteins as queries
[12:12 AM]: Running Diamond pre-filter search
[12:20 AM]: Found 816,608 preliminary alignments
[12:44 AM]: Exonerate finished: found 31,634 alignments
[12:46 AM]: Pulling out high quality Augustus predictions
Traceback (most recent call last):
  File "/opt/funnotate/funannotate-1.5.1/bin/funannotate-predict.py", line 1074, in <module>
    if float(values[1]) > 89: #greater than ~90% of exons supported, this is really stringent which is what we want here, as we are going to weight these models 5 to 1 over genemark
IndexError: list index out of range

The predict.log says this:

[01/05/19 21:56:22]: Extracting hints from RNA-seq BAM file using bam2hints
[01/05/19 21:56:22]: /opt/augustus/bin/bam2hints --intronsonly --in /scratch/ek/euglossa/2018_03_01_hisat2_mapping/merged_bams/euglossa.all.hisat2.bam --out funannotate.test2/predict_misc/bam_hints.tmp
[01/06/19 00:12:08]: Wait a moment, calculating maximum block size that needs to be allocated... .. done
[01/06/19 00:12:19]: perl /opt/augustus/scripts/join_mult_hints.pl
[01/06/19 00:12:20]: perl /opt/funnotate/funannotate-1.5.1/util/BRAKER/filterIntronsFindStrand.pl /scratch/ek/euglossa/euglossa.annotation.phil.brand/funannotate.test2/predict_misc/genome.softmasked.fa funannotate.test2/predict_misc/bam_hints.joined.tmp --score
[01/06/19 00:12:34]: Mapping proteins to genome using Diamond blastx/Exonerate
[01/06/19 00:45:43]: perl /opt/augustus/scripts/join_mult_hints.pl
[01/06/19 00:45:48]: /opt/funnotate/funannotate-1.5.1/util/genemark_gtf2gff3.pl /scratch/ek/euglossa/euglossa.annotation.phil.brand/genemark.ES.gtf
[01/06/19 00:45:50]: perl /opt/EVidenceModeler-1.1.1/EvmUtils/misc/augustus_GFF3_to_EVM_GFF3.pl funannotate.test2/predict_misc/genemark.gff
[01/06/19 00:45:58]: perl /opt/EVidenceModeler-1.1.1/EvmUtils/misc/augustus_GFF3_to_EVM_GFF3.pl /scratch/ek/euglossa/euglossa.annotation.phil.brand/augustus.predicted.noStopInCDS.gff
[01/06/19 00:46:01]: Pulling out high quality Augustus predictions

My command was:

funannotate predict \
-i $FOLDER/Edil.repeats.softmasked.pb.fa \
-o funannotate.test2 \
-s "Euglossa dilemma" \
--genemark_gtf $FOLDER/genemark.ES.gtf \
--augustus_gff $FOLDER/augustus.predicted.noStopInCDS.gff \
--augustus_species Euglossa_dilemma \
--stringtie /scratch/ek/euglossa/euglossa_TX_assembly4_stringtie_genomeguided/eug33Tx_vs_Edil_v1_Tx.only_ref_overlapping.annotated.gtf \
--rna_bam /scratch/ek/euglossa/2018_03_01_hisat2_mapping/merged_bams/euglossa.all.hisat2.bam \
--other_gff $FOLDER/snap.predicted.gff \
--busco_db hymenoptera \
--cpus 100 \
--SeqCenter XXXX \
--repeat_filter overlap blast \
--organism other \
--ploidy 1 \
--protein_evidence \
$FUNANNOTATE_DB/uniprot_sprot.fasta \
/scratch/genomes/genomes_original/Emex/GCF_001483705.1_ASM148370v1_protein.faa \
/scratch/genomes/genomes_original/Mqua/GCA_001276565.1_ASM127656v1_protein.faa \
/scratch/genomes/genomes_original/Bter/GCF_000214255.1_Bter_1.0_protein.faa \
/scratch/genomes/genomes_original/Amel/GCF_003254395.2_Amel_HAv3.1/GCF_003254395.2_Amel_HAv3.1_protein.faa \
/scratch/genomes/genomes_original/Bimp/GCF_000188095.2_BIMP_2.1_protein.faa \
--transcript_evidence \
$FOLDER/dil_vir_merged_transcriptome.fa \
/scratch/ek/euglossa/euglossa_TX_assembly2.4samples/BinPacker.fa \
/scratch/genomes/genomes_original/Edil/Edil_v1.0_transcripts.fa \
/scratch/genomes/genomes_original/Emex/GCF_001483705.1_ASM148370v1_rna.fna \
/scratch/genomes/genomes_original/Mqua/GCA_001276565.1_ASM127656v1_rna_from_genomic.fna \
/scratch/genomes/genomes_original/Bter/GCF_000214255.1_Bter_1.0_rna.fna \
/scratch/genomes/genomes_original/Amel/GCF_003254395.2_Amel_HAv3.1/GCF_003254395.2_Amel_HAv3.1_rna.fna \
/scratch/genomes/genomes_original/Bimp/GCF_000188095.2_BIMP_2.1_rna.fna

When I run Augustus on its own, it seems to work fine (I use this Augustus instllation for BUSCO and trained it already for my species. Following your suggestion in a previous Github issue in here i am running Augustus right now like this (runs fine thus far, but hints/evidences are not covering the transcripts very well it seems):

augustus --species=Euglossa_dilemma --hintsfile=predict_misc/hints.ALL.gff  \
   --gff3=on --UTR=off --progress=true \
   --extrinsicCfgFile=$AUGUSTUS_CONFIG_PATH/extrinsic/extrinsic.E.XNT.cfg \
   --stopCodonExcludedFromCDS=False predict_misc/genome.softmasked.fa

Evidence for and against this transcript:

 % of transcript supported by hints (any source): 43.2
 CDS exons: 19/73
      E:   8 
    XNT:  12 
 CDS introns: 44/73
     E:  33 
    XNT:  11 
 5'UTR exons and introns: 0/0
 3'UTR exons and introns: 0/0
 hint groups fully obeyed: 52
      E:  39 
    XNT:  13 
 incompatible hint groups: 124
      E:  98 (minimap2_5307,minimap2_5308,minimap2_5310,minimap2_5311,minimap2_5312,minimap2_5313,...)
    XNT:  26 (RS14_TOBAC,RS14_CANAX,RS141_MAIZE)
 end gene g5

nextgenusfs commented 5 years ago

Hi Eckart. It seems like its getting hung up when parsing your Augustus GFF3 file. Funannotate runs Augustus is paraellel and then merges the output, so I don't think letting it re-run Augustus will take that much time. To make sure your training parameters are in a format that funanntoate can use, try funannotate species and verify that Euglossa_dilemma is in that list. Then go into the funannotate output directory and delete the augustus related files in funannotate_out/predict_misc -- these should all start with augustus.... By default funannotate re-uses any valid files that are found in this folder, so when changing settings you need to manually clear a few files. Then you can try to re-run your command like this (just leave out the Augustus results) which will re-use existing data and then run Augustus:

funannotate predict \
-i $FOLDER/Edil.repeats.softmasked.pb.fa \
-o funannotate.test2 \
-s "Euglossa dilemma" \
--genemark_gtf $FOLDER/genemark.ES.gtf \
--augustus_species Euglossa_dilemma \
--stringtie /scratch/ek/euglossa/euglossa_TX_assembly4_stringtie_genomeguided/eug33Tx_vs_Edil_v1_Tx.only_ref_overlapping.annotated.gtf \
--rna_bam /scratch/ek/euglossa/2018_03_01_hisat2_mapping/merged_bams/euglossa.all.hisat2.bam \
--other_gff $FOLDER/snap.predicted.gff \
--busco_db hymenoptera \
--cpus 100 \
--SeqCenter XXXX \
--repeat_filter overlap blast \
--organism other \
--ploidy 1 \
--protein_evidence \
$FUNANNOTATE_DB/uniprot_sprot.fasta \
/scratch/genomes/genomes_original/Emex/GCF_001483705.1_ASM148370v1_protein.faa \
/scratch/genomes/genomes_original/Mqua/GCA_001276565.1_ASM127656v1_protein.faa \
/scratch/genomes/genomes_original/Bter/GCF_000214255.1_Bter_1.0_protein.faa \
/scratch/genomes/genomes_original/Amel/GCF_003254395.2_Amel_HAv3.1/GCF_003254395.2_Amel_HAv3.1_protein.faa \
/scratch/genomes/genomes_original/Bimp/GCF_000188095.2_BIMP_2.1_protein.faa \
--transcript_evidence \
$FOLDER/dil_vir_merged_transcriptome.fa \
/scratch/ek/euglossa/euglossa_TX_assembly2.4samples/BinPacker.fa \
/scratch/genomes/genomes_original/Edil/Edil_v1.0_transcripts.fa \
/scratch/genomes/genomes_original/Emex/GCF_001483705.1_ASM148370v1_rna.fna \
/scratch/genomes/genomes_original/Mqua/GCA_001276565.1_ASM127656v1_rna_from_genomic.fna \
/scratch/genomes/genomes_original/Bter/GCF_000214255.1_Bter_1.0_rna.fna \
/scratch/genomes/genomes_original/Amel/GCF_003254395.2_Amel_HAv3.1/GCF_003254395.2_Amel_HAv3.1_rna.fna \
/scratch/genomes/genomes_original/Bimp/GCF_000188095.2_BIMP_2.1_rna.fna

Let me know if you are getting the same error then if funannotate runs Augustus.

estolle commented 5 years ago

Thanks for the quick answer.

i tried around in the meantime to also find out if the input data could cause trouble (some of the protein/transcript evidence comes from related species).

With reduced input it ran through! =) Augustus ran very fast indeed. the Euglossa_dilemma species is the one I added myself after training based on the ~4000 BUSCOs (it is in the list and works)

There were 2 issues I could fix (perhaps worth noting in the INSTALL prerequesites I had to specify QUARRY_PATH=/opt/CodingQuarry_v2.0/QuarryFiles so that funannotate could proceed with running CodingQuarry. Also, Genemark failed to run, but I could run it manually --> but only after I reduced the number of cores specified with --cores to 60 (from --cores=100). Looking into the genemark script (/opt/gm_et_linux_64/gmes_petap/gmes_petap.pl) I found an option where it checks whether cores > 64 and then throws and error that --cores is out of range. I edited the file and can confirm it works manually (thus should work within funannotate as well).

my reduced command:( funannotate predict \ -i $FOLDER/Edil.repeats.softmasked.pb.fa \ -o funannotate.test3 \ -s "Euglossa dilemma" \ --augustus_species Euglossa_dilemma \ --rna_bam /scratch/ek/euglossa/2018_03_01_hisat2_mapping/merged_bams/euglossa.all.hisat2.bam \ --busco_db hymenoptera \ --cpus 100 \ --SeqCenter XXXX \ --repeat_filter overlap blast \ --organism other \ --ploidy 1 \ --protein_evidence \ $FUNANNOTATE_DB/uniprot_sprot.fasta \ /scratch/genomes/genomes_original/Emex/GCF_001483705.1_ASM148370v1_protein.faa \ /scratch/genomes/genomes_original/Mqua/GCA_001276565.1_ASM127656v1_protein.faa \ /scratch/genomes/genomes_original/Bter/GCF_000214255.1_Bter_1.0_protein.faa \ /scratch/genomes/genomes_original/Amel/GCF_003254395.2_Amel_HAv3.1/GCF_003254395.2_Amel_HAv3.1_protein.faa \ /scratch/genomes/genomes_original/Bimp/GCF_000188095.2_BIMP_2.1_protein.faa \ --transcript_evidence \ $FOLDER/dil_vir_merged_transcriptome.fa \ /scratch/ek/euglossa/euglossa_TX_assembly2.4samples/BinPacker.fa \ /scratch/genomes/genomes_original/Edil/Edil_v1.0_transcripts.fa \ )

Some results:

CodeQuarry models (10): 27,282 Augustus models (1): 2,437 Genemark models (1): 0 HiQ models (5): 19 Pasa models (1): 0 Total models: 29,738

[01:46 AM]: Generating protein fasta files from 17,905 EVM models [01:46 AM]: now filtering out bad gene models (< 50 aa in length, transposable elements, etc). [01:46 AM]: Found 3,043 gene models to remove: 72 too short; 19 span gaps; 3,899 transposable elements [01:46 AM]: 14,862 gene models remaining

Do these numbers sound OK to you? I am a little puzzled that there are only 19 HQ Augustus predictions (in the augustus GFF I can see that most predictions are supported by maybe around 50% evidence (length covered)). I expected more, particularly since BUSCO found >3500 complete single copy orthologs. Also the total Augustus predictions seem too low given the previous BUSCO numbers. Do you know a way to improve this?

nextgenusfs commented 5 years ago

Thanks for the feedback. Just pushed https://github.com/nextgenusfs/funannotate/commit/f3e9f3b75fb4bb8450f24c7d14d2c4d985055d95 which will limit gene mark to 64 cpus -- could you retry and see if it runs now?

CodingQuarry may not be helpful for you -- I know that it over predicts in fungi (also tends to produce short gene models due to the string tie training set), but with the smaller genomes the tradeoff of finding a few novel genes might be worth it, but with higher eukaryotes and larger genomes, I'm not sure how helpful it is. Thus I haven't made it a core part of funannotate. If you re-run and let it run GeneMark, then looking at the number of models predicted by GeneMark should be informative to how much CodingQuarry may be over prediction.

I would also like to see what happens if you let funannotate train Augustus -- there are a lot of checks/double-checks that funannotate runs to filter the training sets, this might give you a better result.

You may also think about using the funannotate train module to run PASA. I find that the PASA data are typically better at training Augustus than the BUSCO results.

estolle commented 5 years ago

Thanks alot for fixing the genemark cpu option. It works now.

As you suggested I ran funannotate again and let it train augustus. <( funannotate predict \ -i $FOLDER/Edil.repeats.softmasked.pb.fa \ -o funannotate.test4 \ -s "Euglossa viridissima" \ --busco_seed_species Euglossa_dilemma \ --rna_bam /scratch/ek/euglossa/2018_03_01_hisat2_mapping/merged_bams/euglossa.all.hisat2.bam \ --busco_db hymenoptera \ --cpus 100 \ --genemark_mode ES \ --SeqCenter XXXX \ --repeat_filter overlap blast \ --organism other \ --ploidy 1 \ --stringtie /scratch/ek/euglossa/euglossa_TX_assembly4_stringtie_genomeguided/eug33Tx_vs_Edil_v1_Tx.only_ref_overlapping.annotated.gtf \ --other_gff $FOLDER/snap.predicted.gff \ --protein_evidence \ $FUNANNOTATE_DB/uniprot_sprot.fasta \ /scratch/genomes/genomes_original/Emex/GCF_001483705.1_ASM148370v1_protein.faa \ /scratch/genomes/genomes_original/Mqua/GCA_001276565.1_ASM127656v1_protein.faa \ /scratch/genomes/genomes_original/Bter/GCF_000214255.1_Bter_1.0_protein.faa \ /scratch/genomes/genomes_original/Amel/GCF_003254395.2_Amel_HAv3.1/GCF_003254395.2_Amel_HAv3.1_protein.faa \ /scratch/genomes/genomes_original/Bimp/GCF_000188095.2_BIMP_2.1_protein.faa \ --transcript_evidence \ $FOLDER/dil_vir_merged_transcriptome.fa \ /scratch/ek/euglossa/euglossa_TX_assembly2.4samples/BinPacker.fa \ /scratch/genomes/genomes_original/Edil/Edil_v1.0_transcripts.fa \ /scratch/genomes/genomes_original/Emex/GCF_001483705.1_ASM148370v1_rna.fna )>

This gave me a better result actually:

[05:13 PM]: OS: linux2, 112 cores, ~ 528 GB RAM. Python: 2.7.12 [05:13 PM]: Running funannotate v1.5.0 [05:13 PM]: AUGUSTUS (3.3) detected, version seems to be compatible with BRAKER and BUSCO [05:13 PM]: Parsing GFF pass-through: /scratch/ek/euglossa/euglossa.annotation.phil.brand/snap.predicted.gff --> setting source to other_pred1 [05:13 PM]: Loading genome assembly and parsing soft-masked repetitive sequences [05:14 PM]: Genome loaded: 22,698 scaffolds; 588,199,719 bp; 35.30% repeats masked [05:14 PM]: Aligning transcript evidence to genome with minimap2 [05:16 PM]: Found 136,041 alignments, wrote GFF3 and Augustus hints to file [05:16 PM]: Extracting hints from RNA-seq BAM file using bam2hints [07:40 PM]: Mapping proteins to genome using Diamond blastx/Exonerate [07:40 PM]: Using 645,086 proteins as queries [07:40 PM]: Running Diamond pre-filter search [07:48 PM]: Found 816,608 preliminary alignments [08:13 PM]: Exonerate finished: found 31,597 alignments [08:13 PM]: Running GeneMark-ES on assembly [08:51 PM]: Converting GeneMark GTF file to GFF3 [08:51 PM]: Found 40,822 gene models [08:51 PM]: Running BUSCO to find conserved gene models for training Augustus [10:55 PM]: 4,067 valid BUSCO predictions found, now formatting for EVM [11:00 PM]: Setting up EVM partitions [11:08 PM]: Generating EVM command list [11:08 PM]: Running EVM commands with 99 CPUs [11:12 PM]: Combining partitioned EVM outputs [11:12 PM]: Converting EVM output to GFF3 [11:44 PM]: Collecting all EVM results [11:44 PM]: 2,082 total gene models from EVM [11:44 PM]: Checking BUSCO protein models for accuracy [11:48 PM]: 1,946 gene models validated, using for training Augustus [11:48 PM]: Training Augustus using BUSCO gene models [11:53 PM]: Augustus initial training results (specificity/sensitivity): nucleotides (92.7%/82.3%); exons (70.2%/64.5%); genes (0.0%/0.0%). [11:53 PM]: Accuracy seems low, you can try to improve by passing the --optimize_augustus option. [11:53 PM]: Running Augustus gene prediction [12:50 AM]: Found 11,873 gene models [12:50 AM]: Pulling out high quality Augustus predictions [12:50 AM]: Found 5,267 high quality predictions from Augustus (>90% exon evidence) [12:50 AM]: CodingQuarry installed --> running CodingQuarry prediction on RNA-seq alignments [04:48 AM]: Summary of gene models passed to EVM (weights):

CodeQuarry models (10): 21,826 Augustus models (1): 6,606 Genemark models (1): 40,822 HiQ models (5): 5,267 Pasa models (1): 0 Total models: 74,521

[04:48 AM]: Setting up EVM partitions [04:57 AM]: Generating EVM command list [04:57 AM]: Running EVM commands with 99 CPUs [05:09 AM]: Combining partitioned EVM outputs [05:09 AM]: Converting EVM output to GFF3 [05:42 AM]: Collecting all EVM results [05:42 AM]: 42,756 total gene models from EVM [05:42 AM]: Generating protein fasta files from 42,756 EVM models [05:42 AM]: now filtering out bad gene models (< 50 aa in length, transposable elements, etc). [05:43 AM]: Found 9,311 gene models to remove: 41 too short; 34 span gaps; 10,725 transposable elements [05:43 AM]: 33,445 gene models remaining [05:43 AM]: Predicting tRNAs ^[[B[11:52 AM]: Found 185 tRNA gene models [11:52 AM]: 123 tRNAscan models are valid (non-overlapping) [11:52 AM]: Generating GenBank tbl annotation file [11:53 AM]: Converting to final Genbank format [12:06 PM]: Collecting final annotation files for 33,568 total gene models [12:06 PM]: Funannotate predict is finished, output files are in the funannotate.test4/predict_results folder [12:06 PM]: Your next step to capture UTRs and update annotation using PASA:

( funannotate update -i funannotate.test4 --cpus 100 \ --left /scratch/ek/reads/euglossa/euglossa.R1.fastq.gz \ --right /scratch/ek/reads/euglossa/euglossa.R2.fastq.gz \ --jaccard_clip --memory 200G --stranded no --pasa_db mysql \ --species "Euglossa viridissima2" --max_intronlen 5000 --pasa_alignment_overlap 50.0 --coverage 50 \ --trinity /scratch/ek/euglossa/euglossa.annotation.phil.brand/dil_vir_merged_transcriptome.fa )

[01:43 PM]: OS: linux2, 112 cores, ~ 528 GB RAM. Python: 2.7.12 [01:43 PM]: Running funannotate v1.5.0 [01:43 PM]: No NCBI SBT file given, will use default, for NCBI submissions pass one here '--sbt' [01:44 PM]: Reannotating Euglossa viridissima, NCBI accession: None [01:44 PM]: Previous annotation consists of: 33,445 protein coding gene models and 123 non-coding gene models [01:44 PM]: Trimmomatic will be skipped [01:44 PM]: Read normalization will be skipped [01:44 PM]: Parsing assembled trinity data: /scratch/ek/euglossa/euglossa.annotation.phil.brand/dil_vir_merged_transcriptome.fa [01:45 PM]: Converting transcript alignments to GFF3 format [01:45 PM]: Converting Trinity transcript alignments to GFF3 format [01:45 PM]: Running PASA alignment step using 118,041 transcripts [07:30 PM]: Running PASA annotation comparison step 1 [11:41 PM]: Running PASA annotation comparison step 2 [04:13 AM]: Using Kallisto TPM data to determine which PASA gene models to select at each locus [04:13 AM]: Building Kallisto index [04:15 AM]: Mapping reads using pseudoalignment in Kallisto [05:32 AM]: Parsing Kallisto results. Keeping alt-splicing transcipts if expressed at least 10.0% of highest transcript per locus. [05:32 AM]: Wrote 33,585 transcripts derived from 33,483 protein coding loci. [05:32 AM]: Validating gene models (renaming, checking translations, filtering, etc) [05:32 AM]: Writing 33,604 loci to TBL format: dropped 0 overlapping, 3 too short, and 0 frameshift gene models [05:32 AM]: Converting to Genbank format [05:39 AM]: Collecting final annotation files [05:40 AM]: Parsing GenBank files...comparing annotation [05:47 AM]: Updated annotation complete:

Total Gene Models: 33,604 Total transcripts: 33,698 New Gene Models: 299 No Change: 28,495 Update UTRs: 4,755 Exons Changed: 49 Exons/CDS Changed: 6 Dropped Models: 0 CDS AED: 0.005 mRNA AED: 0.028

[05:47 AM]: Funannotate update is finished, output files are in the funannotate.test4/update_results folder [05:47 AM]: There are 1 gene models that need to be fixed.

FUN_013347 Feature begins or ends in gap starting at 468722

[05:47 AM]: Manually edit the tbl file funannotate.test4/update_results/Euglossa_viridissima2.tbl, then run:

funannotate fix -i funannotate.test4/update_results/Euglossa_viridissima2.gbk -t funannotate.test4/update_results/Euglossa_viridissima2.tbl

[05:47 AM]: After the problematic gene models are fixed, you can proceed with functional annotation. [05:47 AM]: Your next step might be functional annotation, suggested commands:

I also tried running it with trainining (PASA) first: <( funannotate train --cpus 100 --species "Euglossa viridissima2" --max_intronlen 5000 --pasa_alignment_overlap 50.0 --pasa_db mysql --coverage 50 --memory 200G --stranded no \ --trinity /scratch/ek/euglossa/euglossa.annotation.phil.brand/dil_vir_merged_transcriptome.fa \ --input /scratch/ek/euglossa/euglossa.annotation.phil.brand/funannotate.test1/predict_misc/genome.softmasked.fa \ --out funannotate.train1 \ --left /scratch/ek/reads/euglossa/euglossa.R1.fastq.gz \ --right /scratch/ek/reads/euglossa/euglossa.R2.fastq.gz )>

The results seem to be even better, however, using this trainingset, then Augustus is using only 900ish models and discards the other few thousand and complains about low accurary. Not sure how to interprete this, the final numbers look good (just regarding the numbers).

[11:29 PM]: OS: linux2, 112 cores, ~ 528 GB RAM. Python: 2.7.12 [11:29 PM]: Running funannotate v1.5.0 [11:29 PM]: Trimmomatic will be skipped [11:29 PM]: Read normalization will be skipped [11:29 PM]: Parsing assembled trinity data : /scratch/ek/euglossa/euglossa.annotation.phil.brand/dil_vir_merged_transcriptome.fa [11:29 PM]: Converting transcript alignments to GFF3 format [11:30 PM]: Converting Trinity transcript alignments to GFF3 format [11:30 PM]: Running PASA alignment step using 118,041 transcripts [04:30 AM]: PASA assigned 85,152 transcipts to 84,408 loci (genes) [04:30 AM]: Getting PASA models for training with TransDecoder [04:36 AM]: PASA finished. PASAweb accessible via: localhost:port/cgi-bin/index.cgi?db=Euglossa_viridissima2 [04:36 AM]: Using Kallisto TPM data to determine which PASA gene models to select at each locus [04:36 AM]: Building Kallisto index [04:38 AM]: Mapping reads using pseudoalignment in Kallisto [05:54 AM]: Parsing expression value results. Keeping best transcript at each locus. [05:54 AM]: Wrote 9,911 PASA gene models [05:54 AM]: PASA database name: Euglossa_viridissima2 [05:54 AM]: Trinity/PASA has completed, you are now ready to run funanotate predict, for example:

( funannotate predict \ -i /scratch/ek/euglossa/euglossa.annotation.phil.brand/funannotate.test1/predict_misc/genome.softmasked.fa \ -o funannotate.train1 \ -s "Euglossa viridissima2" \ --busco_seed_species Euglossa_dilemma \ --rna_bam /scratch/ek/euglossa/2018_03_01_hisat2_mapping/merged_bams/euglossa.all.hisat2.bam \ --busco_db hymenoptera \ --cpus 100 \ --SeqCenter XXXX \ --repeat_filter overlap blast \ --organism other \ --ploidy 1 \ --stringtie /scratch/ek/euglossa/euglossa_TX_assembly4_stringtie_genomeguided/eug33Tx_vs_Edil_v1_Tx.only_ref_overlapping.annotated.gtf \ --other_gff $FOLDER/snap.predicted.gff \ --protein_evidence \ $FUNANNOTATE_DB/uniprot_sprot.fasta \ /scratch/genomes/genomes_original/Emex/GCF_001483705.1_ASM148370v1_protein.faa \ /scratch/genomes/genomes_original/Mqua/GCA_001276565.1_ASM127656v1_protein.faa \ /scratch/genomes/genomes_original/Bter/GCF_000214255.1_Bter_1.0_protein.faa \ /scratch/genomes/genomes_original/Amel/GCF_003254395.2_Amel_HAv3.1/GCF_003254395.2_Amel_HAv3.1_protein.faa \ /scratch/genomes/genomes_original/Bimp/GCF_000188095.2_BIMP_2.1_protein.faa \ --transcript_evidence \ $FOLDER/dil_vir_merged_transcriptome.fa \ /scratch/ek/euglossa/euglossa_TX_assembly2.4samples/BinPacker.fa \ /scratch/genomes/genomes_original/Edil/Edil_v1.0_transcripts.fa \ /scratch/genomes/genomes_original/Emex/GCF_001483705.1_ASM148370v1_rna.fna )

[01:18 PM]: OS: linux2, 112 cores, ~ 528 GB RAM. Python: 2.7.12 [01:18 PM]: Running funannotate v1.5.0 [01:18 PM]: Found training files, will re-use these files: --pasa_gff funannotate.train1/training/funannotate_train.pasa.gff3 --transcript_alignments funannotate.train1/training/funannotate_train.transcripts.gff3 [01:18 PM]: AUGUSTUS (3.3) detected, version seems to be compatible with BRAKER and BUSCO [01:18 PM]: Parsing GFF pass-through: /scratch/ek/euglossa/euglossa.annotation.phil.brand/snap.predicted.gff --> setting source to other_pred1 [01:18 PM]: Loading genome assembly and parsing soft-masked repetitive sequences [01:19 PM]: Genome loaded: 22,698 scaffolds; 588,199,719 bp; 35.30% repeats masked [01:19 PM]: Existing transcript alignments found: funannotate.train1/predict_misc/transcript_alignments.gff3 [01:19 PM]: Extracting hints from RNA-seq BAM file using bam2hints [03:37 PM]: Mapping proteins to genome using Diamond blastx/Exonerate [03:37 PM]: Using 645,086 proteins as queries [03:37 PM]: Running Diamond pre-filter search [03:44 PM]: Found 816,608 preliminary alignments [04:12 PM]: Exonerate finished: found 31,644 alignments [04:12 PM]: Training Augustus using PASA data. [04:13 PM]: 788 of 9,911 models pass training parameters replaced tx with 0 MEA txs replaced tx with 0 MEA txs replaced tx with 0 MEA txs replaced tx with 0 MEA txs replaced tx with 0 MEA txs replaced tx with 0 MEA txs replaced tx with 0 MEA txs replaced tx with 0 MEA txs replaced tx with 0 MEA txs replaced tx with 0 MEA txs replaced tx with 0 MEA txs replaced tx with 0 MEA txs replaced tx with 0 MEA txs replaced tx with 0 MEA txs replaced tx with 0 MEA txs replaced tx with 0 MEA txs [04:13 PM]: Augustus initial training results (specificity/sensitivity): nucleotides (98.0%/92.4%); exons (84.9%/82.4%); genes (43.0%/40.6%). [04:13 PM]: Accuracy seems low, you can try to improve by passing the --optimize_augustus option. [04:13 PM]: Running Augustus gene prediction [05:16 PM]: Found 14,834 gene models [05:17 PM]: Running GeneMark-ES on assembly [06:09 PM]: Converting GeneMark GTF file to GFF3 [06:09 PM]: Found 40,822 gene models [06:09 PM]: Pulling out high quality Augustus predictions [06:09 PM]: Found 6,331 high quality predictions from Augustus (>90% exon evidence) [06:09 PM]: CodingQuarry installed --> running CodingQuarry prediction on RNA-seq alignments [10:13 PM]: Summary of gene models passed to EVM (weights):

CodeQuarry models (10): 21,826 Augustus models (1): 8,503 Genemark models (1): 40,822 HiQ models (5): 6,331 PASA models (10): 9,911 Total models: 87,393

[10:13 PM]: Setting up EVM partitions [10:22 PM]: Generating EVM command list [10:22 PM]: Running EVM commands with 99 CPUs [10:35 PM]: Combining partitioned EVM outputs [10:35 PM]: Converting EVM output to GFF3 [11:07 PM]: Collecting all EVM results [11:07 PM]: 44,935 total gene models from EVM [11:07 PM]: Generating protein fasta files from 44,935 EVM models [11:07 PM]: now filtering out bad gene models (< 50 aa in length, transposable elements, etc). [11:08 PM]: Found 9,328 gene models to remove: 40 too short; 37 span gaps; 10,748 transposable elements [11:08 PM]: 35,607 gene models remaining [11:08 PM]: Predicting tRNAs [04:57 AM]: Found 185 tRNA gene models [04:57 AM]: 128 tRNAscan models are valid (non-overlapping) [04:57 AM]: Generating GenBank tbl annotation file [04:57 AM]: Converting to final Genbank format [05:11 AM]: Collecting final annotation files for 35,735 total gene models [05:11 AM]: Funannotate predict is finished, output files are in the funannotate.train1/predict_results folder [05:11 AM]: Your next step to capture UTRs and update annotation using PASA:

funannotate update -i funannotate.train1 --cpus 100

( funannotate update -i funannotate.train1 --cpus 100 \ --left /scratch/ek/reads/euglossa/euglossa.R1.fastq.gz \ --right /scratch/ek/reads/euglossa/euglossa.R2.fastq.gz \ --jaccard_clip --memory 200G --stranded no --pasa_db mysql \ --species "Euglossa viridissima2" --max_intronlen 5000 --pasa_alignment_overlap 50.0 --coverage 50 \ --trinity /scratch/ek/euglossa/euglossa.annotation.phil.brand/dil_vir_merged_transcriptome.fa )

after running this last step (funannotate update) i got an error (funannotate update with basically the same commands/inputs worked befoe in the first example I gave above):

[05:27 PM]: OS: linux2, 112 cores, ~ 528 GB RAM. Python: 2.7.12 [05:27 PM]: Running funannotate v1.5.0 [05:27 PM]: No NCBI SBT file given, will use default, for NCBI submissions pass one here '--sbt' [05:27 PM]: Found relevent files in funannotate.train1/training, will re-use them: Forward reads: funannotate.train1/training/left.fq.gz Reverse reads: funannotate.train1/training/right.fq.gz Trinity results: funannotate.train1/training/funannotate_train.trinity-GG.fasta PASA config file: funannotate.train1/training/pasa/alignAssembly.txt [05:28 PM]: Reannotating Euglossa viridissima2, NCBI accession: None [05:28 PM]: Previous annotation consists of: 35,607 protein coding gene models and 128 non-coding gene models [05:28 PM]: Trimmomatic will be skipped [05:28 PM]: Read normalization will be skipped Traceback (most recent call last): File "/opt/funnotate/funannotate-1.5.1/bin/funannotate-update.py", line 1866, in if lib.which('stringtie') and lib.checkannotations(shortBAM): NameError: name 'shortBAM' is not defined

Do you know what this means? I am a bit puzzled about the error since it worked before with this command/input, but on the other run where I just did training with augustus (no PASA involved).

Another question for me was whether I should run genemark ES or ET - I ran both and they find 40k vs 28 or 29 k models, so the difference is not big. Do you know if the genemark predictions are alot better when they come from ET (with transcript evidence)? It also seems that genemark find much more gene models than codingquarry, almost twice as much. In the end around 35k remain. I have to look up if this a reasonable number.

estolle commented 5 years ago

sorry the formatting is off, the options here in the webbrowser form are not well visible/clear

estolle commented 5 years ago

and github doesn't let me edit my posts

nextgenusfs / funannotate

funannotate predict failing at: Extraction of high quality Augustus predictions // IndexError: list index out of range #252

Some results:

This gave me a better result actually:

CodeQuarry models (10): 21,826 Augustus models (1): 6,606 Genemark models (1): 40,822 HiQ models (5): 5,267 Pasa models (1): 0 Total models: 74,521

Total Gene Models: 33,604 Total transcripts: 33,698 New Gene Models: 299 No Change: 28,495 Update UTRs: 4,755 Exons Changed: 49 Exons/CDS Changed: 6 Dropped Models: 0 CDS AED: 0.005 mRNA AED: 0.028

[05:47 AM]: Funannotate update is finished, output files are in the funannotate.test4/update_results folder [05:47 AM]: There are 1 gene models that need to be fixed.

FUN_013347 Feature begins or ends in gap starting at 468722

The results seem to be even better, however, using this trainingset, then Augustus is using only 900ish models and discards the other few thousand and complains about low accurary. Not sure how to interprete this, the final numbers look good (just regarding the numbers).

CodeQuarry models (10): 21,826 Augustus models (1): 8,503 Genemark models (1): 40,822 HiQ models (5): 6,331 PASA models (10): 9,911 Total models: 87,393

funannotate update -i funannotate.train1 --cpus 100