jasongallant opened 3 months ago
What seems to happen for larger eukaryotes is that CodingQuarry and SNAP don't work very well. Can you check the logs and see how many genes each tool predicted prior to EVM? You might need to turn off a few of the ab initio models. Also, is your assembly haploid?
Yes, it's a haploid assembly. I also took extra care by running `funannotate clean` and `funannotate sort` prior to the pipeline, if that's helpful.
I think the portion of the log you are asking for is here (so many logs!)
[Mar 05 06:09 PM]: Running Augustus gene prediction using apteronotus_albifrons parameters
[Mar 05 07:15 PM]: 51,732 predictions from Augustus
[Mar 05 07:15 PM]: Pulling out high quality Augustus predictions
[Mar 05 07:15 PM]: Found 13,306 high quality predictions from Augustus (>90% exon evidence)
[Mar 05 07:15 PM]: Running SNAP gene prediction, using training data: apteronotus_train/predict_misc/final_training_models.gff3
[Mar 05 07:58 PM]: 104,593 predictions from SNAP
[Mar 05 07:58 PM]: Running GlimmerHMM gene prediction, using training data: apteronotus_train/predict_misc/final_training_models.gff3
[Mar 05 09:05 PM]: 106,875 predictions from GlimmerHMM
[Mar 05 09:05 PM]: Summary of gene models passed to EVM (weights):
Source        Weight  Count
Augustus      1       38426
Augustus HiQ  2       13306
GeneMark      1       78608
GlimmerHMM    1       106875
pasa          6       31714
snap          1       104593
Total         -       373522
[Mar 05 09:05 PM]: EVM: partitioning input to ~ 35 genes per partition using min 1500 bp interval
[Mar 06 12:41 AM]: Converting to GFF3 and collecting all EVM results
[Mar 06 12:41 AM]: 51,654 total gene models from EVM
[Mar 06 12:41 AM]: Generating protein fasta files from 51,654 EVM models
[Mar 06 12:42 AM]: now filtering out bad gene models (< 50 aa in length, transposable elements, etc).
[Mar 06 12:42 AM]: Found 1,420 gene models to remove: 33 too short; 0 span gaps; 1,387 transposable elements
[Mar 06 12:42 AM]: 50,234 gene models remaining
[Mar 06 12:42 AM]: Predicting tRNAs
[Mar 06 01:16 AM]: 7,818 tRNAscan models are valid (non-overlapping)
So it looks like snap and glimmerhmm aren't performing very well. You can turn those off by passing `--weights snap:0 glimmerhmm:0` to your `funannotate predict` command. It should re-use the existing data and re-run EVM with the new weights, effectively ignoring the snap and glimmerhmm models.
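For reference, the full re-run might look something like this (a sketch only; the input/output arguments are taken from your earlier commands, and only `--weights` is new):

```shell
# sketch of the re-run -- keep the same output folder so cached results
# are re-used, and zero-weight the snap and glimmerhmm models for EVM
funannotate predict \
    -i species_genome_cleaned.sortedmasked.fa \
    -o apteronotus_train \
    -s "Myfish species" \
    --weights snap:0 glimmerhmm:0 \
    --max_intronlen 10000 \
    --cpus 16
```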
If you are more "adventurous" and would perhaps tolerate a few bugs, I've been slowly working on funannotate2: https://github.com/nextgenusfs/funannotate2. The training and predict modules are mostly complete. I've re-written nearly everything to hopefully work better, and have also written a new consensus gene-calling method. Largely I'm trying to reduce dependencies and complexity; funannotate tries to do too many things, and installation becomes difficult. I'd be interested to know whether it works better or not. Let me know if you'd like to give it a try and I can send you some instructions.
OK, I'll give those settings a shot and see what happens.
But yes, I'm feeling adventurous (adventure is my middle name! ha-- it's actually Raymond). Send me some info, and I'll be happy to be a test pilot. I totally understand the Rube Goldberg nature of things, and appreciate your efforts to simplify. If I can be of any assistance, let me know!
Okay, so for f2 you can do this to install with mamba/conda/pip:

```shell
# install external dependencies
mamba create -n funannotate2 "python<=3.10" biopython "evidencemodeler>=2" minimap2 miniprot snap "augustus==3.5.0" glimmerhmm diamond trnascan-se table2asn

# activate the environment
conda activate funannotate2

# install latest from git
python -m pip install git+https://github.com/nextgenusfs/funannotate2.git

# need to also set the env variable pointing to the funannotate v1 FUNANNOTATE_DB
export FUNANNOTATE_DB=/path/to/existing/funannotate_db/folder
```
You could probably also just install it in the same environment as your existing funannotate installation; then you would just need to add `miniprot` and `table2asn` via conda/mamba.
The help menus should all look and feel familiar:

```
$ funannotate2
usage: funannotate2 [-h] [--version] {clean,train,predict} ...

Funannotate: eukaryotic genome annotation pipeline

Commands:
  clean       Find and remove duplicated contigs, sort by size, rename headers.
  train       Train ab initio gene prediction algorithms.
  predict     Predict primary gene models in a eukaryotic genome.

Help:
  -h, --help  Show this help message and exit
  --version   show program's version number and exit
```
One difference is that `funannotate2 train` will run all of the training steps. It takes an optional `-t, --training-set` in GFF3 format; if a training set is not passed, it will run buscolite to generate one (the tool will auto-detect the appropriate BUSCO models and download them based on the taxonomy you provide via the `-s, --species` parameter, so don't put a dummy value in there). Those gene models are then used to train all of the ab initio predictors. The output of train is a JSON file containing all of the training parameters, which is then used by `funannotate2 predict`.
```
$ funannotate2 train
usage: funannotate2 train -f -s -o [-t] [--strain] [--cpus] [--optimize-augustus] [--header-len] [-h] [--version]

Gene model training from automated training of ab initio gene prediction algorithms:

Required arguments:
  -f , --fasta          genome in FASTA format
  -s , --species        Species name, use quotes for binomial, e.g. "Aspergillus fumigatus"
  -o , --out            Output folder name

Optional arguments:
  -t , --training-set   Training set to use in GFF3 format
  --strain              Strain/isolate name
  --cpus                Number of CPUs to use (default: 2)
  --optimize-augustus   Run Augustus mediated optimized training (not recommended) (default: False)
  --header-len          Max length for fasta headers (default: 100)

Other arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
```
Example of the training parameters JSON output, written to `output/train_results/<species>.params.json`:
{
"name": "ophidiomyces_ophidiicola",
"species": "Ophidiomyces ophidiicola",
"taxonomy": {
"superkingdom": "Eukaryota",
"kingdom": "Fungi",
"phylum": "Ascomycota",
"class": "Eurotiomycetes",
"order": "Onygenales",
"family": "Onygenaceae",
"genus": "Ophidiomyces",
"species": "Ophidiomyces ophidiicola"
},
"busco-lineage": "/usr/local/share/funannotate/onygenales_odb10",
"assembly": {
"n_contigs": 115,
"size": 21919615,
"n50": 506472,
"n90": 122144,
"l50": 12,
"l90": 42,
"avg_length": 190605
},
"abinitio": {
"augustus": {
"location": "/Users/jon/software/funannotate2/tests/fun2_train/train_misc/augustus",
"species": "7975c7f5-2f76-445e-8880-7c021e5cb032",
"train_results": {
"tool": "augustus",
"model": "7975c7f5-2f76-445e-8880-7c021e5cb032",
"n_test_genes": 200,
"ref_genes_found": 198,
"ref_genes_missed": 2,
"extra_query_genes": 120,
"average_aed": 0.11175562640178095,
"nucleotide_sensitivity": 0.9075853572557586,
"nucleotide_precision": 0.9036721113763718,
"exon_sensitivity": 0.6288659793814433,
"exon_precision": 0.6606332098600142,
"gene_sensitivity": 0.9753086419753086,
"gene_precision": 0.3969849246231156
},
"training_set": "/Users/jon/software/funannotate2/tests/fun2_train/train_misc/training-models.train.gff3"
},
"glimmerhmm": {
"location": "/Users/jon/software/funannotate2/tests/fun2_train/train_misc/glimmerhmm/train",
"train_results": {
"tool": "glimmerhmm",
"model": "train",
"n_test_genes": 200,
"ref_genes_found": 199,
"ref_genes_missed": 1,
"extra_query_genes": 178,
"average_aed": 0.14522900680209533,
"nucleotide_sensitivity": 0.8494452668122087,
"nucleotide_precision": 0.9028987277614996,
"exon_sensitivity": 0.5104712041884817,
"exon_precision": 0.5557903266018449,
"gene_sensitivity": 0.9821428571428571,
"gene_precision": 0.23605150214592274
},
"training_set": "/Users/jon/software/funannotate2/tests/fun2_train/train_misc/training-models.train.gff3"
},
"snap": {
"location": "/Users/jon/software/funannotate2/tests/fun2_train/train_misc/snap/snap-trained.hmm",
"train_results": {
"tool": "snap",
"model": "snap-trained.hmm",
"n_test_genes": 200,
"ref_genes_found": 195,
"ref_genes_missed": 5,
"extra_query_genes": 127,
"average_aed": 0.18329479111453992,
"nucleotide_sensitivity": 0.8492347150671552,
"nucleotide_precision": 0.8563979566847235,
"exon_sensitivity": 0.31283422459893045,
"exon_precision": 0.2723601561836857,
"gene_sensitivity": 0.16666666666666666,
"gene_precision": 0.0078125
},
"training_set": "/Users/jon/software/funannotate2/tests/fun2_train/train_misc/training-models.train.gff3"
},
"genemark": {
"location": "/Users/jon/software/funannotate2/tests/fun2_train/train_misc/genemark/gmhmm.mod",
"train_results": {
"tool": "genemark",
"model": "gmhmm.mod",
"n_test_genes": 200,
"ref_genes_found": 200,
"ref_genes_missed": 0,
"extra_query_genes": 196,
"average_aed": 0.09756846438881216,
"nucleotide_sensitivity": 0.898724540750394,
"nucleotide_precision": 0.9344481427857675,
"exon_sensitivity": 0.6243654822335025,
"exon_precision": 0.6643433977190323,
"gene_sensitivity": 1.0,
"gene_precision": 0.2924187725631769
},
"training_set": "self training"
}
},
"date": "2023-10-26 12:03:48.138710",
"user": "jon"
}
So with your data, since you have already generated the RNA-seq mediated Trinity/PASA results, I would pass the PASA GFF3 file to `-t, --training-set`. After training is complete you can run `funannotate2 predict`, which again should look mostly familiar; currently you will need to pass the full path to the training JSON params file (eventually I'll make this "easy/automatic" as before). For comparison's sake, the latest funannotate v1 can also take this params JSON file as input to predict. What is new in `funannotate2 predict` is more efficient evidence alignment as well as a new consensus gene-calling program.
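As a concrete sketch, the training step with your PASA models would look something like this (file names are placeholders; the flags are from the `funannotate2 train` help):

```shell
# sketch: train the ab initio predictors from an existing PASA/TransDecoder
# gene model set rather than BUSCO-mediated training
funannotate2 train \
    -f species_genome_cleaned.sortedmasked.fa \
    -s "Myfish species" \
    -t funannotate_train.pasa.gff3 \
    -o fun2_train \
    --cpus 16
```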
```
$ funannotate2 predict
usage: funannotate2 predict -f -o -p -s [-st] [-w] [-m] [-ps] [-ts] [-c] [-mi] [-hl] [-l] [-n] [-pt] [-t] [-h] [--version]

Gene model prediction from automated training of ab initio gene prediction algorithms:

Required arguments:
  -f , --fasta          genome in FASTA format (softmasked repeats)
  -o , --out            Output folder name
  -p , --params , --pretrained
                        Params.json
  -s , --species        Species name, use quotes for binomial, e.g. "Aspergillus fumigatus"

Optional arguments:
  -st , --strain        Strain/isolate name (e.g. Af293)
  -w , --weights        Gene predictors and weights
  -m , --consensus-method
                        Consensus model generation method [evm,gfftk] (default: gfftk)
  -ps , --proteins      Proteins to use for evidence
  -ts , --transcripts   Transcripts to use for evidence
  -c , --cpus           Number of cpus/threads to use (default: 2)
  -mi , --max-intron    Maximum intron length (default: 3000)
  -hl , --header-len    Max length for fasta headers (default: 100)
  -l , --locus-tag      Locus tag for genes, perhaps assigned by NCBI, eg. VC83 (default: FUN_)
  -n , --numbering      Specify start of gene numbering (default: 1)
  -t , --tmpdir         volume to write tmp files (default: /tmp)

Other arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
```
So you will also (for now) need to pass transcripts and proteins manually for evidence; for your dataset, something like this:

```shell
funannotate2 predict -p /path/to/genus_species.params.json -s "Genus species" -o out_folder \
    -f sortedmasked.fa -ps $FUNANNOTATE_DB/uniprot_sprot.fasta \
    -ts Trinity_GG.fasta --cpus N
```
So I'm curious to know whether the new consensus caller, `gfftk consensus`, results in a more appropriate number of gene models. It is a lot faster than EVM and should generate automatic weightings of the input sources as well. Of course, let me know if something isn't clear.
Hi Jon,
Thanks for the detailed guidance on this. One quick question: there are several PASA GFF files output by my previous training run; which one should I use as input for the `-t` parameter of `funannotate2 train`?
In my output/training folder there are:

- pasa.step1.gff
- funannotate_train.pasa.gff3 (which points to transcript.alignments.gff)

Within the output/training/pasa directory there are several more:

- [species]_pasa.assemblies.fasta.transdecoder.genome.gff3
- [species]_pasa.assemblies.fasta.transdecoder.gff3
- [species]_pasa.failed_blat_alignments.gff3
- [species]_pasa.failed_custom_alignments.gff3
- [species]_pasa.pasa_assemblies.gff3
- [species]_pasa.valid_blat_alignments.gff3
- [species]_pasa.valid_custom_alignments.gff3
- blat.spliced_alignments.gff3
- trinity.fasta.clean.transdecoder.gff3
A second note: I tried rerunning funannotate v1 with the parameters you suggested (and an increased max intron length), and I ended up with slightly more genes:
Progress: 2939 complete, 0 failed, 0 remaining
Source        Weight  Count
Augustus      1       38426
Augustus HiQ  2       13306
GeneMark      1       78608
GlimmerHMM    0       106875
pasa          6       31714
snap          0       104593
Total         -       373522
[Mar 08 01:49 AM]: 60,699 total gene models from EVM
[Mar 08 01:49 AM]: Generating protein fasta files from 60,699 EVM models
[Mar 08 01:50 AM]: now filtering out bad gene models (< 50 aa in length, transposable elements, etc).
[Mar 08 01:50 AM]: Found 1,324 gene models to remove: 17 too short; 1 span gaps; 1,306 transposable elements
[Mar 08 01:50 AM]: 59,375 gene models remaining
[Mar 08 01:50 AM]: Predicting tRNAs
[Mar 08 01:50 AM]: 7,773 tRNAscan models are valid (non-overlapping)
[Mar 08 01:50 AM]: Generating GenBank tbl annotation file
[Mar 08 01:53 AM]: Collecting final annotation files for 67,148 total gene models
I'm noticing quite a few incoming genes from GeneMark, whereas the others are more on the reasonable side. I'm thinking I might weight GeneMark 0 as well and run again; that would leave just PASA and Augustus.
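Concretely, I assume I'd just extend the same weights flag, something like this (untested; input/output arguments as in my earlier runs):

```shell
# sketch: zero-weight genemark as well, leaving Augustus and PASA for EVM
funannotate predict \
    -i species_genome_cleaned.sortedmasked.fa \
    -o apteronotus_train \
    -s "Myfish species" \
    --weights snap:0 glimmerhmm:0 genemark:0 \
    --cpus 16
```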
funannotate_train.pasa.gff3 should be the same as [species]_pasa.assemblies.fasta.transdecoder.genome.gff3, I think. Basically you want the TransDecoder-filtered set that has genomic coordinates.
Huh, I'm wondering if EVM was actually passed the proper data; is that shown in the log file for EVM?
So based on the numbers of ab initio gene calls, all tools are predicting many more than 25k genes:
augustus --> 38426 + 13306 = 51732
genemark --> 78608
pasa --> 31714
The PASA models are typically around half of what is normally in a genome (at least for fungi; perhaps this is different in animals). Effectively they are expressed genes, which I'd expect to be at most maybe 60% of the genes in the genome, unless you have the world's best RNA-seq data from multiple cells/timepoints/etc.
I'm still interested to see what the `funannotate2 train` module will do. You could also try a second training where you don't use the PASA gene models and instead use the default BUSCO-mediated training. It's possible that your PASA gene models are incomplete and shorter than they really are, which could then be resulting in poor training.
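That second training run would simply omit `-t, --training-set`, e.g. (a sketch; file and folder names are placeholders):

```shell
# sketch: BUSCO-mediated training -- with no -t/--training-set, buscolite
# auto-detects and downloads BUSCO models based on the species taxonomy
funannotate2 train \
    -f species_genome_cleaned.sortedmasked.fa \
    -s "Myfish species" \
    -o fun2_train_busco \
    --cpus 16
```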
Hi @nextgenusfs --
Thanks for writing this incredible software! I've been experimenting with funannotate for the past week or so, and have it running on some new electric fish genomes. I've been meaning to try it out and compare results to MAKER, which has been our go-to pipeline for a while.
I've been following the suggestions for non-fungal genomes at https://funannotate.readthedocs.io/en/latest/tutorials.html, running the following:
```shell
funannotate train \
    -i species_genome_cleaned.sortedmasked.fa \
    -o species_train \
    -l ${r1_files} \
    -r ${r2_files} \
    --pasa_db mysql \
    --trinity Trinity-GG.fasta \
    --max_intronlen 10000 \
    --species "Myfish species" \
    --cpus 16
```
Followed by:
```shell
funannotate predict \
    -i species_genome_cleaned.sortedmasked.fa \
    -o apteronotus_train \
    -s "Myfish species" \
    --repeats2evm \
    --busco_seed_species zebrafish \
    --busco_db actinopterygii \
    --organism other \
    --max_intronlen 10000 \
    --species "Myfish species" \
    --cpus 16
```
If I'm understanding the outputs correctly, the "Myfish_species.stats.json" file lists a whopping 58,036 genes; we are expecting about half that (~25k), which is in fact what MAKER gives us on the identical genome and what gene counts are listed for other closely related species.
I'm not sure where things went awry. I am reusing the genome-guided transcriptome that I pre-assembled for the MAKER annotation rather than assembling it via funannotate. I am using the same max_intronlen that I used for MAKER, although I'm not sure how strictly that is enforced in MAKER. I did notice that for a related species annotated in NCBI, the maximum intron length is 500,000 bp, which is obviously quite different. Finally, I noted a complaint in the output log at the Augustus stage of training:
Augustus initial training results:
Feature      Specificity  Sensitivity
nucleotides  91.4%        84.3%
exons        71.9%        68.8%
genes        11.3%        9.6%
[Mar 05 06:09 PM]: Accuracy seems low, you can try to improve by passing the --optimize_augustus option.
I'm contemplating rerunning with this setting to see if it dramatically reduces the gene count, but I'm curious whether this has been reported previously, or whether there are parameters I might tweak to get a more realistic gene number.
I am also noticing in the stats.json file that a pretty small proportion of exons overlap with the transcript and protein evidence, and gene lengths seem to be much smaller than predicted by MAKER:
```json
"annotation": {
    "genes": 58036,
    "common_name": 0,
    "mRNA": 50218,
    "tRNA": 7818,
    "ncRNA": 0,
    "rRNA": 0,
    "avg_gene_length": 3326.31,
    "transcript-level": {
        "CDS_transcripts": 50218,
        "CDS_five_utr": 0,
        "CDS_three_utr": 0,
        "CDS_no_utr": 50218,
        "CDS_five_three_utr": 0,
        "CDS_complete": 48843,
        "CDS_no-start": 687,
        "CDS_no-stop": 502,
        "CDS_no-start_no-stop": 186,
        "total_exons": 272309,
        "total_cds_exons": 272309,
        "multiple_exon_transcript": 44082,
        "single_exon_transcript": 6136,
        "avg_exon_length": 168.64,
        "avg_protein_length": 318.78,
        "functional": {
            "go_terms": 0,
            "interproscan": 0,
            "eggnog": 0,
            "pfam": 0,
            "cazyme": 0,
            "merops": 0,
            "busco": 0,
            "secretion": 0
        },
        "pct_exon_overlap_transcript_evidence": 33.02,
        "pct_exon_overlap_protein_evidence": 2.51
    }
}
```
Thanks for writing this software, and for any thoughts you might have on improving our results!