jasongallant opened 3 months ago
What seems to happen for larger eukaryotes is that CodingQuarry and SNAP don't work very well. Can you check the logs and see how many genes each tool predicted prior to EVM? You might need to turn off a few of the ab initio models. Also, is your assembly haploid?
Yes, it's a haploid assembly. I also took extra care by running `funannotate clean` and `funannotate sort` prior to the pipeline, if that's helpful.
I think the portion of the log you are asking for is here (so many logs!)
[Mar 05 06:09 PM]: Running Augustus gene prediction using apteronotus_albifrons parameters
[Mar 05 07:15 PM]: 51,732 predictions from Augustus
[Mar 05 07:15 PM]: Pulling out high quality Augustus predictions
[Mar 05 07:15 PM]: Found 13,306 high quality predictions from Augustus (>90% exon evidence)
[Mar 05 07:15 PM]: Running SNAP gene prediction, using training data: apteronotus_train/predict_misc/final_training_models.gff3
[Mar 05 07:58 PM]: 104,593 predictions from SNAP
[Mar 05 07:58 PM]: Running GlimmerHMM gene prediction, using training data: apteronotus_train/predict_misc/final_training_models.gff3
[Mar 05 09:05 PM]: 106,875 predictions from GlimmerHMM
[Mar 05 09:05 PM]: Summary of gene models passed to EVM (weights):
Source        Weight  Count
Augustus      1       38426
Augustus HiQ  2       13306
GeneMark      1       78608
GlimmerHMM    1       106875
pasa          6       31714
snap          1       104593
Total         -       373522
[Mar 05 09:05 PM]: EVM: partitioning input to ~ 35 genes per partition using min 1500 bp interval
[Mar 06 12:41 AM]: Converting to GFF3 and collecting all EVM results
[Mar 06 12:41 AM]: 51,654 total gene models from EVM
[Mar 06 12:41 AM]: Generating protein fasta files from 51,654 EVM models
[Mar 06 12:42 AM]: now filtering out bad gene models (< 50 aa in length, transposable elements, etc).
[Mar 06 12:42 AM]: Found 1,420 gene models to remove: 33 too short; 0 span gaps; 1,387 transposable elements
[Mar 06 12:42 AM]: 50,234 gene models remaining
[Mar 06 12:42 AM]: Predicting tRNAs
[Mar 06 01:16 AM]: 7,818 tRNAscan models are valid (non-overlapping)
So it looks like snap and glimmerhmm aren't performing very well. You can turn those off by passing `--weights snap:0 glimmerhmm:0` to your `funannotate predict` command. It should re-use the existing data and re-run EVM with the new weights, effectively ignoring the snap and glimmerhmm models.
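For reference, the full re-run might look something like this (a sketch only; the input/output arguments are taken from your earlier commands, and only `--weights` is new):

```shell
# sketch of the re-run -- keep the same output folder so cached results
# are re-used, and zero-weight the snap and glimmerhmm models for EVM
funannotate predict \
    -i species_genome_cleaned.sortedmasked.fa \
    -o apteronotus_train \
    -s "Myfish species" \
    --weights snap:0 glimmerhmm:0 \
    --max_intronlen 10000 \
    --cpus 16
```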
If you are more "adventurous" and would perhaps tolerate a few bugs, I've been slowly working on funannotate2: https://github.com/nextgenusfs/funannotate2. The training and predict modules are mostly complete. I've re-written nearly everything to hopefully work better, and have also written a new consensus gene-calling method. Largely I'm trying to reduce dependencies and complexity; funannotate tries to do too many things, and installation becomes difficult. I'd be interested to know whether it works better or not. Let me know if you'd like to give it a try and I can send you some instructions.
OK, I'll give those settings a shot and see what happens.
But yes, I'm feeling adventurous (adventure is my middle name! ha-- it's actually Raymond). Send me some info, and I'll be happy to be a test pilot. I totally understand the Rube Goldberg nature of things, and appreciate your efforts to simplify. If I can be of any assistance, let me know!
Okay, so for f2 you can do this to install with mamba/conda/pip:

```shell
# install external dependencies
mamba create -n funannotate2 "python<=3.10" biopython "evidencemodeler>=2" minimap2 miniprot snap "augustus==3.5.0" glimmerhmm diamond trnascan-se table2asn

# activate the environment
conda activate funannotate2

# install latest from git
python -m pip install git+https://github.com/nextgenusfs/funannotate2.git

# need to also set the env variable pointing to the funannotate v1 FUNANNOTATE_DB
export FUNANNOTATE_DB=/path/to/existing/funannotate_db/folder
```
You could probably also just install it in the same environment as your existing funannotate installation; then you would just need to add `miniprot` and `table2asn` via conda/mamba.
The help menus should all look and feel familiar:

```
$ funannotate2
usage: funannotate2 [-h] [--version] {clean,train,predict} ...

Funannotate: eukaryotic genome annotation pipeline

Commands:
  clean       Find and remove duplicated contigs, sort by size, rename headers.
  train       Train ab initio gene prediction algorithms.
  predict     Predict primary gene models in a eukaryotic genome.

Help:
  -h, --help  Show this help message and exit
  --version   show program's version number and exit
```
One difference is that `funannotate2 train` will run all of the training steps. It takes an optional `-t, --training-set` in GFF3 format; if a training set is not passed, it will run buscolite to generate one (the tool will auto-detect the appropriate BUSCO models and download them based on the taxonomy you provide via the `-s, --species` parameter, so don't put a dummy value in there). Those gene models are then used to train all of the ab initio predictors. The output of train is a JSON file containing all of the training parameters, which is then used by `funannotate2 predict`.
```
$ funannotate2 train
usage: funannotate2 train -f -s -o [-t] [--strain] [--cpus] [--optimize-augustus] [--header-len] [-h] [--version]

Gene model training from automated training of ab initio gene prediction algorithms:

Required arguments:
  -f , --fasta          genome in FASTA format
  -s , --species        Species name, use quotes for binomial, e.g. "Aspergillus fumigatus"
  -o , --out            Output folder name

Optional arguments:
  -t , --training-set   Training set to use in GFF3 format
  --strain              Strain/isolate name
  --cpus                Number of CPUs to use (default: 2)
  --optimize-augustus   Run Augustus mediated optimized training (not recommended) (default: False)
  --header-len          Max length for fasta headers (default: 100)

Other arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
```
Example of the training parameters JSON output, written to `output/train_results/<species>.params.json`:
{
"name": "ophidiomyces_ophidiicola",
"species": "Ophidiomyces ophidiicola",
"taxonomy": {
"superkingdom": "Eukaryota",
"kingdom": "Fungi",
"phylum": "Ascomycota",
"class": "Eurotiomycetes",
"order": "Onygenales",
"family": "Onygenaceae",
"genus": "Ophidiomyces",
"species": "Ophidiomyces ophidiicola"
},
"busco-lineage": "/usr/local/share/funannotate/onygenales_odb10",
"assembly": {
"n_contigs": 115,
"size": 21919615,
"n50": 506472,
"n90": 122144,
"l50": 12,
"l90": 42,
"avg_length": 190605
},
"abinitio": {
"augustus": {
"location": "/Users/jon/software/funannotate2/tests/fun2_train/train_misc/augustus",
"species": "7975c7f5-2f76-445e-8880-7c021e5cb032",
"train_results": {
"tool": "augustus",
"model": "7975c7f5-2f76-445e-8880-7c021e5cb032",
"n_test_genes": 200,
"ref_genes_found": 198,
"ref_genes_missed": 2,
"extra_query_genes": 120,
"average_aed": 0.11175562640178095,
"nucleotide_sensitivity": 0.9075853572557586,
"nucleotide_precision": 0.9036721113763718,
"exon_sensitivity": 0.6288659793814433,
"exon_precision": 0.6606332098600142,
"gene_sensitivity": 0.9753086419753086,
"gene_precision": 0.3969849246231156
},
"training_set": "/Users/jon/software/funannotate2/tests/fun2_train/train_misc/training-models.train.gff3"
},
"glimmerhmm": {
"location": "/Users/jon/software/funannotate2/tests/fun2_train/train_misc/glimmerhmm/train",
"train_results": {
"tool": "glimmerhmm",
"model": "train",
"n_test_genes": 200,
"ref_genes_found": 199,
"ref_genes_missed": 1,
"extra_query_genes": 178,
"average_aed": 0.14522900680209533,
"nucleotide_sensitivity": 0.8494452668122087,
"nucleotide_precision": 0.9028987277614996,
"exon_sensitivity": 0.5104712041884817,
"exon_precision": 0.5557903266018449,
"gene_sensitivity": 0.9821428571428571,
"gene_precision": 0.23605150214592274
},
"training_set": "/Users/jon/software/funannotate2/tests/fun2_train/train_misc/training-models.train.gff3"
},
"snap": {
"location": "/Users/jon/software/funannotate2/tests/fun2_train/train_misc/snap/snap-trained.hmm",
"train_results": {
"tool": "snap",
"model": "snap-trained.hmm",
"n_test_genes": 200,
"ref_genes_found": 195,
"ref_genes_missed": 5,
"extra_query_genes": 127,
"average_aed": 0.18329479111453992,
"nucleotide_sensitivity": 0.8492347150671552,
"nucleotide_precision": 0.8563979566847235,
"exon_sensitivity": 0.31283422459893045,
"exon_precision": 0.2723601561836857,
"gene_sensitivity": 0.16666666666666666,
"gene_precision": 0.0078125
},
"training_set": "/Users/jon/software/funannotate2/tests/fun2_train/train_misc/training-models.train.gff3"
},
"genemark": {
"location": "/Users/jon/software/funannotate2/tests/fun2_train/train_misc/genemark/gmhmm.mod",
"train_results": {
"tool": "genemark",
"model": "gmhmm.mod",
"n_test_genes": 200,
"ref_genes_found": 200,
"ref_genes_missed": 0,
"extra_query_genes": 196,
"average_aed": 0.09756846438881216,
"nucleotide_sensitivity": 0.898724540750394,
"nucleotide_precision": 0.9344481427857675,
"exon_sensitivity": 0.6243654822335025,
"exon_precision": 0.6643433977190323,
"gene_sensitivity": 1.0,
"gene_precision": 0.2924187725631769
},
"training_set": "self training"
}
},
"date": "2023-10-26 12:03:48.138710",
"user": "jon"
}
So with your data, since you have already generated the RNA-seq mediated Trinity/PASA results, I would pass the PASA GFF3 file to `-t, --training-set`. After training is complete you can run `funannotate2 predict`, which again should look mostly familiar; currently you will need to pass the full path to the training JSON params file (eventually I'll make this "easy/automatic" as before). For comparison's sake, the latest funannotate v1 can also take this params JSON file as input to predict. What is new in `funannotate2 predict` is more efficient evidence alignment as well as a new consensus gene-calling program.
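As a concrete sketch, the training step with your PASA models would look something like this (file names are placeholders; the flags are from the `funannotate2 train` help):

```shell
# sketch: train the ab initio predictors from an existing PASA/TransDecoder
# gene model set rather than BUSCO-mediated training
funannotate2 train \
    -f species_genome_cleaned.sortedmasked.fa \
    -s "Myfish species" \
    -t funannotate_train.pasa.gff3 \
    -o fun2_train \
    --cpus 16
```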
```
$ funannotate2 predict
usage: funannotate2 predict -f -o -p -s [-st] [-w] [-m] [-ps] [-ts] [-c] [-mi] [-hl] [-l] [-n] [-pt] [-t] [-h] [--version]

Gene model prediction from automated training of ab initio gene prediction algorithms:

Required arguments:
  -f , --fasta          genome in FASTA format (softmasked repeats)
  -o , --out            Output folder name
  -p , --params , --pretrained
                        Params.json
  -s , --species        Species name, use quotes for binomial, e.g. "Aspergillus fumigatus"

Optional arguments:
  -st , --strain        Strain/isolate name (e.g. Af293)
  -w , --weights        Gene predictors and weights
  -m , --consensus-method
                        Consensus model generation method [evm,gfftk] (default: gfftk)
  -ps , --proteins      Proteins to use for evidence
  -ts , --transcripts   Transcripts to use for evidence
  -c , --cpus           Number of cpus/threads to use (default: 2)
  -mi , --max-intron    Maximum intron length (default: 3000)
  -hl , --header-len    Max length for fasta headers (default: 100)
  -l , --locus-tag      Locus tag for genes, perhaps assigned by NCBI, eg. VC83 (default: FUN_)
  -n , --numbering      Specify start of gene numbering (default: 1)
  -t , --tmpdir         volume to write tmp files (default: /tmp)

Other arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
```
So you will also (for now) need to pass transcripts and proteins manually for evidence; for your dataset, something like this:

```shell
funannotate2 predict -p /path/to/genus_species.params.json -s "Genus species" -o out_folder \
    -f sortedmasked.fa -ps $FUNANNOTATE_DB/uniprot_sprot.fasta \
    -ts Trinity_GG.fasta --cpus N
```
So I'm curious to know whether the new consensus caller, `gfftk consensus`, results in a more appropriate number of gene models. It is a lot faster than EVM and should generate automatic weightings of the input sources as well. Of course, let me know if something isn't clear.
Hi Jon,
Thanks for the detailed guidance on this. One quick question: there are several PASA GFF files output by my previous training run; which one should I use as input for the `-t` parameter of `funannotate2 train`?
In my output/training folder there are:

- pasa.step1.gff
- funannotate_train.pasa.gff3 (which points to transcript.alignments.gff)

Within the output/training/pasa directory there are several more:

- [species]_pasa.assemblies.fasta.transdecoder.genome.gff3
- [species]_pasa.assemblies.fasta.transdecoder.gff3
- [species]_pasa.failed_blat_alignments.gff3
- [species]_pasa.failed_custom_alignments.gff3
- [species]_pasa.pasa_assemblies.gff3
- [species]_pasa.valid_blat_alignments.gff3
- [species]_pasa.valid_custom_alignments.gff3
- blat.spliced_alignments.gff3
- trinity.fasta.clean.transdecoder.gff3
A second note: I tried rerunning funannotate v1 with the parameters you suggested (and an increased max intron length), and I ended up with slightly more genes:
Progress: 2939 complete, 0 failed, 0 remaining
Source        Weight  Count
Augustus      1       38426
Augustus HiQ  2       13306
GeneMark      1       78608
GlimmerHMM    0       106875
pasa          6       31714
snap          0       104593
Total         -       373522
[Mar 08 01:49 AM]: 60,699 total gene models from EVM
[Mar 08 01:49 AM]: Generating protein fasta files from 60,699 EVM models
[Mar 08 01:50 AM]: now filtering out bad gene models (< 50 aa in length, transposable elements, etc).
[Mar 08 01:50 AM]: Found 1,324 gene models to remove: 17 too short; 1 span gaps; 1,306 transposable elements
[Mar 08 01:50 AM]: 59,375 gene models remaining
[Mar 08 01:50 AM]: Predicting tRNAs
[Mar 08 01:50 AM]: 7,773 tRNAscan models are valid (non-overlapping)
[Mar 08 01:50 AM]: Generating GenBank tbl annotation file
[Mar 08 01:53 AM]: Collecting final annotation files for 67,148 total gene models
I'm noticing quite a few incoming genes from GeneMark, whereas the others are more on the reasonable side. I'm thinking I might weight GeneMark 0 as well and run again; that would leave just PASA and Augustus.
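Concretely, I assume I'd just extend the same weights flag, something like this (untested; input/output arguments as in my earlier runs):

```shell
# sketch: zero-weight genemark as well, leaving Augustus and PASA for EVM
funannotate predict \
    -i species_genome_cleaned.sortedmasked.fa \
    -o apteronotus_train \
    -s "Myfish species" \
    --weights snap:0 glimmerhmm:0 genemark:0 \
    --cpus 16
```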
funannotate_train.pasa.gff3 should be the same as [species]_pasa.assemblies.fasta.transdecoder.genome.gff3, I think. Basically you want the TransDecoder-filtered set that has genomic coordinates.
Huh, I'm wondering if EVM was actually passed the proper data; is that shown in the log file for EVM?
So based on the numbers of ab initio gene calls, all tools are predicting many more than 25k genes:
augustus --> 38426 + 13306 = 51732
genemark --> 78608
pasa --> 31714
The PASA models are typically around half of what is normally in a genome (at least for fungi; perhaps this is different in animals). Effectively they are expressed genes, which I'd expect to be at most maybe 60% of the genes in the genome, unless you have the world's best RNA-seq data from multiple cells/timepoints/etc.
I'm still interested to see what the `funannotate2 train` module will do. You could also try a second training where you don't use the PASA gene models and instead use the default BUSCO-mediated training. It's possible that your PASA gene models are incomplete and shorter than they really are, which could then be resulting in poor training.
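That second training run would simply omit `-t, --training-set`, e.g. (a sketch; file and folder names are placeholders):

```shell
# sketch: BUSCO-mediated training -- with no -t/--training-set, buscolite
# auto-detects and downloads BUSCO models based on the species taxonomy
funannotate2 train \
    -f species_genome_cleaned.sortedmasked.fa \
    -s "Myfish species" \
    -o fun2_train_busco \
    --cpus 16
```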
Hi @nextgenusfs --
Thanks for writing this incredible software! I've been experimenting with funannotate for the past week or so, and have it running on some new electric fish genomes. I've been meaning to try it out and compare results to MAKER, which has been our go-to pipeline for a while.
I've been following the suggestions for non-fungal genomes at https://funannotate.readthedocs.io/en/latest/tutorials.html, running the following:
```shell
funannotate train \
    -i species_genome_cleaned.sortedmasked.fa \
    -o species_train \
    -l ${r1_files} \
    -r ${r2_files} \
    --pasa_db mysql \
    --trinity Trinity-GG.fasta \
    --max_intronlen 10000 \
    --species "Myfish species" \
    --cpus 16
```
Followed by:
```shell
funannotate predict \
    -i species_genome_cleaned.sortedmasked.fa \
    -o apteronotus_train \
    -s "Myfish species" \
    --repeats2evm \
    --busco_seed_species zebrafish \
    --busco_db actinopterygii \
    --organism other \
    --max_intronlen 10000 \
    --species "Myfish species" \
    --cpus 16
```
If I'm understanding the outputs correctly, the "Myfish_species.stats.json" file lists a whopping 58,036 genes; we are expecting about half that (~25k), which is in fact what MAKER gives us on the identical genome and what gene counts are listed for other closely related species.
I'm not sure where things went awry. I am reusing the genome-guided transcriptome that I pre-assembled for the MAKER annotation rather than assembling it via funannotate. I am using the same max_intronlen that I used for MAKER, although I'm not sure how strictly that is enforced in MAKER. I did notice that for a related species annotated in NCBI, the maximum intron length is 500,000 bp, which is obviously quite different. Finally, I noted a complaint in the output log at the Augustus stage of training:
Augustus initial training results:
Feature      Specificity  Sensitivity
nucleotides  91.4%        84.3%
exons        71.9%        68.8%
genes        11.3%        9.6%
[Mar 05 06:09 PM]: Accuracy seems low, you can try to improve by passing the --optimize_augustus option.
I'm contemplating rerunning with this setting to see if it dramatically reduces the gene count, but I'm curious whether this has been reported previously, or whether there are parameters I might tweak to get a more realistic gene number.
I am also noticing in the stats.json file that a pretty small proportion of exons overlap with the transcript and protein evidence, and gene lengths seem to be much smaller than predicted by MAKER:
```json
"annotation": {
    "genes": 58036,
    "common_name": 0,
    "mRNA": 50218,
    "tRNA": 7818,
    "ncRNA": 0,
    "rRNA": 0,
    "avg_gene_length": 3326.31,
    "transcript-level": {
        "CDS_transcripts": 50218,
        "CDS_five_utr": 0,
        "CDS_three_utr": 0,
        "CDS_no_utr": 50218,
        "CDS_five_three_utr": 0,
        "CDS_complete": 48843,
        "CDS_no-start": 687,
        "CDS_no-stop": 502,
        "CDS_no-start_no-stop": 186,
        "total_exons": 272309,
        "total_cds_exons": 272309,
        "multiple_exon_transcript": 44082,
        "single_exon_transcript": 6136,
        "avg_exon_length": 168.64,
        "avg_protein_length": 318.78,
        "functional": {
            "go_terms": 0,
            "interproscan": 0,
            "eggnog": 0,
            "pfam": 0,
            "cazyme": 0,
            "merops": 0,
            "busco": 0,
            "secretion": 0
        },
        "pct_exon_overlap_transcript_evidence": 33.02,
        "pct_exon_overlap_protein_evidence": 2.51
    }
}
```
Thanks for writing this software, and for any thoughts you might have on improving our results!