nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

Evaluation of gene models from funannotate #923

Closed: kushalsuryamohan closed this issue 1 year ago

kushalsuryamohan commented 1 year ago

Hello, is there a systematic way to assess the quality of the annotations from funannotate predict? I'd like to do something similar to what I used to do with MAKER output: build a gene model set at each AED cutoff (0.1-1.0), then for each set assess the percentage of RNA-seq reads that map to the gene models as well as the BUSCO completeness score. Is this possible? I couldn't find any reference to AED in the outputs of funannotate predict, unless I'm missing something very obvious.

Thanks for your help in advance!

hyphaltip commented 1 year ago

Not sure: are you asking how AED is computed, or do you want gene-level stats? The XXX.stats.json file in annotate_results, update_results, and predict_results reports various summary stats. If you have provided RNA-seq data, it should include fields like:

"pct_exon_overlap_transcript_evidence": 33.38,
"pct_exon_overlap_protein_evidence": 2.32
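The two fields above can be pulled out of a stats file with a short script. This is only a sketch: the two key names come from the snippet above, but the layout of the rest of the JSON is an assumption, so the helper (a hypothetical `find_keys`) searches the whole structure rather than relying on a fixed path. The `"annotation"` wrapper in the stand-in data is likewise made up for illustration.

```python
import json

# Key names taken from the stats snippet above.
WANTED = {
    "pct_exon_overlap_transcript_evidence",
    "pct_exon_overlap_protein_evidence",
}

def find_keys(node, wanted):
    """Recursively collect values for the wanted keys from nested JSON data."""
    found = {}
    if isinstance(node, dict):
        for key, value in node.items():
            if key in wanted:
                found[key] = value
            else:
                found.update(find_keys(value, wanted))
    elif isinstance(node, list):
        for item in node:
            found.update(find_keys(item, wanted))
    return found

def evidence_overlap(path):
    """Load a stats JSON file (path supplied by the caller) and extract the fields."""
    with open(path) as handle:
        return find_keys(json.load(handle), WANTED)

# Stand-in data; the nesting under "annotation" is hypothetical.
example = json.loads(
    '{"annotation": {"pct_exon_overlap_transcript_evidence": 33.38,'
    ' "pct_exon_overlap_protein_evidence": 2.32}}'
)
print(find_keys(example, WANTED))
```

Running `evidence_overlap` on the stats file from each of predict_results, update_results, and annotate_results would give a quick way to compare evidence support across pipeline stages.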

I agree that other ways of providing summary stats would be useful, and I would welcome coding contributions.

There are already tools that assess the sensitivity and specificity of an annotation against evidence. They operate on GFF / BED files, so you can run them on the annotations funannotate produces. Not sure they achieve all of what you want, but they might be worth examining. The exons and RNA-seq alignments are in the predict_misc folder.

An older tool called EVAL is also out there
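As a rough illustration of the sensitivity/specificity idea mentioned above, here is a minimal sketch that scores one gene model's exons against evidence exons at the nucleotide level and derives an AED-style value using the commonly cited definition AED = 1 - (SN + SP) / 2. It is not funannotate's or MAKER's code. The function names are hypothetical, and coordinates are assumed to be 0-based, half-open (start, end) pairs as in BED; GFF coordinates are 1-based and inclusive and would need converting first.

```python
def merge(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def total_length(intervals):
    """Total number of bases covered by non-overlapping intervals."""
    return sum(end - start for start, end in intervals)

def overlap_length(a, b):
    """Bases shared by two merged, sorted interval lists."""
    shared, i, j = 0, 0, 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            shared += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return shared

def aed(predicted_exons, evidence_exons):
    """AED-style distance: 0 means full agreement, 1 means no support."""
    pred = merge(predicted_exons)
    evid = merge(evidence_exons)
    pred_len, evid_len = total_length(pred), total_length(evid)
    if pred_len == 0 or evid_len == 0:
        return 1.0
    shared = overlap_length(pred, evid)
    sensitivity = shared / evid_len   # fraction of evidence bases predicted
    specificity = shared / pred_len   # fraction of predicted bases supported
    return 1 - (sensitivity + specificity) / 2

# Made-up coordinates: the second exon is only half supported by evidence.
print(aed([(100, 200), (300, 400)], [(100, 200), (300, 350)]))  # 0.125
```

Gene models could then be binned by this value and each bin checked with BUSCO or read-mapping rates, similar to the MAKER workflow described in the original question.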

kushalsuryamohan commented 1 year ago

Thanks @hyphaltip! This is a good start. Let me dig into the json files and also look at the tools you mentioned. Closing this for now.

nextgenusfs commented 1 year ago

As for which gene models end up in the consensus prediction: we currently rely on EVidenceModeler (EVM) for this. It has an internal scoring system, and honestly I'm not an expert on exactly how it works. But EVM can build hybrid models, so it's not always just a choice of which prediction to pick.