nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

Filter gene models based on an AED cutoff during or after update step? #980

Closed. nam-hoang closed this issue 7 months ago.

nam-hoang commented 7 months ago

Hi fun_team,

I completed the funannotate update step with quite a decent result:

average CDS AED: 0.005 average mRNA AED: 0.064

I wonder how I should go about filtering out gene models with a high AED (e.g., >0.5) during or after the update step. Could you please give me some advice on this? I found a length filter, but could not find any AED-based quality filtering in the update.py script.

Thank you very much. Nam

nextgenusfs commented 7 months ago

Filtering by AED isn't really possible in funannotate. The numbers reported by the update script just tell you the average edit distance between the original gene models from predict and the PASA-mediated gene models from update. The consensus gene calls are made in the predict step by EvidenceModeler (EVM), so to change the consensus models you'd need to adjust the weights file, which effectively gives EVM a scoring/ranking of the input ab initio gene models. Filtering isn't something you can do with EVM very easily. Why are you trying to impose a filter on the consensus models? Are there more gene models than expected?

nam-hoang commented 7 months ago

Thanks Jon for the very quick response.

Yes, it seems I got about 5,000-8,000 more genes than I would expect for the plant genome I am working on. When I ran predict, I got 53,780 genes with a BUSCO score of C:98.4%[S:84.1%,D:14.3%],F:0.7%,M:0.9%,n:1614. After update, that changed to 54,462 genes with a BUSCO score of C:97.9%[S:84.0%,D:13.9%],F:1.5%,M:0.6%,n:1614; fragmented BUSCOs somehow increased, but missing BUSCOs decreased. Despite that, the result is still acceptable.

As you have clarified, the AED reported by funannotate is not the same as the AED from the MAKER pipeline, which measures the congruence of each annotation with its overlapping evidence. So it does not make sense to use it to filter gene models the way MAKER's quality_filter.pl does.
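For context on what a MAKER-style AED filter actually does: since funannotate's GFF3 does not carry per-model AED attributes, a filter like this would only apply if the models were re-scored with MAKER or a similar tool. A minimal sketch of such a post-hoc filter on a MAKER-style GFF3, where each mRNA carries an `_AED` attribute (the function name and cutoff here are hypothetical, not part of either pipeline):

```python
def parse_attrs(col9):
    """Parse a GFF3 attribute column (key=value pairs) into a dict."""
    return dict(
        field.split("=", 1) for field in col9.strip().split(";") if "=" in field
    )

def filter_by_aed(lines, cutoff=0.5):
    """Drop mRNA features whose _AED exceeds cutoff, plus their child features."""
    bad = set()  # IDs of rejected mRNAs
    # First pass: find mRNAs that fail the cutoff.
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9 and cols[2] == "mRNA":
            attrs = parse_attrs(cols[8])
            if float(attrs.get("_AED", 0.0)) > cutoff:
                bad.add(attrs["ID"])
    # Second pass: keep every line not belonging to a rejected mRNA.
    kept = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9:
            attrs = parse_attrs(cols[8])
            if attrs.get("ID") in bad or attrs.get("Parent") in bad:
                continue
        kept.append(line)
    return kept
```

A real filter would also need to drop genes left with no surviving mRNAs and handle comma-separated `Parent` lists, which this sketch ignores.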

In your experience, what would be an effective approach to filtering out low-quality gene models, if one genuinely wanted to do so? For this run I used the following weights: --weights augustus:2 pasa:10 snap:1 transcripts:6 proteins:3 GeneMark:1 GlimmerHMM:1.

Thank you very much. Nam

nextgenusfs commented 7 months ago

You could play around with the weights to see whether they have an effect. SNAP often doesn't perform well, so a first test might be to set snap:0, keep the rest the same, re-run predict into a new output folder, and see whether the gene models change. One thing to note about EvidenceModeler (which makes the consensus calls) is that gene models are never derived from the evidence itself: a model must come from one of the ab initio predictors in order to be evaluated. The evidence (transcript and protein alignments) directs EVM to choose among those models, and EVM occasionally decides to make "Frankenstein" models when the evidence supports merging two or more ab initio models. So weighting the alignments just makes EVM prefer ab initio models that better match the source of the evidence (i.e., with your weights, transcript alignments carry 2X the weight of protein alignments). That seems somewhat reasonable.
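A sketch of that re-run as a command line. The genome path, output folder, species name, and CPU count below are placeholders, not values from this thread; `--weights` is the same flag used for the original run, with only snap changed to 0:

```shell
# Re-run predict into a fresh output folder with SNAP effectively disabled
# (snap:0), keeping all other weights as before, then compare gene counts.
funannotate predict \
    -i genome.fasta \
    -o predict_snap0 \
    -s "Species name" \
    --cpus 8 \
    --weights augustus:2 pasa:10 snap:0 transcripts:6 proteins:3 GeneMark:1 GlimmerHMM:1
```

Writing into a new output folder (rather than reusing the old one) keeps the original predict results intact for a side-by-side comparison of the consensus models.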

The PASA step in update runs two rounds of what the developers call "comparison" mode, which effectively re-evaluates the gene models within the context of the alignments in the PASA database. I do often see it call "new" gene models that none of the ab initio predictors predicted; these are probably real/good, as they are based on actual alignments.

nam-hoang commented 7 months ago

@nextgenusfs Thanks Jon for your great advice! Will try to run different weights to see how it goes! Have a great weekend! :D