nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

MetaEuk for predicting genes #678

Open olekto opened 2 years ago

olekto commented 2 years ago

Hi Jon, I think you mentioned MetaEuk (https://github.com/soedinglab/metaeuk) to me at some point (as a comment in a failed research application). I experimented with it a bit after that, and again recently.

Running it with proteins from OrthoDB10 (prepared as described for ProtHint here: https://github.com/gatech-genemark/ProtHint#protein-database-preparation), concatenated with SwissProt/UniProtKB, gives 90+ % complete genes with BUSCO5 for both a bird and a fish. 90+ % might not seem that much, but considering that this is quite easy and quick to run, it can give a good first impression or a basis for further annotation.
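For anyone wanting to try the same thing, here is a minimal sketch of the protein database step in Python. It concatenates several protein FASTA files, skips records whose ID has already been seen, and builds (without running) a MetaEuk command line. The function names and file layout are hypothetical, and the positional order for `metaeuk easy-predict` (contigs, proteins, output prefix, temporary directory) is taken from my reading of the MetaEuk README, so check `metaeuk easy-predict -h` before relying on it.

```python
from pathlib import Path
from typing import Iterable, List


def merge_fasta(inputs: Iterable[Path], merged: Path) -> int:
    """Concatenate protein FASTA files, keeping the first record for each ID.

    Returns the number of records written.
    """
    seen = set()
    kept = 0
    with merged.open("w") as out:
        for path in inputs:
            keep = False  # lines before the first header are dropped
            with path.open() as handle:
                for line in handle:
                    if line.startswith(">"):
                        fields = line[1:].split()
                        record_id = fields[0] if fields else ""
                        keep = record_id not in seen
                        if keep:
                            seen.add(record_id)
                            kept += 1
                    if keep:
                        # Guard against a missing newline at the end of a file.
                        out.write(line if line.endswith("\n") else line + "\n")
    return kept


def metaeuk_command(genome: Path, proteins: Path, prefix: str, tmp_dir: Path) -> List[str]:
    """Build the argument list for a MetaEuk run (assumed positional order)."""
    return ["metaeuk", "easy-predict", str(genome), str(proteins), prefix, str(tmp_dir)]
```

With hypothetical inputs such as `odb10_proteins.faa` and `uniprot_sprot.fasta`, one would call `merge_fasta(...)` once and then pass the result of `metaeuk_command(...)` to `subprocess.run(..., check=True)`. Deduplicating on the first header token is a simplification; OrthoDB and UniProt use different ID schemes, so real duplicates between them would not be caught this way.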

One issue is that MetaEuk doesn't necessarily give good exon-intron borders, since it only maps proteins, not transcripts, to the genome. But maybe running Augustus in addition could address that.

My feeble attempts at bringing the MetaEuk-predicted genes into Funannotate led to worse results than MetaEuk on its own, but there are likely things that could be addressed.

Do any of you have experience running MetaEuk in addition to Funannotate? Would it be an idea to include it in Funannotate in some way, as a rapid annotation pipeline?

Thank you.

Sincerely, Ole

nextgenusfs commented 2 years ago

Hi Ole,

So the newest version of BUSCO5 now uses MetaEuk by default. I've played around a little with the idea of swapping the diamond/exonerate protein mapping for MetaEuk, but the problem is exactly what you've highlighted: the intron-exon borders are not great. Since those borders are what we care most about when training de novo predictors, I didn't pursue it as a means of training. But you are correct that one could use the hits from MetaEuk as seeds for something else, i.e. Augustus or exonerate or anything that will create a complete gene model. I've not used ProtHint before, but it seems similar to what funannotate does with the diamond/exonerate mapping of the protein evidence, and it is a decent idea to use the ODB10 models. I tend to prefer providing only gene models that come from real experimental evidence, hence UniProt/SwissProt, but I could see value in using the ODB10 models.
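To make the "hits as seeds" idea concrete, here is a rough Python sketch, not anything funannotate does today. It assumes the protein hits are already available as hypothetical `(contig, start, end)` tuples with 1-based inclusive coordinates, pads each hit with a flank, and merges overlapping padded hits into regions that could then be handed to a gene predictor or aligner. The function name, the tuple layout, and the flank size are all illustrative choices, and strand is ignored for simplicity.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Hit = Tuple[str, int, int]  # (contig, start, end), 1-based inclusive


def seed_regions(hits: Iterable[Hit], flank: int, contig_lengths: Dict[str, int]) -> List[Hit]:
    """Pad each hit by `flank` bases and merge overlapping hits per contig."""
    by_contig = defaultdict(list)
    for contig, start, end in hits:
        # Minus-strand hits may be reported with start > end; normalise them.
        lo, hi = min(start, end), max(start, end)
        padded_start = max(1, lo - flank)
        padded_end = min(contig_lengths[contig], hi + flank)
        by_contig[contig].append((padded_start, padded_end))

    regions: List[Hit] = []
    for contig in sorted(by_contig):
        merged: List[List[int]] = []
        for start, end in sorted(by_contig[contig]):
            if merged and start <= merged[-1][1]:
                # Overlaps the previous region: extend it.
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        regions.extend((contig, s, e) for s, e in merged)
    return regions
```

Each returned region could be extracted from the genome and passed to Augustus or exonerate together with the matching protein. A flank wide enough to cover terminal exons and UTRs would need tuning per genome.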

One other way to generate a quick annotation would be to just lift over gene models from a closely related species. This won't be perfect, but many of the core genes from closely related organisms would likely lift over.

Jon