Open melop opened 10 months ago
Hi @melop. I'm struggling with a similar issue on training some other higher eukaryotes as well -- ie after consensus model generation (EVM) I'm seeing lower BUSCO scores.
It's not entirely clear (yet) where the issue is and how to best mitigate it (its hard for me to find time outside of work). While it's not ready yet for public, I've been working on funannotate2
which will feature simplified code base, better training (hopefully), and simpler execution. What I've used for protein alignments in funannotate2
is miniprot
as it is very fast and does a nice job with splice-sites.
But I was aware of metaeuk
limitations and for training I've re-written BUSCO for training ab initial gene callers here: https://github.com/nextgenusfs/buscolite. As the current BUSCO release leans on metaeuk
which isn't useful because the alignments can't be trusted (as you found out), but also practically I couldn't build conda environments with the dependencies in the busco conda recipe.
My other question is, if a gene only receives support from protein alignment but no trinity transcripts, will funannotate just not see it as a valid gene?
This isn't super clear because it is up to EvidenceModeler to choose the consensus models, sometimes it will actually create new models from the existing ab-initio evidence (Frankenstein model) while other times depending on the scoring it will choose an existing ab initial model. But the important part is that "evidence" alone will not result in a gene model -- the evidence is used to pick/build a consensus model. So you should still get gene predictions if you don't have transcript evidence in those regions. But it is also dependent on the weights given to EvidenceModeler.
yeah metaeuk is not ideal - i would use augustus in BUSCO instead if you can. it can do further training on the prediction params.
I am not convinced our long read RNAseq handling works - the pacbio tools or minimap2 will be more appropriate.
funannotate runs exonerate for splice site polished protein alignemnts but I am hoping we can replace it with miniprot2 soon as that seems nearly as good.
I have a couple of genomes were EVM is undercalling genes in some examples and I want to better understand why that is too.
Thanks a lot for the input. For long read RNAseq data, I used stringtie to obtain transcript assemblies first instead of feeding raw reads to funannotate. It appears that in the presence of long rnaseq reads, stringtie does a very nice job getting transcripts (as opposed to feeding it with NGS only data). Does funannotate use minimap2 for mapping assembled cDNA to the genome? The --aligners parameter appears to have three options: minimap2, blat and gmap. What are the defaults and how to pick? Thanks! I will try out miniprot.
Is there a way to prevent the updating step from alterning exons and CDS, but instead only update UTRs and create new gene models?
Alternatively, is there a way to "merge" the gff3 with the prediction step, prioritizing, the latter CDS structure? Would a tool like GFFCompare helpful?
Ok, sort of answered myself here. I used gffread to merge the gffs before and after update with the following command:
sp=orig grep -vP "^#" $origin_gff | \ sed 's/FUN_/'$sp'/g' | \ sed 's/novel_/'$sp'_/g' > orig.gff3 cat $updated_gff orig.gff3 > combined.gff gffread -g $fasta --t-adopt --keep-genes -V -F -M -d dup.info -K -w out.mrna.fa -x out.cds.fa -y out.prot.fa -S -o merged.gff combined.gff
And the resulting complete BUSCO became 97.3%. So for now this appears to be a temporary solution.
I have been using funannotate to annotate fish genomes for a while. We found that funannotate performs very well when ISO-seq data are available (final busco > 98.5%), but doesn't work as well when only 2nd gen RNAseq data and homologous proteins are used (typically yielding a final busco of <91%). When only NGS rnaseq is available, I ran metaeuk using fish proteoms to obtain fake "transcripts" on the target genome and combined them with trinity transcripts. This initial joined "transcript set" had a BUSCO score of >99%. Nevertheless, many transcripts representing what appeared to be "complete BUSCOs" failed to pass the PASA validation during the training step, as can be seen in training/pasa/alignment.validations.output. Most of these had abnormal splice sites. After consulting with the authors of metaeuk, it turns out that metaeuk doesn't check for splice site validity when performing the alignments, so it all make sense. My question is, what would be the right program to use so that we can leverage homolgous protein data? Genewise or exonerate perhaps? My other question is, if a gene only receives support from protein alignment but no trinity transcripts, will funannotate just not see it as a valid gene?
Best regards, Ray