nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License
300 stars 82 forks source link

Splice site validations failed in training/pasa step #956

Open melop opened 10 months ago

melop commented 10 months ago

I have been using funannotate to annotate fish genomes for a while. We found that funannotate performs very well when ISO-seq data are available (final busco > 98.5%), but doesn't work as well when only 2nd gen RNAseq data and homologous proteins are used (typically yielding a final busco of <91%). When only NGS rnaseq is available, I ran metaeuk using fish proteoms to obtain fake "transcripts" on the target genome and combined them with trinity transcripts. This initial joined "transcript set" had a BUSCO score of >99%. Nevertheless, many transcripts representing what appeared to be "complete BUSCOs" failed to pass the PASA validation during the training step, as can be seen in training/pasa/alignment.validations.output. Most of these had abnormal splice sites. After consulting with the authors of metaeuk, it turns out that metaeuk doesn't check for splice site validity when performing the alignments, so it all make sense. My question is, what would be the right program to use so that we can leverage homolgous protein data? Genewise or exonerate perhaps? My other question is, if a gene only receives support from protein alignment but no trinity transcripts, will funannotate just not see it as a valid gene?

Best regards, Ray

nextgenusfs commented 10 months ago

Hi @melop. I'm struggling with a similar issue on training some other higher eukaryotes as well -- ie after consensus model generation (EVM) I'm seeing lower BUSCO scores.

It's not entirely clear (yet) where the issue is and how to best mitigate it (its hard for me to find time outside of work). While it's not ready yet for public, I've been working on funannotate2 which will feature simplified code base, better training (hopefully), and simpler execution. What I've used for protein alignments in funannotate2 is miniprot as it is very fast and does a nice job with splice-sites.

But I was aware of metaeuk limitations and for training I've re-written BUSCO for training ab initial gene callers here: https://github.com/nextgenusfs/buscolite. As the current BUSCO release leans on metaeuk which isn't useful because the alignments can't be trusted (as you found out), but also practically I couldn't build conda environments with the dependencies in the busco conda recipe.

My other question is, if a gene only receives support from protein alignment but no trinity transcripts, will funannotate just not see it as a valid gene?

This isn't super clear because it is up to EvidenceModeler to choose the consensus models, sometimes it will actually create new models from the existing ab-initio evidence (Frankenstein model) while other times depending on the scoring it will choose an existing ab initial model. But the important part is that "evidence" alone will not result in a gene model -- the evidence is used to pick/build a consensus model. So you should still get gene predictions if you don't have transcript evidence in those regions. But it is also dependent on the weights given to EvidenceModeler.

hyphaltip commented 10 months ago

yeah metaeuk is not ideal - i would use augustus in BUSCO instead if you can. it can do further training on the prediction params.

I am not convinced our long read RNAseq handling works - the pacbio tools or minimap2 will be more appropriate.

funannotate runs exonerate for splice site polished protein alignemnts but I am hoping we can replace it with miniprot2 soon as that seems nearly as good.

I have a couple of genomes were EVM is undercalling genes in some examples and I want to better understand why that is too.

melop commented 10 months ago

Thanks a lot for the input. For long read RNAseq data, I used stringtie to obtain transcript assemblies first instead of feeding raw reads to funannotate. It appears that in the presence of long rnaseq reads, stringtie does a very nice job getting transcripts (as opposed to feeding it with NGS only data). Does funannotate use minimap2 for mapping assembled cDNA to the genome? The --aligners parameter appears to have three options: minimap2, blat and gmap. What are the defaults and how to pick? Thanks! I will try out miniprot.

melop commented 9 months ago

After trying miniprot2, I found that much fewer splice errors occurred, and BUSCO completeness has increased to 96.1% after the prediction step! But, after the updating step, BUSCO score again decreased to 95.6%. The update step showed: [Sep 09 06:05 AM]: Updated annotation complete:

Total Gene Models: 39,632 Total transcripts: 41,355 New Gene Models: 1,185 No Change: 21,712 Update UTRs: 11,907 Exons Changed: 4,765 Exons/CDS Changed: 63 Dropped Models: 95 CDS AED: 0.020 mRNA AED: 0.049

Is there a way to prevent the updating step from alterning exons and CDS, but instead only update UTRs and create new gene models?

Alternatively, is there a way to "merge" the gff3 with the prediction step, prioritizing, the latter CDS structure? Would a tool like GFFCompare helpful?

melop commented 9 months ago

Ok, sort of answered myself here. I used gffread to merge the gffs before and after update with the following command:

sp=orig grep -vP "^#" $origin_gff | \ sed 's/FUN_/'$sp'/g' | \ sed 's/novel_/'$sp'_/g' > orig.gff3 cat $updated_gff orig.gff3 > combined.gff gffread -g $fasta --t-adopt --keep-genes -V -F -M -d dup.info -K -w out.mrna.fa -x out.cds.fa -y out.prot.fa -S -o merged.gff combined.gff

And the resulting complete BUSCO became 97.3%. So for now this appears to be a temporary solution.