Open kushalsuryamohan opened 1 year ago
I've seen this happen in some large genomes I was working with too, though I can't link it to a specific cause. I suspect EVM is to blame when there are a lot of conflicts among the predictions. The first thing to try is to rerun with the weight for glimmerhmm set to 0, given that it has 3x the number of calls; I suspect lots of genes are getting split. I don't know that for sure, but it would be fast to drop it and re-run: --weights codingquarry:1 glimmerhmm:0
What I've done in the past is load the GFF in JBrowse2 and compare the individual gene tracks with the final EVM set and with where the TEs are.
Another option is to compare the BUSCO genome calls to the BUSCO protein calls and look for BUSCO genes that are missing from the annotation but are clearly present in the BUSCO genome-based gene calls.
Another thought is to just try BRAKER3 and see how it compares in the predictions.
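The BUSCO genome-versus-protein comparison suggested above can be sketched in a few lines of Python. This is only a rough sketch, assuming both BUSCO runs used the same lineage dataset and wrote the usual tab-separated `full_table.tsv` (BUSCO id in column 1, status in column 2, `#` comment lines at the top). The function names, file paths, and the demo IDs are hypothetical and not part of funannotate or BUSCO.

```python
import csv


def busco_status(lines):
    """Map BUSCO id -> status from the lines of a BUSCO full_table.tsv."""
    data = (ln for ln in lines if ln.strip() and not ln.startswith("#"))
    status = {}
    for row in csv.reader(data, delimiter="\t"):
        # Duplicated BUSCOs appear on several rows with the same status.
        status[row[0]] = row[1]
    return status


def found_in_genome_only(genome_lines, protein_lines):
    """BUSCO ids that are Complete/Duplicated in genome mode but not in protein mode."""
    ok = {"Complete", "Duplicated"}
    genome = busco_status(genome_lines)
    protein = busco_status(protein_lines)
    return sorted(
        busco_id
        for busco_id, state in genome.items()
        if state in ok and protein.get(busco_id, "Missing") not in ok
    )


if __name__ == "__main__":
    # Made-up rows for illustration only; real use would read the two files, e.g.
    # open("busco_genome/full_table.tsv") and open("busco_proteins/full_table.tsv")
    # (hypothetical paths).
    genome_demo = [
        "# Busco id\tStatus\tSequence\n",
        "id_A\tComplete\tscaffold_1\n",
        "id_B\tComplete\tscaffold_2\n",
        "id_C\tMissing\n",
    ]
    protein_demo = [
        "# Busco id\tStatus\tSequence\n",
        "id_A\tComplete\tgene_0001-T1\n",
        "id_B\tFragmented\tgene_0417-T1\n",
        "id_C\tMissing\n",
    ]
    for busco_id in found_in_genome_only(genome_demo, protein_demo):
        print(busco_id)
```

The resulting IDs can then be looked up in the genome-mode table to get scaffold coordinates, which are convenient starting points for inspecting the individual predictor tracks in a genome browser.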
@kushalsuryamohan I don't know the problem/issue off the top of my head. I'm interested in understanding how/why the training is so different between the ab initio tools (i.e., why are GlimmerHMM and snap so bad, and why is GeneMark 1/2 of what it should be?). Would you be interested in/willing to share the data privately so I can try to look? I would just need the assembly and your PASA GFF3 (TransDecoder filtered would be fine). Mostly I'm working on funannotate2, which is effectively a code cleanup: reduce dependencies and simplify as much as possible (one route is to replace EVM with a consensus gene model tool). I don't have this completely working yet, but if I need to make design changes for training it would be good to know sooner rather than later. If you would like to share data, you can send it to my email: nextgenusfs at gmail.
Hi Jon @nextgenusfs, sure, I'd be happy to share this (although it is unpublished data, so I'd appreciate it being kept private). How do you propose I share the data?
Thanks!
Yes, will keep it private -- I just want to see if I can figure out why this is happening and if we can fix it. Is it possible to email me a compressed version, or possibly a shared link from somewhere?
Will do. Can you share your email address, please?
nextgenusfs at gmail
Hi @hyphaltip and @nextgenusfs, out of curiosity and to dig deeper into this, I extracted transcripts from the Augustus GFF3 (from the predict_misc directory) and ran BUSCO on them. Here are the BUSCO results (metazoa):
Results from dataset metazoa_odb10

C:98.2%[S:89.4%,D:8.8%],F:1.0%,M:0.8%,n:954

| Count | Category |
| --- | --- |
| 937 | Complete BUSCOs (C) |
| 853 | Complete and single-copy BUSCOs (S) |
| 84 | Complete and duplicated BUSCOs (D) |
| 10 | Fragmented BUSCOs (F) |
| 7 | Missing BUSCOs (M) |
| 954 | Total BUSCO groups searched |
I, like you, suspect that this has to do with EVM.
To recap, here are the weights passed to EVM:
| Source | Weight | Count |
| --- | --- | --- |
| Augustus | 1 | 19456 |
| Augustus HiQ | 2 | 7523 |
| GeneMark | 1 | 105111 |
| GlimmerHMM | 1 | 331206 |
| pasa | 6 | 21812 |
| snap | 1 | 197409 |
Can I pass --weights codingquarry:1 glimmerhmm:0 snap:0 and also increase the weight for Augustus to 2 and for Augustus HiQ to something higher? Any suggestions?
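To put the model counts listed above in proportion, here is a small Python sketch that expresses each source's count relative to a baseline. The numbers are copied from the weights listing in this comment; the helper name `relative_to` is made up for illustration, and the ratio is only a rough indicator of how heavily a predictor over-calls, not a description of how EVM combines evidence.

```python
# Gene-model counts per source, copied from the EVM weights listing in this thread.
COUNTS = {
    "Augustus": 19456,
    "Augustus HiQ": 7523,
    "GeneMark": 105111,
    "GlimmerHMM": 331206,
    "pasa": 21812,
    "snap": 197409,
}


def relative_to(counts, baseline):
    """Return each source's model count divided by the baseline source's count."""
    base = counts[baseline]
    return {source: n / base for source, n in counts.items()}


if __name__ == "__main__":
    ratios = relative_to(COUNTS, "Augustus")
    # Print sources from most to fewest models, with the fold difference vs. Augustus.
    for source, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{source:<14}{COUNTS[source]:>8}  {ratio:5.1f}x Augustus")
```

With these counts, GlimmerHMM and snap produce roughly 17x and 10x as many models as Augustus, and GlimmerHMM about 3x as many as GeneMark, which fits the suspicion that fragmented or split calls are reaching EVM.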
@nextgenusfs I am uploading the data per your request. Will send you an email once the data has been uploaded for you to take a look at as well. I appreciate the help very much.
Hi @nextgenusfs, I've uploaded the genome and PASA GFF3 and sent you an email with access details. Can you also see my previous post above? It might help you troubleshoot this as well. Thanks!
For vertebrate genomes, I set the weights for all gene predictors except augustus to 0. I never got the others (including the non-free genemark) to perform well, and adding them makes the final prediction much worse. Your results confirm my experience.
Hello, I am trying to annotate a reptilian genome that has a fairly good scaffolded assembly (N50 = ~215 Mb; 1,966 scaffolds). Genome BUSCO completeness is ~93%.
I have RNA-seq data from a pooled set of tissues. Trinity-derived de novo transcriptome models have a lower BUSCO score of ~61% (this suggests a lack of diversity in the libraries and/or poor RNA-seq data quality). Despite this, I assumed that a combination of Augustus + PASA could provide a good annotated gene set. However, the final set of gene predictions has a ~30% BUSCO completeness score.
I'm a little confused as to how this might be happening and would appreciate some guidance on improving this.
Here are the step-by-step commands I used to get to this point.
Here is the log of training:
Here's the BUSCO log for the trinity models:
And for PASA models:
Below is the log of predict:
The final genome annotation BUSCO, as I mentioned, is ~33%:
Any help/insights here will be much appreciated!