EarlyEvol closed this issue 4 years ago
Hi Earl --
Yes, you bring up some good points -- perhaps we can be doing a little bit more with the PASA models.
Just so we are on the same page, let me start with what happens to the PASA models from funannotate train.
The last step of the PASA pipeline is to run TransDecoder (https://github.com/PASApipeline/PASApipeline/wiki/PASA_abinitio_training_sets). This is to filter out the de novo transcripts that essentially aren't valid gene models (https://github.com/TransDecoder/TransDecoder/wiki).
What funannotate train does with those results is map the RNA-seq data back to these models using Kallisto and choose the "best" model at each locus (i.e., which transcript at a locus has the most reads mapped, measured as TPM, transcripts per million). The logic here might fall apart if genes are highly tissue specific, over-represented in a portion of the combined datasets, etc.
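A minimal sketch of that "keep the best transcript at each locus" step, assuming a standard Kallisto abundance.tsv (with target_id and tpm columns) and a transcript-to-locus mapping; the function and the tx2locus input are illustrative names, not funannotate's actual code:

```python
import csv

def best_transcript_per_locus(abundance_tsv, tx2locus):
    """Keep the transcript with the highest TPM at each locus.

    abundance_tsv: a Kallisto abundance table ('target_id' and 'tpm' columns).
    tx2locus: hypothetical dict mapping transcript ID -> locus ID
    (funannotate derives this grouping from the PASA output).
    """
    best = {}  # locus -> (tpm, transcript_id)
    with open(abundance_tsv) as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            locus = tx2locus.get(row["target_id"])
            if locus is None:
                continue
            tpm = float(row["tpm"])
            if locus not in best or tpm > best[locus][0]:
                best[locus] = (tpm, row["target_id"])
    return {locus: tx for locus, (tpm, tx) in best.items()}
```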
The idea of selecting the best model here before moving on to training the ab initio predictors is for a couple of reasons: at this point I don't want alternative transcripts, as we will add those back with funannotate update. Along similar lines, we don't want to use gene models that are similar to one another for training the other predictors.
So then what is happening with these models for training in funannotate predict?

When you run funannotate predict and pass --pasa_gff (either directly, or detected from a previous run of train), it will pass all of those models to EVM with a default weight of 6. In other words, it takes all models that pass TransDecoder and hands them to EVM.

--pasa_gff is also used as input to generate a subset/filtered training set for Augustus, GlimmerHMM, and snap. There are a few steps in this process, but the idea is to filter for multiple exons and high intron/exon overlap with the evidence provided, and to ensure that no two gene models are within 80% pident at the protein level. This smaller subset is then used only for training.

As for potential areas where you could give EVM more PASA models:
funannotate predict is set up to basically yield a single consensus gene model per locus, so if this set contains only alternative transcripts, I'm not sure they will help much. It might be worth looking at some of these in detail to make sure that is actually what is happening. Per the question "All the other ab initio predictors come up with about 100K gene models": maybe this means the training isn't working well, or, as you said, it's something more intrinsic to those tools. How many gene models are you expecting? Are the models from GlimmerHMM/snap much shorter? How many of your PASA gene models are being used for training?
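For what it's worth, the 80% protein-identity dedup described above can be sketched roughly like this, assuming the diamond tabular output format shown in the logs (-f 6 qseqid sseqid pident); which member of each pair gets dropped here is an assumption for illustration, not necessarily funannotate's exact logic:

```python
def filter_identical_models(blast_txt, models):
    """Greedily drop one model from each near-identical pair.

    blast_txt: diamond blastp tabular output ('qseqid sseqid pident');
    because the search already applied --id 80 plus coverage cutoffs,
    every hit in the file is a pair failing the redundancy threshold.
    models: iterable of training model IDs.
    """
    keep = set(models)
    with open(blast_txt) as fh:
        for line in fh:
            q, s = line.rstrip("\n").split("\t")[:2]
            if q != s and q in keep and s in keep:
                keep.discard(s)  # keep the query, drop the subject
    return keep
```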
Thanks for the detailed answer. Disclaimer: I'm still on version 1.7. It looks like only the PASA training models are making it to EVM. Does 1.7 behave differently?
Here are some logs:
Training:
[12:20 AM]: Mapping reads using pseudoalignment in Kallisto
[12:23 AM]: Parsing expression value results. Keeping best transcript at each locus.
[12:35 AM]: Wrote 22,835 PASA gene models
Predict:
[08/01/20 01:53:50]: 22,835 PASA genes; 11,203 have multi-CDS; 4,409 from filterGeneMark; 3,545 have multi-CDS
[08/01/20 01:53:50]: filterGeneMark GTF filter set to True; require genes with multiple CDS set to True
[08/01/20 01:53:57]: diamond makedb --in augustus.training.proteins.fa --db aug_training.dmnd
[08/01/20 01:53:57]: diamond blastp --query augustus.training.proteins.fa --db aug_training.dmnd --more-sensitive -o aug.blast.txt -f 6 qseqid sseqid pident --query-cover 80 --subject-cover 80 --id 80 --no-self-hits
[08/01/20 01:54:01]: 138 models fail blast identity threshold
[08/01/20 01:54:11]: 19,428 models will be ignored for training Augustus
[08/01/20 01:54:30]: 3,297 of 22,835 models pass training parameters
...
[08/01/20 04:08:49]: Prediction sources: [u'Augustus', u'HiQ', u'GeneMark', u'GlimmerHMM', u'pasa']
[08/01/20 04:08:54]: Summary of gene models: {u'GeneMark': 98970, u'HiQ': 7315, u'pasa': 3297, u'Augustus': 29229, u'GlimmerHMM': 123353, u'total': 262164}
[08/01/20 04:08:54]: EVM Weights: {u'GeneMark': 1, u'HiQ': 5, u'pasa': 10, u'proteins': 1, u'Augustus': 3, u'GlimmerHMM': 1, u'transcripts': 1}
[08/01/20 04:08:54]: Summary of gene models passed to EVM (weights):
[08/01/20 04:08:54]: Launching EVM via funannotate-runEVM.py
The gene_prediction file only has 3297 pasa models.
[earlm1@login3 predict_misc]$ cat gene_predictions.gff3 | grep pasa | grep gene | wc -l
3297
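Since a plain grep can over- or under-count (a "pasa" substring can also appear in attribute fields), here is a small sketch that instead tallies gene features by the GFF3 source column (column 2); the function name is illustrative:

```python
from collections import Counter

def count_gene_models(gff3_path):
    """Tally 'gene' features per prediction source (GFF3 column 2)."""
    counts = Counter()
    with open(gff3_path) as fh:
        for line in fh:
            if line.startswith("#") or not line.strip():
                continue  # skip comments and blank lines
            cols = line.rstrip("\n").split("\t")
            if len(cols) >= 3 and cols[2] == "gene":
                counts[cols[1]] += 1
    return counts
```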
If this is just the old behavior and the new version changes it, let me know and I'll just suck it up and update.
As for the other predictors: I'm expecting something like 15K loci, although there are no close relatives sequenced. Yeah, the GeneMark/GlimmerHMM/snap predictions are primarily fragmented, and there are a ton out in heterochromatin deserts with no functional annotations, so I'm guessing they are erroneous. I haven't spent much time seeing if I can get GlimmerHMM or snap to perform better. Augustus seems to look really good for everything that has RNA. I might take a stab at it for my next annotations for this group.
Thanks, Earl
I will have to look at the code in 1.7.4 -- but if that's what it is doing, then it sounds like a bug, as that was not the intention. Let me ensure it's working the way I said it was in master. I'm trying to get a new release out shortly; just testing on a few larger genomes before tagging it.
I don't see any obvious changes in v1.7 or 1.7.4 as they relate to processing the PASA input GFF3 files and passing them to EVM... so this must mean it's still a bug. I'll dig into it now and see if I can figure it out.
I just looked at logs from 2 other genomes. EVM got passed the larger "best" PASA models, not just the augustus training set. Must be something weird. I'm completely rerunning predict now. Thanks for spending time on this.
I think I found the bug in the current code -- pushing a change now. It might have only shown up on a re-run of predict.
Nice! Glad you found it. That makes sense: I first attempted to run train and predict together. It got killed at some point within predict and I restarted. I think I deleted the incomplete intermediates, but left most of predict_misc.
Let me know if a fresh run results in the same problem -- if it does, let's look specifically at the version you are running, and perhaps I can just tell you which line to comment out if you don't want to upgrade. Generally I would always recommend upgrading -- but I realize you want to be consistent, especially if doing comparisons between genomes.
Jon, sorry for the delayed response, I got distracted trying to move stuff over to a new compute cluster. On the rerun of predict with no intermediates, it still sent just the PASA training models to EVM. Not sure what the difference was for this assembly vs. the others I have annotated. I checked out your recent commits, and it looked like I needed to comment out lines 1298 and 1299. That did the trick! Thanks for checking this out for me! Earl
Great Earl -- thanks for following up. Yes it looks like those two lines on a re-run were basically over-writing the input PASA to the PASA-training set.
Hi Jon,
Unfortunately I decided to stare at my annotations again...and I think I could do better.
I was just wondering if there was a specific reason not to include PASA models beyond the ones that pass training filtering. For my wasps, it looks like PASA is almost always the best model. I think my goal should probably be to take the PASA models and add ab initio models for loci without RNA evidence. In particular, the Augustus models tend to break on long or alternatively spliced models, and all the other ab initio predictors come up with about 100K gene models. Not sure what's going on; perhaps they don't deal well with the 30% GC content. Is there a reason not to provide the full complement of PASA models to EVM under another "OTHER_PREDICTION" category? It seems like the main problem would be that I would miss extending some PASA models with longer ab initio models because of the assigned weights. Am I thinking about this right?
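For context, EVM's weights file is three tab-separated columns: evidence class, source name, weight. Adding the full PASA set as an extra OTHER_PREDICTION source might look roughly like this; the pasa_full source name and its weight are hypothetical, and the class assigned to each existing source is a guess at how funannotate registers them, not taken from its code:

```
ABINITIO_PREDICTION	augustus	3
ABINITIO_PREDICTION	genemark	1
ABINITIO_PREDICTION	glimmerhmm	1
OTHER_PREDICTION	HiQ	5
OTHER_PREDICTION	pasa	10
OTHER_PREDICTION	pasa_full	8
TRANSCRIPT	transcripts	1
PROTEIN	proteins	1
```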
Best, Earl