nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

only pasa training models get sent to EVM during predict #474

Closed EarlyEvol closed 4 years ago

EarlyEvol commented 4 years ago

Hi Jon,

Unfortunately I decided to stare at my annotations again...and I think I could do better.

I was just wondering if there was a specific reason not to include PASA models beyond the ones that pass training filtering. For my wasps, it looks like PASA is almost always the best model. I think my goal should probably be to take the PASA models and add ab initio models for loci without RNA evidence. In particular, the Augustus models tend to break on long or alternatively spliced genes. All the other ab initio predictors come up with about 100K gene models. Not sure what's going on; perhaps they don't deal well with the 30% GC content. Is there a reason not to provide the full complement of PASA models to EVM with another "OTHER_PREDICTION" category? It seems like the main problem would be that I would miss extending some PASA models with longer ab initio models because of the assigned weights. Am I thinking about this right?

Best, Earl

nextgenusfs commented 4 years ago

Hi Earl --

Yes you bring up some good points -- perhaps we can be doing a little bit more with the PASA models.

Just so we are on the same page, let me lay out what happens with the PASA models from funannotate train.

  1. The last step of the PASA pipeline is to run TransDecoder (https://github.com/PASApipeline/PASApipeline/wiki/PASA_abinitio_training_sets). This is to filter out the de novo transcripts that essentially aren't valid gene models (https://github.com/TransDecoder/TransDecoder/wiki).

  2. What funannotate train does with those results is map the RNA-seq data back to these models using Kallisto and choose the "best" model at each locus (i.e., whichever transcript at a locus has the most mapped reads / highest TPM, transcripts per million). The logic here might fall apart if genes are highly tissue specific, over-represented in a portion of the combined datasets, etc. A rough sketch of this per-locus selection is below.
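Here is a minimal sketch of that per-locus selection (an illustration, not funannotate's actual code). It assumes transcript-to-locus assignments are already in a plain dict (in practice they would come from the PASA GFF3) and a standard Kallisto abundance.tsv (columns target_id, length, eff_length, est_counts, tpm):

import csv

def best_transcript_per_locus(abundance_tsv, tx2locus):
    # locus -> (tpm, transcript_id), keeping the highest-TPM transcript
    best = {}
    with open(abundance_tsv) as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            tx = row["target_id"]
            locus = tx2locus.get(tx)
            if locus is None:
                continue  # transcript was not assigned to a locus
            tpm = float(row["tpm"])
            if locus not in best or tpm > best[locus][0]:
                best[locus] = (tpm, tx)
    return {locus: tx for locus, (tpm, tx) in best.items()}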

Selecting a single best model per locus before training the ab initio predictors serves a couple of purposes: at this point I don't want alternative transcripts, as we will add those back with funannotate update, and along similar lines we don't want to train the other predictors on gene models that are similar to one another.

So then what is happening with these models for training in funannotate predict?

  1. When you run funannotate predict and pass --pasa_gff (either directly or as detected from a previous run of train), it passes all of those models to EVM with a default weight of 6. That is, every model that passed TransDecoder goes to EVM.
  2. The --pasa_gff is also used as input to generate a subset/filtered training set for Augustus, GlimmerHMM, and snap. There are a few steps in this process, but the idea is to filter for multiple exons and high intron/exon agreement with the provided evidence, and to ensure that no two gene models share 80% or more identity at the protein level. This smaller subset is then used only for training (a sketch of the identity filter follows this list).
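As a rough illustration of that redundancy filter (not the literal funannotate code), assume a DIAMOND blastp tabular file (-f 6 qseqid sseqid pident) that was already restricted to >=80% identity, >=80% coverage, non-self hits, so every row names a redundant pair:

def drop_redundant(diamond_tab, candidate_ids):
    # Greedily drop one model from each reported pair; each row is
    # already a >=80% identity / >=80% coverage hit between two models.
    keep = set(candidate_ids)
    with open(diamond_tab) as fh:
        for line in fh:
            qseqid, sseqid = line.split("\t")[:2]
            if qseqid in keep and sseqid in keep:
                keep.discard(sseqid)  # keep the query, drop the subject
    return keep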

Some potential ways you could give EVM more PASA models:

  1. It is possible to pass more of the original PASA transcripts to EVM -- however, we would need to make sure they are actual protein-coding gene models if they are passed to EVM as 'other_predictions' (see the sketch after this list). I basically just used the existing TransDecoder workflow; there might be an improved or alternative method for doing that. I would have to look into this a bit to see whether valid genes are getting missed by TransDecoder.
  2. As part of the Kallisto selection of a single gene model per locus, we could rescue the genes that don't pass this step (i.e., alternative transcripts) -- and potentially pass those to EVM, maybe with a reduced weighting. However, funannotate predict is set up to yield a single consensus gene model per locus, so if only alternative transcripts are in this set, I'm not sure they will help much. I would need to look in detail at some of these to make sure that is actually what is happening.
  3. One could rescue the non-protein-coding PASA transcripts from before TransDecoder -- maybe those could be passed to EVM as transcript evidence. This would be partially redundant with the rna_bam processing to find intron/exon boundaries as hints for Augustus. However, it might be a good idea to try -- I just haven't tested it, so I don't know whether it will bring value.
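For idea 1, the mechanics could look something like this sketch (hypothetical; the source tag 'pasa_other' and the weight are illustrative, not funannotate defaults). EVM matches GFF3 column 2 against its weights file, so giving the extra models their own source tag lets them carry their own weight:

def retag_gff3(gff_in, gff_out, new_source="pasa_other"):
    # Rewrite the GFF3 source column (column 2) for every feature line.
    with open(gff_in) as src, open(gff_out, "w") as dst:
        for line in src:
            cols = line.rstrip("\n").split("\t")
            if len(cols) == 9:
                cols[1] = new_source
                line = "\t".join(cols) + "\n"
            dst.write(line)

# The matching weights file entry (tab-separated) might then be:
# OTHER_PREDICTION    pasa_other    6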

On your comment that "All the other ab initio predictors come up with about 100K gene models" -- maybe this means the training isn't working well, or, as you said, it is something more intrinsic to those tools. How many gene models are you expecting? Are the models from GlimmerHMM/snap much shorter? How many of your PASA gene models are being used for training?

EarlyEvol commented 4 years ago

Thanks for the detailed answer. Disclaimer: I'm still on version 1.7. It looks like only the training PASA models are making it to EVM. Does 1.7 behave differently?

Here are some logs:

Training:

[12:20 AM]: Mapping reads using pseudoalignment in Kallisto
[12:23 AM]: Parsing expression value results. Keeping best transcript at each locus.
[12:35 AM]: Wrote 22,835 PASA gene models

Predict:

[08/01/20 01:53:50]: 22,835 PASA genes; 11,203 have multi-CDS; 4,409 from filterGeneMark; 3,545 have multi-CDS
[08/01/20 01:53:50]: filterGeneMark GTF filter set to True; require genes with multiple CDS set to True
[08/01/20 01:53:57]: diamond makedb --in augustus.training.proteins.fa --db aug_training.dmnd
[08/01/20 01:53:57]: diamond blastp --query augustus.training.proteins.fa --db aug_training.dmnd --more-sensitive -o aug.blast.txt -f 6 qseqid sseqid pident --query-cover 80 --subject-cover 80 --id 80 --no-self-hits
[08/01/20 01:54:01]: 138 models fail blast identity threshold
[08/01/20 01:54:11]: 19,428 models will be ignored for training Augustus
[08/01/20 01:54:30]: 3,297 of 22,835 models pass training parameters

...

[08/01/20 04:08:49]: Prediction sources: [u'Augustus', u'HiQ', u'GeneMark', u'GlimmerHMM', u'pasa']
[08/01/20 04:08:54]: Summary of gene models: {u'GeneMark': 98970, u'HiQ': 7315, u'pasa': 3297, u'Augustus': 29229, u'GlimmerHMM': 123353, u'total': 262164}
[08/01/20 04:08:54]: EVM Weights: {u'GeneMark': 1, u'HiQ': 5, u'pasa': 10, u'proteins': 1, u'Augustus': 3, u'GlimmerHMM': 1, u'transcripts': 1}
[08/01/20 04:08:54]: Summary of gene models passed to EVM (weights):
[08/01/20 04:08:54]: Launching EVM via funannotate-runEVM.py

The gene_predictions.gff3 file only has 3,297 pasa models.

[earlm1@login3 predict_misc]$ cat gene_predictions.gff3 | grep pasa | grep gene | wc -l
3297
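For a stricter count (grep can also match "gene" inside IDs and other fields), something like this column-exact tally would work:

from collections import Counter

counts = Counter()
with open("gene_predictions.gff3") as fh:
    for line in fh:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9 and cols[2] == "gene":
            counts[cols[1]] += 1  # column 2 is the prediction source
print(counts)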

If this is just the old behavior and the new version changes it, let me know and I'll just suck it up and update.

As for the other predictors: I'm expecting something like 15K loci, although there are no close relatives sequenced. Yeah, the GeneMark/GlimmerHMM/snap predictions are mostly fragmented, and there are a ton out in the heterochromatin desert with no functional annotations, so I'm guessing they are erroneous. I haven't spent much time seeing if I can get GlimmerHMM or snap to perform better. Augustus seems to look really good for everything that has RNA. I might take a stab at tuning the others for my next annotations in this group.

Thanks, Earl

nextgenusfs commented 4 years ago

I will have to look at the code in 1.7.4 -- but if that's what it is doing, then it sounds like a bug, as that was not the intention. Let me ensure it's working the way I said it does in master. I'm trying to get a new release out shortly; I'm just testing on a few larger genomes before tagging it.

nextgenusfs commented 4 years ago

I don't see any obvious changes in v1.7 or 1.7.4 as far as processing the PASA input GFF3 files and passing them to EVM... so this must mean it's still a bug. I'll dig into it now and see if I can figure it out.

EarlyEvol commented 4 years ago

I just looked at logs from 2 other genomes. EVM got passed the larger "best" PASA models, not just the Augustus training set. Something weird must be going on. I'm completely rerunning predict now. Thanks for spending time on this.

nextgenusfs commented 4 years ago

I think I found the bug in the current code -- pushing a change now. It might have only shown up on a re-run of predict.

EarlyEvol commented 4 years ago

Nice! Glad you found it. That makes sense: I first attempted to run train and predict together, it got killed at some point within predict, and I restarted. I think I deleted the incomplete intermediates but left most of predict_misc.

nextgenusfs commented 4 years ago

Let me know if a fresh run results in the same problem -- if it does, let's look specifically at the version you are running, and perhaps I can just tell you which line to comment out if you don't want to upgrade. Generally I would always recommend upgrading, but I realize you want to be consistent, especially if you are doing comparisons between genomes.

EarlyEvol commented 4 years ago

Jon, sorry for the delayed response; I got distracted trying to move stuff over to a new compute cluster. On the rerun, predict with no intermediates still sent just the PASA training models to EVM. Not sure what the difference was for this assembly vs. the others I have annotated. I checked out your recent commits, and it looked like I needed to comment out lines 1298 and 1299. That did the trick! Thanks for checking this out for me! Earl

nextgenusfs commented 4 years ago

Great, Earl -- thanks for following up. Yes, it looks like on a re-run those two lines were basically overwriting the input PASA models with the PASA training set.
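For anyone hitting this later, the failure mode was roughly this pattern (a hypothetical sketch; the file name and logic are invented, not the actual funannotate lines):

import os

def resolve_pasa_gff(user_pasa_gff, misc_dir="predict_misc"):
    # Buggy re-run shortcut: a leftover training subset from an earlier,
    # interrupted run silently replaces the full PASA input, so only the
    # filtered training models ever reach EVM.
    cached = os.path.join(misc_dir, "pasa.training.gff3")  # invented name
    if os.path.isfile(cached):
        user_pasa_gff = cached
    return user_pasa_gff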