Open Thatguy027 opened 2 years ago
Hi @Thatguy027 - I think the confusion may come from the names of the scripts, i.e. funannotate train doesn't actually do the training; it generates all the data required for training. funannotate predict then uses those training data, if they exist, to train the ab initio predictors.
So with what you ran above, funannotate train will run Trinity/PASA and generate RNA-seq-derived gene models via PASA. These data then reside in the funannotate_run/training directory. When you call funannotate predict, it will grab these data and use the PASA gene models to train the ab initio predictors: Augustus, GlimmerHMM, and SNAP.
The output you are seeing here is related to Augustus training; it's just some stats that the Augustus training script outputs. I'm not sure how concerned you should be about the low gene accuracy. You can re-run with the optimize_augustus option, which runs an ML-like iterative approach to try to improve precision/accuracy during training. While this sounds like a great idea, I generally don't see it improve these statistics much, so it isn't turned on by default. "Accuracy" is always difficult to define when there is no "right/correct" answer about gene models.
One of the outputs of funannotate predict will be a .parameters.json file that contains the training models for the predictors. You can add this to your internal database with the funannotate species -s new_species_name -a new_species_name.parameters.json command, which will allow you to re-use those training parameters on a different genome assembly, for example. Or you can pass that .parameters.json file to funannotate predict on another genome and it will re-use those parameters.
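As a minimal sketch of the registration step described above: the wrapper below only uses the funannotate species -s ... -a ... form quoted in this reply. The function name register_params, the DRY_RUN switch, and the file names are hypothetical and exist only for illustration; by default the command is printed rather than executed so it can be checked first.

```shell
# Hypothetical helper (not part of funannotate): builds the
# `funannotate species -s NAME -a FILE` command quoted in the reply above.
register_params() {
    species_name=$1   # name to store the trained parameters under
    params_json=$2    # .parameters.json file written by funannotate predict

    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Default: only print the command so it can be reviewed.
        echo "funannotate species -s $species_name -a $params_json"
    else
        # Actually register the parameters in the local funannotate database.
        funannotate species -s "$species_name" -a "$params_json"
    fi
}

# Placeholder names taken from the reply; substitute your own.
register_params new_species_name new_species_name.parameters.json
```

Setting DRY_RUN=0 before the call runs the command for real, which assumes funannotate is installed and on the PATH.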
Regarding the comment about intergenic regions (this is at least my limited understanding): HMM-based ab initio gene prediction models predict introns and then infer gene models from those intron predictions.
Hello
I ran the following commands:
And received the following message:
This message motivated me to look up what the parameters in c_elegans_trsk are actually for, and it's my understanding that these parameters are for intergenic regions, which are probably not the parameters I want for gene prediction. Is it possible to re-run just the prediction step with a different species as input?
Or do you suggest re-running training and prediction with the same species?
Thanks!