nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

Train and predict with different species parameters #636

Open Thatguy027 opened 2 years ago

Thatguy027 commented 2 years ago

Hello

I ran the following commands:

funannotate train -i XZ1516_ragtag_correct_scaffold_masked_nameChange.fasta -o funannotate_run/ \
    --left XZ1516_S1_R1_001.fastq.gz \
    --right XZ1516_S1_R2_001.fastq.gz \
    --stranded RF --species "c_elegans_trsk" \
    --strain XZ1516 --cpus 16

funannotate predict -i XZ1516_ragtag_correct_scaffold_masked_nameChange.fasta \
            -o funannotate_run/ -s "c_elegans_trsk" --strain XZ1516 --cpus 16

And received the following message:

[Sep 02 10:08 AM]: Augustus initial training results:
  Feature       Specificity   Sensitivity
  nucleotides   97.0%         91.6%      
  exons         87.9%         83.2%      
  genes         47.6%         40.4%     
[Sep 02 10:08 AM]: Accuracy seems low, you can try to improve by passing the --optimize_augustus option.

This message motivated me to look up what the parameters in c_elegans_trsk are actually for. My understanding is that they are for intergenic regions, which are probably not the parameters I want for gene prediction.

Is it possible to re-run just the prediction step with a different species as input?

Or do you suggest re-running training and prediction with the same species?

Thanks!

nextgenusfs commented 2 years ago

Hi @Thatguy027 - I think the confusion may come from the names of the scripts: funannotate train doesn't actually do the training, but rather generates all the data required for training. funannotate predict then uses those training data, if they exist, to train the ab initio predictors.

So with what you ran above, funannotate train will run Trinity/PASA and generate RNA-seq-based gene models via PASA. These data then reside in the funannotate_run/training directory. When you call funannotate predict, it will grab these data and use the PASA gene models to train the ab initio predictors: Augustus, GlimmerHMM, and SNAP.
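A quick way to confirm that the first step left something for predict to use is to check that directory before launching the second step. This is only a sketch: the path is the one from the commands above, the helper name has_training_data is made up for illustration, and the exact contents of the training directory are not described here, so the check only tests that the directory exists and is non-empty.

```shell
# Sketch: confirm that `funannotate train` left data behind before calling predict.
# has_training_data is a hypothetical helper, not part of funannotate.
has_training_data() {
    # True when the directory exists and contains at least one entry.
    [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

if has_training_data "funannotate_run/training"; then
    echo "training data found; funannotate predict can use it"
else
    echo "no training data found; run funannotate train first"
fi
```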

The output you are seeing here is related to Augustus training; it's actually just some stats that the Augustus training script outputs. I'm not sure how concerned you should be about the low gene-level accuracy. You can re-run with the --optimize_augustus option, which runs an ML-like iterative approach to try to improve precision/accuracy during training. While this sounds like a great idea, I generally don't see it improve these statistics much, which is why it isn't turned on by default. "Accuracy" is always difficult to define when there is no "right/correct" answer about gene models.
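For reference, a re-run of the prediction step with optimization enabled might look like the sketch below. It reuses the file names and options from the original post and only adds --optimize_augustus. The run wrapper and the DRY_RUN variable are illustrative conveniences, not funannotate features: by default the command is printed rather than executed, and setting DRY_RUN=0 runs it for real. How much of the earlier run is reused on a second pass is not covered here.

```shell
# Sketch: re-run predict with Augustus optimization turned on.
# `run` and DRY_RUN are hypothetical helpers; with DRY_RUN=1 (the default)
# the command is only printed so it can be inspected first.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

run funannotate predict \
    -i XZ1516_ragtag_correct_scaffold_masked_nameChange.fasta \
    -o funannotate_run/ \
    -s "c_elegans_trsk" --strain XZ1516 --cpus 16 \
    --optimize_augustus
```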

One of the outputs of funannotate predict will be a .parameters.json file that contains the trained models for the predictors. You can add this to your internal database with the funannotate species -s new_species_name -a new_species_name.parameters.json command, which will allow you to re-use those training parameters on a different genome assembly, for example. Alternatively, you can pass that .parameters.json file directly to funannotate predict on another genome and it will re-use those parameters.
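The two routes could be sketched as follows. The species command is the one quoted above. Everything else is a placeholder or an assumption: other_genome.fasta, other_run/ and "Genus species" are made-up names, and the -p flag used to hand the JSON file to predict is an assumption that should be checked against funannotate predict --help for your installed version. As before, the run wrapper is a hypothetical dry-run helper that prints the commands unless DRY_RUN=0 is set.

```shell
# Sketch: two ways to re-use trained parameters on another assembly.
# `run` and DRY_RUN are hypothetical helpers, not funannotate features.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Route 1: register the parameters in the local database, then refer to them by name.
run funannotate species -s new_species_name -a new_species_name.parameters.json
run funannotate predict -i other_genome.fasta -o other_run/ \
    -s "new_species_name" --cpus 16

# Route 2: pass the JSON file straight to predict.
# ASSUMPTION: -p is the option that accepts the parameters file; verify with --help.
run funannotate predict -i other_genome.fasta -o other_run/ \
    -s "Genus species" -p new_species_name.parameters.json --cpus 16
```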

Per the comment about intergenic regions (this is at least my limited understanding): HMM-based ab initio gene prediction models predict introns and then infer gene models based on those intron predictions.