nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License
301 stars 82 forks source link

Poor Snap Performance #386

Closed nhartwic closed 3 years ago

nhartwic commented 4 years ago

Output from a recent test run (funannotate version 1.7.4) below...

Program Training-Method
augustus pretrained
genemark selftraining
glimmerhmm busco
snap busco
Source Weight Count
Augustus 1 5906
Augustus HiQ 2 145
GeneMark 1 6335
GlimmerHMM 1 6068
snap 1 12

...Funannotate lists the snap version as 2006-07-28. Conda lists my snap package as being...

snap 2013_11_29 h516909a_2 bioconda

... Any idea why snap would be producing garbage? Is there some other version of it that I need? Is there any way to disable snap training and just use a pre-existing model from a closely related species?

So far, my best sollution here is to run snap before running funannotate using a pre-existing model, convert that output to gff3 (hopefully with whatever script funannotate uses), pass that gff3 as an 'other_gff' to funannotate predict using weight 1, and set the built in snaps weight to be equal to 0. I'm completely open to other sollutions if there is one.

nextgenusfs commented 4 years ago

Hmm. I have occasionally seen snap act strange like this but I don’t know the cause. Does it do the same thing on another set of contigs? Anything unique or different about your contig names? Would be helpful to figure out why training is failing. What happens when you run it on the test data, ie funannotate test -t busco?

You could manually edit the parameter Json file to point to a different preexisting training data and then pass the json file to predict. I wasn’t aware there are commonly used pre-existing snap parameter files.

nhartwic commented 4 years ago

I don't know if they are commonly used, but the snap repo includes numerous models. You can look at them here. Pretty much everyone should have access to these already downloaded on their system...

https://github.com/KorfLab/SNAP/tree/master/HMM

I don't think I've ever gotten good performance out of snap with any of my assemblies. Glimmer models tend to look weird as well. Glimmerhmm also makes available a handful of trained models.

I've attached the stdout from funannotate test -t busco fun.out.txt

How would I edit the parameter json file? where is it? What should I change it to read?

nextgenusfs commented 4 years ago

You should be getting something that looks like the following (you'll need to redirect stderr and stdout to save in a file). But if you aren't getting these results, then something is amiss with your install I guess. Below is done with the most recent conda recipe, conda create -n funannotate funannotate, followed by export FUNANNOTATE_DB=/path/todir and then funannotate setup -i all.

]$ funannotate test -t busco --cpus 12
#########################################################
Running `funannotate predict` BUSCO-mediated training unit testing
CMD: funannotate predict -i test.softmasked.fa --protein_evidence protein.evidence.fasta -o annotate --cpus 12 --species Awesome busco
#########################################################
-------------------------------------------------------
[08:10 PM]: OS: linux2, 32 cores, ~ 264 GB RAM. Python: 2.7.15
[08:10 PM]: Running funannotate v1.7.4
[08:10 PM]: GeneMark not found and $GENEMARK_PATH environmental variable missing. Will skip GeneMark ab-initio prediction.
[08:10 PM]: Parsed training data, run ab-initio gene predictors as follows:
  Program      Training-Method
  augustus     busco
  glimmerhmm   busco
  snap         busco
[08:10 PM]: CodingQuarry will be skipped --> --rna_bam required for training
[08:10 PM]: Loading genome assembly and parsing soft-masked repetitive sequences
[08:10 PM]: Genome loaded: 6 scaffolds; 3,776,588 bp; 19.75% repeats masked
[08:10 PM]: Mapping 1,065 proteins to genome using diamond and exonerate
[08:11 PM]: Found 1,774 preliminary alignments --> aligning with exonerate
[08:11 PM]: Exonerate finished: found 1,336 alignments
[08:11 PM]: Running BUSCO to find conserved gene models for training ab-initio predictors
[08:16 PM]: 373 valid BUSCO predictions found, now formatting for EVM
[08:16 PM]: Running EVM commands with 11 CPUs
[08:16 PM]: Converting to GFF3 and collecting all EVM results
[08:16 PM]: 365 total gene models from EVM, now validating with BUSCO HMM search
[08:17 PM]: 365 BUSCO predictions validated
[08:17 PM]: Training Augustus using BUSCO gene models
[08:17 PM]: Augustus initial training results:
  Feature       Specificity   Sensitivity
  nucleotides   99.3%         84.0%
  exons         69.6%         56.5%
  genes         82.0%         55.8%
[08:17 PM]: Running Augustus gene prediction using awesome_busco parameters
[08:18 PM]: 1,393 predictions from Augustus
[08:18 PM]: Pulling out high quality Augustus predictions
[08:18 PM]: Found 318 high quality predictions from Augustus (>90% exon evidence)
[08:18 PM]: Running SNAP gene prediction, using training data: annotate/predict_misc/busco.final.gff3
[08:18 PM]: 1,304 predictions from SNAP
[08:18 PM]: Running GlimmerHMM gene prediction, using training data: annotate/predict_misc/busco.final.gff3
[08:19 PM]: 1,775 predictions from GlimmerHMM
[08:19 PM]: Summary of gene models passed to EVM (weights):
  Source         Weight   Count
  Augustus       1        1075
  Augustus HiQ   2        318
  GlimmerHMM     1        1775
  snap           1        1304
  Total          -        4472
[08:19 PM]: Running EVM commands with 11 CPUs
[08:21 PM]: Converting to GFF3 and collecting all EVM results
[08:21 PM]: 1,675 total gene models from EVM
[08:21 PM]: Generating protein fasta files from 1,675 EVM models
[08:21 PM]: now filtering out bad gene models (< 50 aa in length, transposable elements, etc).
[08:21 PM]: Found 143 gene models to remove: 0 too short; 0 span gaps; 210 transposable elements
[08:21 PM]: 1,532 gene models remaining
[08:21 PM]: Predicting tRNAs
[08:22 PM]: 112 tRNAscan models are valid (non-overlapping)
[08:22 PM]: Generating GenBank tbl annotation file
[08:22 PM]: Converting to final Genbank format
[08:22 PM]: Collecting final annotation files for 1,644 total gene models
[08:22 PM]: Funannotate predict is finished, output files are in the annotate/predict_results folder
[08:22 PM]: Your next step might be functional annotation, suggested commands:
-------------------------------------------------------
Run InterProScan (Docker required):
funannotate iprscan -i annotate -m docker -c 12

Run antiSMASH:
funannotate remote -i annotate -m antismash -e youremail@server.edu

Annotate Genome:
funannotate annotate -i annotate --cpus 12 --sbt yourSBTfile.txt
-------------------------------------------------------

[08:22 PM]: Training parameters file saved: annotate/predict_results/awesome_busco.parameters.json
[08:22 PM]: Add species parameters to database:

  funannotate species -s awesome_busco -a annotate/predict_results/awesome_busco.parameters.json

#########################################################
SUCCESS: `funannotate predict` BUSCO-mediated training test complete.
#########################################################

As you can see, predict generates a json parameters file after each run containing the training data, it looks like this:

$ python -m json.tool awesome_busco.parameters.json
{
    "augustus": [
        {
            "date": "2020-02-25",
            "path": "/path/test-busco_4327/annotate/predict_misc/ab_initio_parameters/augustus/species/awesome_busco",
            "source": "BUCSCO dikarya",
            "version": "funannotate v1.7.4"
        }
    ],
    "codingquarry": [
        {}
    ],
    "genemark": [
        {}
    ],
    "glimmerhmm": [
        {
            "date": "2020-02-25",
            "path": "/path/test-busco_4327/annotate/predict_misc/ab_initio_parameters/glimmerhmm",
            "source": "BUCSCO dikarya",
            "version": "funannotate v1.7.4"
        }
    ],
    "snap": [
        {
            "date": "2020-02-25",
            "path": "/path/test-busco_4327/annotate/predict_misc/ab_initio_parameters/awesome_busco.snap.hmm",
            "source": "BUCSCO dikarya",
            "version": "funannotate v1.7.4"
        }
    ]
}

So you could theoretically point the snapp HMM file to a pre-trained version and then use this parameter file in predict.

nhartwic commented 4 years ago

Well, I'm getting output that looks similar to that. The biggest difference is that my snap produces 0 predictions which results in funannotate leaving it off this list entirely. All of the counts are slightly off though...

Augustus 1 1086 Augustus HiQ 2 321 GeneMark 1 1538 GlimmerHMM 1 1786 Total - 4731

...If I leave one of the tool entries containing only an empty dict, will that communicate to funannotate that it should follow default behavior with regard to that predictor? for example, If I were to use your paramter file on my system, where genemark IS installed, would funannotate run genemark? If not, how do I get it to do so while pointing snap to pre-trained model?

nextgenusfs commented 4 years ago

The parameter JSON file is simply a way of pointing to the data in a programatic way -- its how you add species parameters files to the database so they can be reused. So you'd run through predict one time to generate a paremters JSON file, then change the SNAP HMM path to point to the HMM that you think will work better, and then run predict again using that updated JSON file. This is not really what I would recommend doing though, seems like a better solution to figure out what is wrong with SNAP -- or don't use it at all, which you can control with --weights snap:0.

If you'd like to try to trouble shoot why snap is failing, here is what should be happening (this is from logfile):

[02/25/20 20:18:01]: Running SNAP gene prediction, using training data: annotate/predict_misc/busco.final.gff3
[02/25/20 20:18:02]: 365 gene models to train snap on 6 scaffolds
[02/25/20 20:18:02]: fathom /path/test-busco_4327/annotate/predict_misc/snap.training.zff /path/test-busco_4327/annotate/predict_misc/snap-training.scaffolds.fasta -categorize 1000 -min-intron 10 -max-intron 3000
[02/25/20 20:18:02]: fathom uni.ann uni.dna -export 1000 -plus
[02/25/20 20:18:02]: forge export.ann export.dna
[02/25/20 20:18:03]: perl /path/envs/funannotate_public/bin/hmm-assembler.pl snap-trained annotate/predict_misc/snaptrain
[02/25/20 20:18:03]: snap annotate/predict_misc/snap-trained.hmm /path/test-busco_4327/annotate/predict_misc/genome.softmasked.fa
[02/25/20 20:18:23]: 1,304 predictions from SNAP

You can run these commands manually to see if there are some errors not found in the logfile, etc. I've also attached the hmm model output from this busco training set, should allow you to see if the problem is in the snap executable itself or the training (my guess would be training).

snap-trained.zip

nhartwic commented 4 years ago

As best I can tell, some of those commands are supposed to be run from the snaptrain folder and the final two commands are meant to be run from inside "test-busco_56173".

None of the commands produce any errors. The model file produced by my command is different from yours though. Specifcally, my parameters in section Einit 2 look like garbage. Some kind of silent error seems to be happening.

snap-trained.hmm.gz

Any ideas where to look next?

EDIT: Further digging, my bad parameters seem to be coming from the file:

predict_misc/snaptrain/Einit-explicit.duration

So, the question becomes, how is this file being generated

hyphaltip commented 4 years ago

Can try training on our platform if you can send the training input export.ann and export.dna to mAke sure forge is working.

Jason Stajich, PhD jasonstajich.phd@gmail.com On Feb 26, 2020, 2:08 PM -0800, nhartwic notifications@github.com, wrote:

As best I can tell, some of those commands are supposed to be run from the snaptrain folder and the final two commands are meant to be run from inside "test-busco_56173". None of the commands produce any errors. The model file produced by my command is different from yours though. Specifcally, my parameters in section Einit 2 look like garbage. Some kind of silent error seems to be happening. snap-trained.hmm.gz Any ideas where to look next? — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or unsubscribe.

nhartwic commented 4 years ago

I'm 90% sure the issue was forge. I went ahead and cloned snap and compiled locally. I then modified the dummy forge executable in my conda environment to point to the new copy of forge. I'm now getting performance similar to what was expected. My models file is still different though. Current gene counts..

Source Weight Count Augustus 1 1073 Augustus HiQ 2 320 GlimmerHMM 1 1775 snap 1 1304 Total - 4472

.. New model file snap-trained.hmm.gz

... export files if you want to run them. export.ann.gz export.dna.gz

...This leaves open the question of why the copy of forge installed by conda didn't work properly on my system. But I'd say the issue is essentially resolved.

nextgenusfs commented 4 years ago

Nice -- glad you got it sorted. If we knew why the conda built forge was not working I could try to fix the recipe. But yes we can just leave this open so if others run into this on their system, fix is to manually compile snap.

The bioconda snap was working for me on centOS and macOSX.

jimie0311 commented 2 years ago

hi Joh, I met the similar problem. The gene numbers predicted by snap varied time by time. Do you mean we need to install snap locally

I'm 90% sure the issue was forge. I went ahead and cloned snap and compiled locally. I then modified the dummy forge executable in my conda environment to point to the new copy of forge. I'm now getting performance similar to what was expected. My models file is still different though. Current gene counts..

Source Weight Count Augustus 1 1073 Augustus HiQ 2 320 GlimmerHMM 1 1775 snap 1 1304 Total - 4472

.. New model file snap-trained.hmm.gz

... export files if you want to run them. export.ann.gz export.dna.gz

...This leaves open the question of why the copy of forge installed by conda didn't work properly on my system. But I'd say the issue is essentially resolved.

Hi, could u share the command? How to set the snap path? Thanks.

nextgenusfs commented 2 years ago

If you are on Debian based Linux just install snap with apt-get and uninstall from conda. Or you can compile snap from source. It's just that the bioconda version seems to be broken on Debian based systems.

nhartwic commented 2 years ago

Alternatively, if you need example commands, this appears to be working in my containers...

git clone https://github.com/KorfLab/SNAP.git && \
    cd SNAP/ && \
    make && \
    cp forge /opt/conda/envs/myenv/bin/

Note that "/opt/conda/envs/myenv/bin/" should be replaced with the path to the bin inside the conda environment you are using for funannotate.

leegene1992 commented 1 year ago

I have encountered the problem with SNAP prediction. After replace the forge, the number increase from 2 to 1,504 in my case. Thanks a lot for this open question.