nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

Is EVM getting all gene models? #240

Closed PlantDr430 closed 5 years ago

PlantDr430 commented 5 years ago

Version - 1.5.1

During the run I noticed that all of the gff3's that I passed into the prediction function were given ID's (other_pred 1-4) as I passed four gff's in. When I looked at the predict log file, I saw this.

[11/29/18 06:58:28]: Summary of gene models: {'genemark': 8292, 'hiq': 2244, 'other_pred3': 7719, 'pasa': 0, 'other_pred1': 9189, 'augustus': 5111, 'total': 32995, 'other_pred2': 440}
[11/29/18 06:58:28]: EVM Weights: {'genemark': '1', 'hiq': '5', 'other_pred4': '1', 'other_pred3': '1', 'other_pred2': '1', 'other_pred1': '1', 'augustus': '1'}

Notice that in the "summary of gene models" there is no other_pred4, but there is an other_pred4 in the EVM weights.

I also see that other_pred4 was parsed in, and it should have about 8,433 genes. So the "total" isn't including the gene models from this gff.

[11/29/18 02:20:39]: Parsing GFF pass-through: LM84_glimmerhmm.gff3 --> setting source to other_pred4
[11/29/18 02:20:39]: perl /data/linuxbrew/.linuxbrew/opt/evidencemodeler/EvmUtils/gff3_gene_prediction_file_validator.pl /data/wyka/funannotate/LM84/LM84_fun_output/predict_misc/other4_predictions.gff3

I have attached the predict log file and the glimmerHMM.gff3 file. Let me know if you need another log file. I just want to make sure EVM is processing all of the gene models.

funannotate-predict.log

LM84_glimmerhmm.txt

PlantDr430 commented 5 years ago

I have a hunch that the problem is that this gff3 doesn't have "exon" listed in column 3, but uses CDS. Could EVM not be picking these up, because it is strictly looking for the word "exon"?
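A quick way to test that hunch (a minimal sketch for illustration, not part of funannotate) is to tally the column-3 feature types in the gff3. If 'CDS' is present but 'exon' never appears, any parser keyed strictly on exon features would silently skip every gene model:

```python
from collections import Counter

def feature_type_counts(gff3_path):
    """Tally the feature types in column 3 of a GFF3 file."""
    counts = Counter()
    with open(gff3_path) as fh:
        for line in fh:
            # skip comments, directives, and blank lines
            if line.startswith("#") or not line.strip():
                continue
            cols = line.rstrip("\n").split("\t")
            if len(cols) >= 8:
                counts[cols[2]] += 1
    return counts
```

Running this on a GlimmerHMM gff3 that lacks exon features would show counts for `mRNA` and `CDS` only, confirming the hypothesis above.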

nextgenusfs commented 5 years ago

It's probably not formatted correctly for EVM -- I will try to have a look when I get home from work. But you can test whether it is valid by using a script in the EVidenceModeler distribution.

$EVM_HOME/EvmUtils/gff3_gene_prediction_file_validator.pl

What other predictors are you using? There has been a request to add SNAP and/or GlimmerHMM.

PlantDr430 commented 5 years ago

Yea, it's not formatted correctly. I had to use the converter from biocode to convert it.

It was I who requested SNAP. I am actually using SNAP, GlimmerHMM, CEGMA (I used to use this to train SNAP, but switched to BUSCO and decided to add the ~400 or so genes from CEGMA into my pipeline as well), and GeneID. I trained both SNAP and GlimmerHMM for each isolate with all single-copy BUSCOs, while GeneID was trained using BUSCOs, from the currently published closely related species, that are present in all of the samples I am annotating (around 2,500 BUSCOs or so). I chose this method for GeneID because the parameter file creation steps are rather long, although I am still debating this approach.


PlantDr430 commented 5 years ago

So I am not seeing any differences in the number of gene models coming out of the final EVM run whether I pass the other GFFs directly to it or leave them out. Which I guess is good, and shows that your script does an excellent job of identifying genes with confidence, but I would expect at least slight differences in gene structure when using the other GFFs versus not using them.

Also, I've been looking through the "predict" script to find out where/how I can pass these other GFFs into the first EVM run, where you look for high-quality predictions to train Augustus. I think I found the location (lines 914 - 997), but I am still learning Python, so it might take some time for me to come up with a way to pass the other GFFs there. I wanted to ask: if you add SNAP or GlimmerHMM (or both) in the future, will their predictions be used to find high-quality predictions prior to the AUGUSTUS run?

nextgenusfs commented 5 years ago

Since both SNAP and GlimmerHMM need to be trained, the logical step is to train them using PASA results if present and default to Busco2 results otherwise. The first EVM step in funannotate predict, when there is no RNA-seq data, is to run Busco2 and capture that output. Since GeneMark is self-training, I then slice out the GeneMark models that overlap with the Augustus-mediated BUSCO predictions; those are fed to EVM with any evidence in those regions to generate the final Busco2 models. Those BUSCO models are then used to train Augustus and, theoretically, SNAP/GlimmerHMM. CodingQuarry can only be used with RNA-seq data. After all de novo steps are completed, all of the models are fed to a new instance of EVM to generate consensus predictions.

When you add custom gff3 files, they are added at the final EVM stage with minimal processing: basically, if they pass the EVM validation script, they are added to the pool of de novo predictions.
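For context, the weights EVM actually consumes are written to a tab-delimited file with three columns: evidence class, source name (matching GFF3 column 2), and weight. The fragment below is only an illustrative reconstruction based on the log output earlier in this thread; the exact class assignments are an assumption, not copied from funannotate:

```
ABINITIO_PREDICTION	augustus	1
ABINITIO_PREDICTION	genemark	1
OTHER_PREDICTION	hiq	5
OTHER_PREDICTION	other_pred1	1
TRANSCRIPT	transcript_alignments	1
PROTEIN	protein_alignments	1
```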

PlantDr430 commented 5 years ago

Ah, I see, okay. So the main difference would be that SNAP and GlimmerHMM would be trained with the high-quality BUSCO models that are used to train AUGUSTUS, instead of all BUSCOs as I have been using. If you did decide to implement SNAP/GlimmerHMM, would you also look for high-quality SNAP/GlimmerHMM predictions (>90% coverage of hints), parse those out, and give them a higher EVM weight, as you do with AUGUSTUS predictions?


nextgenusfs commented 5 years ago

The trick for HiQ is that the exon/intron support information is output in the non-standard Augustus gff3 format, so I can somewhat easily parse it. Basically I'm doing that to skew the weighting in EVM, as we have a priori knowledge that that particular Augustus prediction is correct. Theoretically this is what EVM is doing anyway: weighting the models that align with the evidence (which is why transcript evidence is important). So in the event that there is no HiQ model, we just let EVM run its normal scoring system on any predictions in that region. If I had code that could weight all the models, then I guess we wouldn't need EVM, but there are no plans to implement that.

PlantDr430 commented 5 years ago

Okay, thanks for the response.


nextgenusfs commented 5 years ago

Do you still have a need/want to incorporate snap and/or glimmerhmm?

PlantDr430 commented 5 years ago

Will sign on and try the new version ASAP. As for SNAP and GlimmerHMM, I have incorporated them into my pre-funannotate run along with other gene model predictors.

nextgenusfs commented 5 years ago

Okay, are you passing the output then to the --other_gff? Looks like the trainGlimmerHMM on Bioconda is broken, I'll try to get that fixed and then I can probably add snap/glimmer directly to the predict workflow, i.e. using the same Augustus training data to train each of them.

PlantDr430 commented 5 years ago

Yea I am passing all my other gff's through the --other_gff flag.

nextgenusfs commented 5 years ago

The latest commit adds SNAP and GlimmerHMM -- they will each be trained by filtered PASA models, or default back to BUSCO2 models if there are no PASA models. I'd be interested to hear your feedback prior to pushing the release. I also added some finer-grained control over the EVM weights via the -w,--weights option, where you can pass prediction:weight pairs, i.e. you could turn off CodingQuarry and set SNAP to 2 by passing -w codingquarry:0 snap:2
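The override syntax described above could be parsed with something like the following (an illustrative sketch with a hypothetical helper name, not funannotate's actual code):

```python
def parse_weight_overrides(pairs, defaults):
    """Merge 'prediction:weight' strings (as passed to -w/--weights)
    into a copy of the default weight dict. A weight of 0 effectively
    disables that source in EVM."""
    weights = dict(defaults)
    for pair in pairs:
        source, sep, value = pair.partition(":")
        if not sep:
            raise ValueError("expected prediction:weight, got %r" % pair)
        weights[source] = int(value)
    return weights
```

So `-w codingquarry:0 snap:2` would leave every default in place except setting codingquarry to 0 and snap to 2.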

PlantDr430 commented 5 years ago

Nice, unfortunately I will not be re-running any more predictions on my current genomes (although I will be using funannotate for all future work). When my annotate run finishes I will run one of my genomes through without my pre-predicted snap and glimmer and let you know how that runs.

Good touch with the -w, --weights options as well.

hyphaltip commented 5 years ago

awesome, I will try out the additional predictors in the near future on our datasets - do you think you'll push 1.6 out soon, or should I play with the trunk code for now?

nextgenusfs commented 5 years ago

Was hoping you guys could give it a try before I tag a release. I’m thinking this is close to the last significant update before we get this thing published.

PlantDr430 commented 5 years ago

My annotate runs should definitely be done by tomorrow. I will run some genomes through after that to compare with my previous predictions.

PlantDr430 commented 5 years ago

Runs finished early, so I was testing some genomes. I hit a snag with some of the other gff's that I am passing through.

[05/28/19 13:50:16]: /data/wyka/funannotate-master/bin/funannotate-predict.py -i CCC535_masked.fasta -o CCC535_fun_output -s Claviceps spartinae --strain CCC535 --SeqCenter NCBI --SeqAccession SRPX00000000 --repeats2evm --transcript_evidence /data/wyka/Claviceps_ESTs.fasta --protein_evidence /data/linuxbrew/.linuxbrew/opt/funannotate/database/uniprot_sprot.fasta /data/wyka/Pconf.fasta --cpus 24 --busco_seed_species Claviceps_purpurea_refined --busco_db sordariomyceta_odb9 --name E4U52 --optimize_augustus --soft_mask 1000 --min_protlen 50 --other_gff CCC535_geneid.gff3:1 CCC535_gemoma.gff3:1 CCC535_codingquarry.gff3:1

[05/28/19 13:50:16]: OS: linux2, 24 cores, ~ 264 GB RAM. Python: 2.7.12
[05/28/19 13:50:16]: Running funannotate v1.6.0-ac857a9
[05/28/19 13:50:16]: {'genemark': 1, 'hiq': 2, 'glimmerhmm': 1, 'pasa': 6, 'snap': 1, 'transcripts': 1, 'proteins': 1, 'codingquarry': 2, 'augustus': 1}
[05/28/19 13:50:17]: Augustus training set for claviceps_spartinae_ccc535 already exists. To re-train provide unique --augustus_species argument
[05/28/19 13:50:18]: AUGUSTUS (3.2.2) detected, version seems to be compatible with BRAKER and BUSCO
[05/28/19 13:50:18]: Parsing GFF pass-through: CCC535_geneid.gff3 --> setting source to other_pred1
[05/28/19 13:50:19]: perl /data/linuxbrew/.linuxbrew/opt/evidencemodeler/EvmUtils/gff3_gene_prediction_file_validator.pl /data/wyka/test_funannotate/CCC535/CCC535_fun_output/predict_misc/other1_predictions.gff3

Above is from the log file, below is the error from the terminal

[01:50 PM]: AUGUSTUS (3.2.2) detected, version seems to be compatible with BRAKER and BUSCO
[01:50 PM]: Parsing GFF pass-through: CCC535_geneid.gff3 --> setting source to other_pred1
Traceback (most recent call last):
  File "/data/wyka/funannotate-master/bin/funannotate-predict.py", line 391, in <module>
    if not featurename in startWeights:
NameError: name 'startWeights' is not defined
nextgenusfs commented 5 years ago

Thanks -- sorry, typo; it should be fixed here (hopefully): https://github.com/nextgenusfs/funannotate/commit/0eff679df12381edf7b7e7830d10c0bd7c0bb326

PlantDr430 commented 5 years ago

I cloned and ran v1.6.0-0eff679, but I don't think SNAP or GlimmerHMM ran at all. Attached is the log file. funannotate-predict.log_v1.6.0-0eff679.txt

nextgenusfs commented 5 years ago

And you can confirm that snap, fathom, glimmerhmm, trainGlimmerHMM are in your path? I will look at it later today

PlantDr430 commented 5 years ago

No, they aren't in root, so I will have to export the paths. I didn't see additional setup such as $EVM_HOME, so it just slipped my mind.


PlantDr430 commented 5 years ago

Alright, well I made sure to add the paths to the snap, fathom, glimmerhmm, and trainGlimmerHMM executables (and to the folders the executables are in) to $PATH, as I wasn't sure exactly how you were searching for them. The run seemed to just use GeneMark and BUSCO and then use those models after the EVM run for Augustus training. Attached is the log file. I also got an error during the initial Augustus training run this time around. funannotate-predict.log_5-29-12pm.txt

[10:21 AM]: Running GeneMark-ES on assembly
[10:28 AM]: Converting GeneMark GTF file to GFF3
[10:28 AM]: Found 7,846 gene models
[10:28 AM]: Running BUSCO to find conserved gene models for training Augustus
[11:05 AM]: 3,435 valid BUSCO predictions found, now formatting for EVM
[11:09 AM]: Setting up EVM partitions
[11:09 AM]: Generating EVM command list
[11:09 AM]: Running EVM commands with 23 CPUs
[11:11 AM]: Combining partitioned EVM outputs
[11:11 AM]: Converting EVM output to GFF3
[11:14 AM]: Collecting all EVM results
[11:14 AM]: 3,332 total gene models from EVM
[11:14 AM]: Checking BUSCO protein models for accuracy
[11:18 AM]: 3,304 BUSCO predictions validated
[11:18 AM]: Training Augustus using BUSCO gene models
Traceback (most recent call last):
  File "/data/wyka/funannotate-master/bin/funannotate-predict.py", line 1064, in <module>
    lib.runSubprocess(cmd, '.', lib.log)
  File "/data/wyka/funannotate-master/lib/library.py", line 694, in runSubprocess
    proc = subprocess.Popen(cmd, cwd=dir, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
  File "/usr/lib/python2.7/subprocess.py", line 711, in __init__
    errread, errwrite)
  File "/usr/lib/python2.7/subprocess.py", line 1343, in _execute_child
    raise child_exception
OSError: [Errno 20] Not a directory
PlantDr430 commented 5 years ago

Sorry for the delay.

I tried again yesterday and was able to run v1.6.0-0eff679 to full completion. I did not get the Augustus error this time, and SNAP and GlimmerHMM both worked. The results were very comparable to mine, with about 100 fewer gene models predicted than in my runs; but I also used all BUSCO models to train SNAP and GlimmerHMM, so they predicted more genes from all BUSCOs than from the filtered Augustus models you used. Overall, I'd give it a thumbs up and will use your method next time.

PlantDr430 commented 5 years ago

Actually, I think you might be running into a problem similar to one I had with SNAP in regards to input gff format. If we look at the err.ann for the same genome, I see 3,193 errors from your SNAP run, which uses 3,332 input genes from the busco.final.gff3 file, while my run of SNAP from BUSCOs had only 24 errors in its err.ann file even though I used about 5,000 genes for training. Both files are attached. I believe the problem is caused by the input gff files, as I've noticed that every program accepts a slightly different gff format (which drives me crazy sometimes).

err.ann_funannotate-run.txt err.ann_my-run.txt

EDIT: Although, I have been comparing your snap.training.zff file with the zff from my run and I can't find anything that looks wrong. So I am not sure why there is such a large difference in the number of errors in the input gene models.

Further edit: What I am noticing is that in your snap.training.zff file, if we look at this one example:

>contig_912
Einit   1598    1556    mrna3331
Exon    1420    968 mrna3331
Eterm   791 37  mrna3331

In your err.ann file it calls an error, but notice that it is looking on contig_677 instead of contig_912:

>region-2960 (contig_677 0..2597)
Eterm   791 37  mrna3331
Exon    1420    968 mrna3331
Einit   1598    1556    mrna3331

In contrast, in my run, if we look at this error:

>region-786 (contig_174 68..5099)
Eterm   1202    1001    EOG09330RD9
Exon    1673    1274    EOG09330RD9
Exon    2171    1747    EOG09330RD9
Exon    2542    2453    EOG09330RD9
Exon    2675    2655    EOG09330RD9
Exon    3531    3496    EOG09330RD9
Einit   4032    3998    EOG09330RD9

I can find it in my zff with the correct contig:

>contig_174
Einit   38963   38820   EOG09331GEW
Eterm   38503   37598   EOG09331GEW
Esngl   46563   44488   EOG09330CZ0
Einit   43414   43367   EOG093307P1
Exon    43234   43147   EOG093307P1
Exon    43061   42991   EOG093307P1
Exon    42918   42686   EOG093307P1
Exon    42566   41645   EOG093307P1
Eterm   41575   40772   EOG093307P1
Einit   49276   49187   EOG09331QDQ
Exon    49020   48803   EOG09331QDQ
Eterm   48730   48676   EOG09331QDQ
Einit   4100    4066    EOG09330RD9
Exon    3599    3564    EOG09330RD9
Exon    2743    2723    EOG09330RD9
Exon    2610    2521    EOG09330RD9
Exon    2239    1815    EOG09330RD9
Exon    1741    1342    EOG09330RD9
Eterm   1270    1069    EOG09330RD9
Esngl   39686   40552   EOG09331GCX
PlantDr430 commented 5 years ago

I figured out the problem. We need to remove contigs/scaffolds from the genome file that do not have any gene models in them. Essentially, the FASTA headers in the genome and the sequences in the zff file need to match, because SNAP doesn't compare names; it just goes in order. So if there are genes in contigs 1 to 10, none in contig_11, and then genes again in contig_12, the zff will not contain any calls for contig_11. But since contig_11 still exists in the genome file, SNAP will assume that contig_11 is contig_12 and will try to match the genes found in contig_12 of the zff to contig_11 of the genome.
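The fix described above can be sketched as a pre-filter (hypothetical helper names, for illustration only; the actual fix lives in funannotate's snap wrapper): drop any genome contig that never appears in the ZFF, so that the n-th FASTA record always pairs with the n-th ZFF sequence block.

```python
def contigs_in_zff(zff_path):
    """Collect the contig names (lines starting with '>') from a SNAP ZFF file."""
    with open(zff_path) as fh:
        return {line[1:].strip() for line in fh if line.startswith(">")}

def filter_fasta_to_zff(fasta_path, zff_path, out_path):
    """Write only the FASTA records whose ID appears in the ZFF, preserving
    order, so SNAP's positional pairing of genome and ZFF stays in sync."""
    keep = contigs_in_zff(zff_path)
    writing = False
    with open(fasta_path) as fh, open(out_path, "w") as out:
        for line in fh:
            if line.startswith(">"):
                writing = line[1:].split()[0].strip() in keep
            if writing:
                out.write(line)
```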

nextgenusfs commented 5 years ago

@PlantDr430 nice work! When I wrote the wrapper I was using a chromosomal assembly of a finished genome and it doesn't seem to error in the funannotate test data, so I guess I didn't notice. I'll write the fix now -- thanks for digging in!

nextgenusfs commented 5 years ago

Okay, the updated snap function will now only run training on scaffolds where there are predictions in the training set (PASA or BUSCO results): https://github.com/nextgenusfs/funannotate/commit/89cacf6b1a0b75031fcff014340b103b2e9f9d92. A git pull should get you up to date. Let me know if this push fixes the snap errors; I ran locally on a draft genome and got 0 errors in the err.ann training file. Again -- thanks for troubleshooting.

PlantDr430 commented 5 years ago

I ran another genome through with version 1.6.0-ad5c0de (although this one had PASA models), but it did not run SNAP or GlimmerHMM. I was passing pre-run SNAP and GlimmerHMM gff's through the --other_gff flag, though. Did it pick up that one was SNAP and another was GlimmerHMM and skip running them within the script?

Attached is the log for this genome. I will run the other genome that I used before through tonight. funannotate-predict.log_v1.6.0-ad5c0de.txt

PlantDr430 commented 5 years ago

Just tried my previous genome and it also didn't run SNAP or GlimmerHMM.

nextgenusfs commented 5 years ago

Hmm, I didn't change any of the logic on whether to run GlimmerHMM or SNAP. I will write a little better log info; the tools need to be in PATH.

nextgenusfs commented 5 years ago

Give it a try again -- hopefully the log/terminal output is more informative. It should now 1) check for an existing result, 2) check that the weight is > 0, and 3) check that all programs are in $PATH -- if all conditions are met, then it will train/run SNAP/GlimmerHMM.
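The three checks could look something like this (a hypothetical helper for illustration, not funannotate's actual code):

```python
import os
import shutil

def should_run_predictor(name, weights, required_tools, existing_result=None, log=print):
    """Gate an optional predictor: skip it when a previous result exists,
    its EVM weight is zero, or a required executable is not on $PATH."""
    if existing_result and os.path.isfile(existing_result):
        log("%s: existing results found, skipping" % name)
        return False
    if weights.get(name, 0) <= 0:
        log("%s: weight set to 0, skipping" % name)
        return False
    missing = [t for t in required_tools if shutil.which(t) is None]
    if missing:
        log("%s: %s not found in $PATH, skipping" % (name, ", ".join(missing)))
        return False
    return True
```

Logging which of the three conditions failed is what makes the missing-executable case (as in the run above) obvious from the terminal output.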

PlantDr430 commented 5 years ago

Stupid me, it was probably the paths. Sorry about that.

EDIT: It was the paths, and the err.ann was empty this time!