Closed issue: PlantDr430 closed this 5 years ago
I have a hunch that the problem is that this GFF3 doesn't have "exon" in column 3 but uses CDS instead. Could EVM not be picking these up because it is strictly looking for the word "exon"?
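A quick way to test that hunch is to tally the feature types in column 3 of the GFF3. This is a hedged sketch, not part of funannotate; the function name and file path are placeholders:

```python
from collections import Counter

def feature_types(gff3_path):
    # Tally column 3 (the feature type) of a GFF3 file,
    # skipping comment lines and blank lines.
    counts = Counter()
    with open(gff3_path) as fh:
        for line in fh:
            if line.startswith('#') or not line.strip():
                continue
            cols = line.rstrip('\n').split('\t')
            if len(cols) >= 3:
                counts[cols[2]] += 1
    return counts

# If feature_types('predictions.gff3') shows CDS features but no
# 'exon' entries, a parser that requires 'exon' would miss everything.
```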
It's probably not formatted correctly for EVM -- I will try to have a look when I get home from work. But you can test whether it is valid by using a script in the EVidenceModeler distribution:
$EVM_HOME/EvmUtils/gff3_gene_prediction_file_validator.pl
What other predictors are you using? There has been a request to add SNAP and/or GlimmerHMM.
Yeah, it's not formatted correctly. I had to use the converter from biocode to convert it.
It was I who requested SNAP. I am actually using SNAP, GlimmerHMM, CEGMA (I used to use it to train SNAP, but switched to BUSCO and decided to add the ~400 or so genes from CEGMA into my pipeline as well), and GeneID. I trained both SNAP and GlimmerHMM for each isolate with all single-copy BUSCOs, while GeneID was trained with BUSCOs from the currently published closely related species that are present in all of the samples I am annotating (around 2,500 BUSCOs). I chose this method for GeneID because the parameter file creation steps are rather long, although I am still debating it.
So I am not seeing any differences in the number of gene models coming out of the final EVM run whether or not I pass the other GFFs directly to it. I guess that's good, and shows that your script does an excellent job of identifying genes with confidence, but I would have expected slight differences in gene structure when using the other GFFs versus not using them.
I've also been looking through the "predict" script to find out where/how I can pass these other GFFs into the first EVM run, when you are looking for high-quality predictions to train Augustus. I think I found the location (lines 914 - 997), but I am still learning Python, so it might take some time for me to come up with a way to pass the other GFFs there. I wanted to ask: if, in the future, you add SNAP and/or GlimmerHMM, will their predictions be used to find high-quality predictions prior to the AUGUSTUS run?
Since both SNAP and GlimmerHMM need to be trained, the logical step is to train them using PASA results if present and to default to BUSCO2 results otherwise. The first EVM step in funannotate predict when there is no RNA-seq data is to run BUSCO2 and capture that output. Since GeneMark is self-training, I then slice out the GeneMark models that overlap with the Augustus-mediated BUSCO predictions; those are fed to EVM with any evidence in those regions to generate the final BUSCO2 models. Those BUSCO models are then used to train Augustus and, theoretically, SNAP/GlimmerHMM. CodingQuarry can only be used with RNA-seq data. After all de novo steps are completed, all of the models are fed to a new instance of EVM to generate consensus predictions.
When you add custom GFF3 files, they are added at the final EVM stage with minimal processing: basically, if they pass the EVM validation script, they are added to the pool of de novo predictions.
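The slicing step described above (keeping the GeneMark models that overlap the Augustus-mediated BUSCO predictions on the same contig) could be sketched roughly like this. The interval logic and names are illustrative only, not funannotate's actual code:

```python
def overlaps(a, b):
    # a and b are (start, end) genomic intervals; True if they intersect
    return a[0] < b[1] and b[0] < a[1]

def slice_overlapping(genemark_models, busco_models):
    # Keep only the GeneMark models that overlap a BUSCO-derived model
    # on the same contig; each model is a (contig, start, end) tuple.
    kept = []
    for contig, g_start, g_end in genemark_models:
        for b_contig, b_start, b_end in busco_models:
            if contig == b_contig and overlaps((g_start, g_end), (b_start, b_end)):
                kept.append((contig, g_start, g_end))
                break
    return kept
```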
Ah, I see. So the main difference would be that SNAP and GlimmerHMM would be trained with the high-quality BUSCO models that are used to train AUGUSTUS, instead of with all BUSCOs as I have done. If you did decide to implement SNAP/GlimmerHMM, would you also look for high-quality SNAP/GlimmerHMM predictions (>90% coverage of hints), parse those out, and give them a higher EVM weight, as you do with the AUGUSTUS predictions?
The trick for HiQ is that the exon/intron support information is output in the non-standard Augustus GFF3 format, so I can somewhat easily parse it. Basically I'm doing that to skew the weighting in EVM, since we have a priori knowledge that that particular Augustus prediction is correct. Theoretically this is what EVM is doing anyway: weighting the models that align with the evidence (which is why transcript evidence is important). So in the event that there is no HiQ model, we just let EVM apply its normal scoring system to any predictions in that region. If I had code that could weight all the models, then I guess we wouldn't need EVM, but there are no plans to implement that.
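For context, Augustus runs with hints emit comment lines such as "# % of transcript supported by hints (any source): 100" after each gene model. A minimal sketch of pulling a HiQ flag out of that output; the cutoff, function name, and exact comment wording are assumptions (the wording can vary between Augustus versions), not funannotate's implementation:

```python
import re

# Match the hint-support percentage in an Augustus comment line, e.g.
# "# % of transcript supported by hints (any source): 100"
SUPPORT_RE = re.compile(r'%\s*of transcript supported by hints.*?:\s*(\d+)')

def is_hiq(comment_lines, cutoff=90):
    # Return True if the model's hint support meets the cutoff;
    # False if support is low or no support line is found.
    for line in comment_lines:
        m = SUPPORT_RE.search(line)
        if m:
            return int(m.group(1)) >= cutoff
    return False
```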
Okay, thanks for the response.
Do you still have a need/want to incorporate snap and/or glimmerhmm?
Will sign on and try the new version ASAP. As for SNAP and GlimmerHMM, I have incorporated them into my pre-funannotate run along with other gene model predictors.
Okay, are you passing the output then to the --other_gff flag? Looks like trainGlimmerHMM on Bioconda is broken; I'll try to get that fixed, and then I can probably add SNAP/GlimmerHMM directly to the predict workflow, i.e. using the same Augustus training data to train each of them.
Yeah, I am passing all my other GFFs through the --other_gff flag.
The latest commit adds SNAP and GlimmerHMM -- they will each be trained on filtered PASA models, or default back to BUSCO2 models (if there are no PASA models). I'd be interested to hear your feedback prior to pushing the release. I also added some finer-grained control over the EVM weights via the -w/--weights option, where you can pass prediction:weight pairs, i.e. you could turn off CodingQuarry and set SNAP to 2 by passing -w codingquarry:0 snap:2
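A minimal sketch of how such prediction:weight pairs could be folded into a default weight table; the names and defaults are illustrative, not the actual funannotate code:

```python
def parse_weights(pairs, defaults):
    # pairs come from a flag like -w codingquarry:0 snap:2;
    # each entry overrides the default weight for that predictor.
    weights = dict(defaults)
    for pair in pairs:
        name, _, value = pair.partition(':')
        weights[name.lower()] = int(value)
    return weights
```

A weight of 0 effectively turns a predictor off at the EVM stage, while a weight above 1 skews the consensus toward that predictor's models.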
Nice. Unfortunately, I will not be re-running any more predictions on my current genomes (although I will be using funannotate for all future work). When my annotate run finishes, I will run one of my genomes through without my pre-predicted SNAP and GlimmerHMM models and let you know how that goes.
Good touch with the -w/--weights option as well.
Awesome, I will try out the additional predictors in the near future on our datasets. Do you think you'll push 1.6 out soon, or should I play with the trunk code for now?
Was hoping you guys could give it a try before I tag a release. I’m thinking this is close to the last significant update before we get this thing published.
My annotate runs should definitely be done by tomorrow. I will run some genomes through after that to compare with my previous predictions.
Runs finished early, so I was testing some genomes. I hit a snag with some of the other GFFs that I am passing through.
[05/28/19 13:50:16]: /data/wyka/funannotate-master/bin/funannotate-predict.py -i CCC535_masked.fasta -o CCC535_fun_output -s Claviceps spartinae --strain CCC535 --SeqCenter NCBI --SeqAccession SRPX00000000 --repeats2evm --transcript_evidence /data/wyka/Claviceps_ESTs.fasta --protein_evidence /data/linuxbrew/.linuxbrew/opt/funannotate/database/uniprot_sprot.fasta /data/wyka/Pconf.fasta --cpus 24 --busco_seed_species Claviceps_purpurea_refined --busco_db sordariomyceta_odb9 --name E4U52 --optimize_augustus --soft_mask 1000 --min_protlen 50 --other_gff CCC535_geneid.gff3:1 CCC535_gemoma.gff3:1 CCC535_codingquarry.gff3:1
[05/28/19 13:50:16]: OS: linux2, 24 cores, ~ 264 GB RAM. Python: 2.7.12
[05/28/19 13:50:16]: Running funannotate v1.6.0-ac857a9
[05/28/19 13:50:16]: {'genemark': 1, 'hiq': 2, 'glimmerhmm': 1, 'pasa': 6, 'snap': 1, 'transcripts': 1, 'proteins': 1, 'codingquarry': 2, 'augustus': 1}
[05/28/19 13:50:17]: Augustus training set for claviceps_spartinae_ccc535 already exists. To re-train provide unique --augustus_species argument
[05/28/19 13:50:18]: AUGUSTUS (3.2.2) detected, version seems to be compatible with BRAKER and BUSCO
[05/28/19 13:50:18]: Parsing GFF pass-through: CCC535_geneid.gff3 --> setting source to other_pred1
[05/28/19 13:50:19]: perl /data/linuxbrew/.linuxbrew/opt/evidencemodeler/EvmUtils/gff3_gene_prediction_file_validator.pl /data/wyka/test_funannotate/CCC535/CCC535_fun_output/predict_misc/other1_predictions.gff3
Above is from the log file, below is the error from the terminal
[01:50 PM]: AUGUSTUS (3.2.2) detected, version seems to be compatible with BRAKER and BUSCO
[01:50 PM]: Parsing GFF pass-through: CCC535_geneid.gff3 --> setting source to other_pred1
Traceback (most recent call last):
File "/data/wyka/funannotate-master/bin/funannotate-predict.py", line 391, in <module>
if not featurename in startWeights:
NameError: name 'startWeights' is not defined
Thanks -- sorry, a typo; it should be fixed here (hopefully): https://github.com/nextgenusfs/funannotate/commit/0eff679df12381edf7b7e7830d10c0bd7c0bb326
I cloned and ran v1.6.0-0eff679, but I don't think SNAP or GlimmerHMM ran at all. Attached is the log file. funannotate-predict.log_v1.6.0-0eff679.txt
And can you confirm that snap, fathom, glimmerhmm, and trainGlimmerHMM are in your path? I will look at it later today.
No, they aren't, so I will have to export the paths. I didn't see any additional setup mentioned (such as $EVM_HOME), so it just slipped my mind.
Alright, I made sure to add both the paths to the snap, fathom, glimmerhmm, and trainGlimmerHMM executables and the folders containing them to $PATH, as I wasn't sure exactly how you were searching for them. The run seemed to just run GeneMark and BUSCO and then use those models after the EVM run for Augustus training. Attached is the log file. I also got an error during the initial Augustus training run this time around. funannotate-predict.log_5-29-12pm.txt
[10:21 AM]: Running GeneMark-ES on assembly
[10:28 AM]: Converting GeneMark GTF file to GFF3
[10:28 AM]: Found 7,846 gene models
[10:28 AM]: Running BUSCO to find conserved gene models for training Augustus
[11:05 AM]: 3,435 valid BUSCO predictions found, now formatting for EVM
[11:09 AM]: Setting up EVM partitions
[11:09 AM]: Generating EVM command list
[11:09 AM]: Running EVM commands with 23 CPUs
[11:11 AM]: Combining partitioned EVM outputs
[11:11 AM]: Converting EVM output to GFF3
[11:14 AM]: Collecting all EVM results
[11:14 AM]: 3,332 total gene models from EVM
[11:14 AM]: Checking BUSCO protein models for accuracy
[11:18 AM]: 3,304 BUSCO predictions validated
[11:18 AM]: Training Augustus using BUSCO gene models
Traceback (most recent call last):
File "/data/wyka/funannotate-master/bin/funannotate-predict.py", line 1064, in <module>
lib.runSubprocess(cmd, '.', lib.log)
File "/data/wyka/funannotate-master/lib/library.py", line 694, in runSubprocess
proc = subprocess.Popen(cmd, cwd=dir, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
File "/usr/lib/python2.7/subprocess.py", line 711, in __init__
errread, errwrite)
File "/usr/lib/python2.7/subprocess.py", line 1343, in _execute_child
raise child_exception
OSError: [Errno 20] Not a directory
Sorry for the delay.
I tried again yesterday and was able to run v1.6.0-0eff679 to full completion. I did not get the Augustus error this time, and SNAP and GlimmerHMM both worked. The results were very comparable to mine: ~100 fewer gene models predicted than in my runs, but I also used all BUSCO models to train SNAP and GlimmerHMM, so they predicted more genes from all BUSCOs than from the filtered Augustus models you used. Overall, I'd give it a thumbs up and will use your method next time.
Actually, I think you might be running into a problem similar to one I had with SNAP regarding input GFF format. If we look at the err.ann for the same genome, I saw 3,193 errors from your SNAP run, which uses 3,332 input genes from the busco.final.gff3 file, while my run of SNAP from BUSCOs had only 24 errors in its err.ann file, even though I used about 5,000 genes for training. Both files are attached. I believe the problem is caused by the input GFF files, as I've noticed that every program accepts a slightly different GFF format (which drives me crazy sometimes).
err.ann_funannotate-run.txt err.ann_my-run.txt
EDIT: Although I have been comparing your snap.training.zff file with the ZFF from my run, I can't find anything that looks wrong, so I am not sure why there is such a large difference in the number of errors in the input gene models.
Further edit: What I am noticing is that in your snap.training.zff file, if we look at this example:
>contig_912
Einit 1598 1556 mrna3331
Exon 1420 968 mrna3331
Eterm 791 37 mrna3331
Your err.ann file calls an error for it, but notice that it is looking at contig_677 instead of contig_912:
>region-2960 (contig_677 0..2597)
Eterm 791 37 mrna3331
Exon 1420 968 mrna3331
Einit 1598 1556 mrna3331
In contrast, for my run, if we look at this error:
>region-786 (contig_174 68..5099)
Eterm 1202 1001 EOG09330RD9
Exon 1673 1274 EOG09330RD9
Exon 2171 1747 EOG09330RD9
Exon 2542 2453 EOG09330RD9
Exon 2675 2655 EOG09330RD9
Exon 3531 3496 EOG09330RD9
Einit 4032 3998 EOG09330RD9
I can find it in my ZFF with the correct contig:
>contig_174
Einit 38963 38820 EOG09331GEW
Eterm 38503 37598 EOG09331GEW
Esngl 46563 44488 EOG09330CZ0
Einit 43414 43367 EOG093307P1
Exon 43234 43147 EOG093307P1
Exon 43061 42991 EOG093307P1
Exon 42918 42686 EOG093307P1
Exon 42566 41645 EOG093307P1
Eterm 41575 40772 EOG093307P1
Einit 49276 49187 EOG09331QDQ
Exon 49020 48803 EOG09331QDQ
Eterm 48730 48676 EOG09331QDQ
Einit 4100 4066 EOG09330RD9
Exon 3599 3564 EOG09330RD9
Exon 2743 2723 EOG09330RD9
Exon 2610 2521 EOG09330RD9
Exon 2239 1815 EOG09330RD9
Exon 1741 1342 EOG09330RD9
Eterm 1270 1069 EOG09330RD9
Esngl 39686 40552 EOG09331GCX
I figured out the problem. We need to remove contigs/scaffolds from the genome file that do not have any gene models in them. Essentially, the number of FASTA headers in the genome and ZFF files needs to match; SNAP isn't comparing the names, it just goes in order. So if there are genes in contig_1 through contig_10, none in contig_11, and then genes again in contig_12, the ZFF will contain no calls for contig_11. But since contig_11 still exists in the genome file, SNAP will assume that contig_11 is contig_12 and will try to match the genes found in contig_12 of the ZFF against contig_11 of the genome.
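A minimal sketch of that fix, assuming ZFF headers are bare contig names: collect the contigs that have gene calls, then write a genome FASTA restricted to those records, preserving the original order so SNAP's positional matching lines up. The function names are mine, not funannotate's:

```python
def scaffolds_with_models(zff_path):
    # ZFF headers look like '>contig_912'; collect the set that have gene calls
    names = set()
    with open(zff_path) as fh:
        for line in fh:
            if line.startswith('>'):
                names.add(line[1:].strip())
    return names

def filter_fasta(fasta_path, keep, out_path):
    # Write only the FASTA records whose header names appear in `keep`,
    # keeping the original record order.
    write = False
    with open(fasta_path) as fh, open(out_path, 'w') as out:
        for line in fh:
            if line.startswith('>'):
                write = line[1:].split()[0] in keep
            if write:
                out.write(line)

# Usage sketch:
# keep = scaffolds_with_models('snap.training.zff')
# filter_fasta('genome.fasta', keep, 'genome.training.fasta')
```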
@PlantDr430 nice work! When I wrote the wrapper I was using a chromosomal assembly of a finished genome, and it doesn't seem to error on the funannotate test data, so I guess I didn't notice. I'll write the fix now -- thanks for digging in!
Okay, the updated snap function will now only run training on scaffolds where there are predictions in the training set (PASA or BUSCO results): https://github.com/nextgenusfs/funannotate/commit/89cacf6b1a0b75031fcff014340b103b2e9f9d92. A git pull should get you up to date. Let me know if this push fixes the snap errors; I ran locally on a draft genome and got 0 errors in the err.ann training file. Again -- thanks for troubleshooting.
I ran another genome through with version 1.6.0-ad5c0de (although this one had PASA models), but it did not run SNAP or GlimmerHMM. I was passing pre-run SNAP and GlimmerHMM GFFs through the --other_gff flag. Did it detect that one was SNAP and another was GlimmerHMM and skip running them within the script?
Attached is the log for this genome. I will run the other genome that I used before through tonight. funannotate-predict.log_v1.6.0-ad5c0de.txt
Just tried my previous genome and it also didn't run SNAP or GlimmerHMM.
Hmm, I didn't change any of the logic on whether to run GlimmerHMM or SNAP. I will write somewhat better log info; note that the tools need to be in PATH.
Give it a try again -- hopefully the log/terminal output is more informative. It should now 1) check for an existing result, 2) check that the weight is > 0, and 3) check that all programs are in $PATH -- if all conditions are met, it will train/run SNAP/GlimmerHMM.
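The $PATH check (condition 3) boils down to something like this in Python 3 terms; this is a sketch of the idea, not the actual funannotate implementation:

```python
import shutil

def missing_tools(tools=('snap', 'fathom', 'glimmerhmm', 'trainGlimmerHMM')):
    # Return the names of required executables not found on $PATH;
    # an empty list means all the tools are available.
    return [t for t in tools if shutil.which(t) is None]
```

If the returned list is non-empty, the predictor step can be skipped with a log message naming exactly which executables are missing.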
Stupid me, it was probably the paths. Sorry about that.
EDIT: It was the paths, and the err.ann was empty this time!
Version - 1.5.1
During the run I noticed that all of the GFF3s I passed into the prediction function were given IDs (other_pred1 through other_pred4), as I passed four GFFs in. When I looked at the predict log file, I saw this:
[11/29/18 06:58:28]: Summary of gene models: {'genemark': 8292, 'hiq': 2244, 'other_pred3': 7719, 'pasa': 0, 'other_pred1': 9189, 'augustus': 5111, 'total': 32995, 'other_pred2': 440}
[11/29/18 06:58:28]: EVM Weights: {'genemark': '1', 'hiq': '5', 'other_pred4': '1', 'other_pred3': '1', 'other_pred2': '1', 'other_pred1': '1', 'augustus': '1'}
Notice that in the "Summary of gene models" there is no other_pred4, even though other_pred4 appears in the EVM weights.
I also see that other_pred4 was parsed in and should have about 8,433 genes, so the "total" isn't including the gene models from this GFF.
[11/29/18 02:20:39]: Parsing GFF pass-through: LM84_glimmerhmm.gff3 --> setting source to other_pred4
[11/29/18 02:20:39]: perl /data/linuxbrew/.linuxbrew/opt/evidencemodeler/EvmUtils/gff3_gene_prediction_file_validator.pl /data/wyka/funannotate/LM84/LM84_fun_output/predict_misc/other4_predictions.gff3
I have attached the predict log file and the GlimmerHMM GFF3 file. Let me know if you need another log file; I just want to make sure EVM is processing all of the gene models.
funannotate-predict.log
LM84_glimmerhmm.txt