Closed gamcil closed 5 years ago
Sorry if instructions aren't clear, from the help menu the --fix
option takes 3 column file that has the following format:
GeneID\tName\tProduct
$ funannotate annotate
Usage: funannotate annotate <arguments>
version: 1.3.3-e1048c0
Description: Script functionally annotates the results from funannotate predict. It pulls
annotation from PFAM, InterPro, EggNog, UniProtKB, MEROPS, CAZyme, and GO ontology.
Required: -i, --input Folder from funannotate predict
or
--genbank Genome in GenBank format
-o, --out Output folder for results
or
--gff Genome GFF3 annotation file
--fasta Genome in multi-fasta format
-s, --species Species name, use quotes for binomial, e.g. "Aspergillus fumigatus"
-o, --out Output folder for results
Optional: --sbt NCBI submission template file. (Recommended)
-a, --annotations Custom annotations (3 column tsv file)
--eggnog Eggnog-mapper annotations file (if NOT installed)
--antismash antiSMASH secondary metabolism results (GBK file from output)
--iprscan InterProScan5 XML file
--phobius Phobius pre-computed results (if phobius NOT installed)
--isolate Isolate name
--strain Strain name
--rename Rename GFF gene models with locus_tag from NCBI.
--fix Gene/Product names fixed (TSV: GeneID Name Product)
--remove Gene/Product names to remove (TSV: Gene Product)
--busco_db BUSCO models. Default: dikarya
-t, --tbl2asn Additional parameters for tbl2asn. Example: "-l paired-ends"
-d, --database Path to funannotate database. Default: $FUNANNOTATE_DB
--force Force over-write of output folder
--cpus Number of CPUs to use. Default: 2
This format is found in the Gene2Products.must-fix.txt
output file. The Gene2Products.need-curating.txt
file contains descriptions that funannotate converted into a very generic format, i.e. name=Eif3k; product=Eif3kp
, just adding a 'p' at the end of the gene name - this technically passes NCBI tbl2asn rules, but would be better if it was improved.
I see though how this is confusing -- would it be better/make more sense to print out all of the genes in the Gene2Products.need-curating.txt
file? It is currently doing it this way because of how the dictionaries are organized, but should be able to get the same type of output as in Gene2Products.must-fix.txt
if that is what you are expecting.
Ah, I was under the impression from https://github.com/nextgenusfs/gene2product/issues/2 you could curate the product descriptions and then directly feed them back into annotate. Is the intended use to curate, ensure they pass tbl2asn rules, run update-gene2product.py
to add them to ncbi_cleaned_gene_products.txt
and then re-run annotate from there? If that's the case, what's the best way to check they do pass tbl2asn?
Adding IDs to Gene2Products.need-curating.txt
would be convenient for fixing up individual genomes directly, but if there's a clear way to do the above there's probably no point.
Yes my intent was to curate those gene names in the database which would then benefit more than just the current genome you are annotating, i.e. future annotations and other users. I don’t currently have a quick way of validating they pass tbl2asn other than to run it and find out. The scripts will flag some but probably not all that don’t pass - it’s challenging because there isn’t an easy way to look up which product names pass their tests.
Ok cool I understand. Thanks for the clarification.
In case it's of any interest to anyone, I wrote a script to help out the curation process (https://github.com/gamcil/annotation_scripts/blob/master/grab_gene_descriptions.py). Takes a new line separated file of gene names (i.e. from need-curating.txt
), scrapes NCBI for possible gene descriptions (I figure if they're already on NCBI, should probably pass tbl2asn..) and lets the user pick one. Output is a 2 column TSV that can be used with update-gene2products.py
. Also edited to make automated version which just picks the most common description from the NCBI search (https://github.com/gamcil/annotation_scripts/blob/master/auto_gene_descriptions.py).
Nice, I'll give it a try. So basically I built the Gene2Products database by scrapping NCBI annotation for all fungal genomes -- what I found is that there are a crazy number of descriptions for each gene name and I think it is indicative of how the "rules" have changed over the years from NCBI as they have not (at least that I know of) adopted a standardized set of rules (there are of course some rules that funannotate tries to work around) but you end up with many different descriptions for the same gene. Part of why I thought Gene2Products would be a nice idea is for us to curate names/descriptions and have some consistency between annotations.
I can see that - for most of the gene names I've tried there seems to be at least ~5 different descriptions that are roughly the same with slight differences (e.g. A instead of alpha, II instead of subunit 2 etc..). So it definitely still needs some manual intervention to clean up, and it would be nice to define some nomenclature standards to adhere to. However I'm finding the NCBI annotations are often still a better starting point - in the new-names-passed.txt
file I'm getting a lot of lacking descriptions like )-reductase or aminotransferase, single letters or other weird crops of bigger descriptions. For this purpose I think my scripts approach is probably reasonable, i.e. query the gene database for the gene name, take a large pool of the results and determine the most common description (the vast majority of records seem to usually have the same one) before manual curation.
Yes some of the longer descriptions come from EggNog -- and some of the ones that look truncated come from the challenging of parsing UniProtKb/SwissProt definitions -- it would be best if they were cleaned up at their source (EggNog/UniProtkb) but I don't know if that is realistic or not or where those names/definitions were derived from. If this works well I can incorporate your scripts into perhaps the funannotate util
menu as a way to look up gene names (if that is okay with you of course).
I think it would be a monumental task to try and clean up each database - only thing we can really do is try and get Gene2Products off the ground so we have more and more mappings of names to reasonable descriptions for our annotations. Happy for you to do whatever you see fit with the scripts :+1:
I’ve been fixing a subset of truncated that made their way from eggnog into gene2annotation but mainly when they got flagged in NCBI submission. I went in and fixed some inconsistency in chitin synthase numbering etc, Though in some cases inconsistency because there isn’t a HUGO style gene naming or product description guidelines.
Jason
Jason E Stajich, PhD Professor and Director, Microbiology Graduate Program Department of Microbiology and Plant Pathology University of California, Riverside http://lab.stajich.org @stajichlab @hyphaltip @zygolife +1 951.827.2363
On May 29, 2018, 7:18 AM -0700, Jon Palmer notifications@github.com, wrote:
Yes some of the longer descriptions come from EggNog -- and some of the ones that look truncated come from the challenging of parsing UniProtKb/SwissProt definitions -- it would be best if they were cleaned up at their source (EggNog/UniProtkb) but I don't know if that is realistic or not or where those names/definitions were derived from. If this works well I can incorporate your scripts into perhaps the funannotate util menu as a way to look up gene names (if that is okay with you of course). — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.
Hi Jon,
I'm trying to follow the steps you outline in https://github.com/nextgenusfs/gene2product/issues/2.
The need-curating.txt file after my annotation run has output like this (mostly eggNOG sentences it seems):
Overall, this genome has 10 that need to be curated. So I search the gene names on NCBI and grab cleaner descriptions:
Then I run annotate --fix with this edited need-curating.txt file. However, when this finishes it just spits back the same 10 gene models as needing curation, i.e.:
And the product descriptions in the final genbank file remain as e.g.
This seems to be the relevant section of funannotate-functional.py:
This check doesn't do anything since cols[0] are just the annotation gene names, not the actual gene IDs to use as keys in Annotations. I figure gene ID's need to be added to the need-curating.txt file for this to work?
A quick workaround on my end is just to build sed queries and perform subprocess calls to replace the placeholders with the cleaned descriptions in the all.annotations file, e.g.:
But this obviously doesn't interface with your Gene2Products system at all, and I don't know if the new descriptions are passing tbl2asn.
Any ideas?