nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License
312 stars 83 forks source link

annotate --fix does nothing #174

Closed gamcil closed 5 years ago

gamcil commented 6 years ago

Hi Jon,

I'm trying to follow the steps you outline in https://github.com/nextgenusfs/gene2product/issues/2.

The need-curating.txt file after my annotation run has output like this (mostly eggNOG sentences it seems):

Name Original Description Cleaned Description Error-message

EIF3H Component of the eukaryotic translation initiation factor 3 (eIF-3) complex, which is involved in protein synthesis and, together with other initiation factors, stimulates binding of mRNA and methionyl-tRNAi to the 40S ribosome (By similarity) Eif3hp Product defline failed funannotate checks EIF3K Component of the eukaryotic translation initiation factor 3 (eIF-3) complex, which is involved in protein synthesis and, together with other initiation factors, stimulates binding of mRNA and methionyl-tRNAi to the 40S ribosome (By similarity) Eif3kp Product defline failed funannotate checks METTL1 Catalyzes the formation of N(7)-methylguanine at position 46 (m7G46) in tRNA (By similarity) Mettl1p Product defline failed funannotate checks ...

Overall, this genome has 10 that need to be curated. So I search the gene names on NCBI and grab cleaner descriptions:

Name Original Description Cleaned Description Error-message

EIF3H eukaryotic translation initiation factor 3 subunit H Eif3hp Product defline failed funannotate checks EIF3K eukaryotic translation initiation factor 3 subunit K Eif3kp Product defline failed funannotate checks METTL1 methyltransferase like 1 Mettl1p Product defline failed funannotate checks ...

Then I run annotate --fix with this edited need-curating.txt file. However, when this finishes it just spits back the same 10 gene models as needing curation, i.e.:

10 gene/product names need to be curated, see aban/annotate_results/Gene2Products.need-curating.txt

And the product descriptions in the final genbank file remain as e.g.

/product="Eif3kp"

This seems to be the relevant section of funannotate-functional.py:

#add annotation to tbl annotation file, generate dictionary of dictionaries with values as a list
#need to keep multiple transcripts annotations separate, so this approach may have to modified
Annotations = lib.annotations2dict(ANNOTS)

#to update annotations, user can pass --fix or --remove, update Annotations here
if args.fix:
    with open(args.fix, 'rU') as fixfile:
        for line in fixfile:
            line = line.strip()
            if line.startswith('#'):
                continue
            cols = line.split('\t') #ID Name Description Error (could be more columns i guess)
            if len(cols) < 3: #skip if number of columns isn't correct
                continue
            if cols[0] in Annotations:
                Annotations[cols[0]]['name'] = [cols[1]]
                Annotations[cols[0]]['product'] = [cols[2]]
            if cols[1] in NotInCurated:
                NotInCurated[cols[1]] = [cols[2]]
            if cols[1] in NeedCurating:
                old = NeedCurating.get(cols[1])
                NeedCurating[cols[1]] = (old[0], cols[2])
            if cols[0] in Gene2ProdFinal:
Gene2ProdFinal[cols[0]] = (cols[1], cols[2])

This check doesn't do anything since cols[0] are just the annotation gene names, not the actual gene IDs to use as keys in Annotations. I figure gene ID's need to be added to the need-curating.txt file for this to work?

A quick workaround on my end is just to build sed queries and perform subprocess calls to replace the placeholders with the cleaned descriptions in the all.annotations file, e.g.:

sed_cmd = 's/' + cols[2] + '/' + cols[1] + '/g'
subprocess.call(["sed, "-i", "-e", sed_cmd, ANNOTS])

But this obviously doesn't interface with your Gene2Products system at all, and I don't know if the new descriptions are passing tbl2asn.

Any ideas?

nextgenusfs commented 6 years ago

Sorry if instructions aren't clear, from the help menu the --fix option takes 3 column file that has the following format:

GeneID\tName\tProduct
$ funannotate annotate

Usage:       funannotate annotate <arguments>
version:     1.3.3-e1048c0

Description: Script functionally annotates the results from funannotate predict.  It pulls
             annotation from PFAM, InterPro, EggNog, UniProtKB, MEROPS, CAZyme, and GO ontology.

Required:    -i, --input        Folder from funannotate predict
          or
             --genbank          Genome in GenBank format
             -o, --out          Output folder for results
          or   
             --gff              Genome GFF3 annotation file
             --fasta            Genome in multi-fasta format
             -s, --species      Species name, use quotes for binomial, e.g. "Aspergillus fumigatus"
             -o, --out          Output folder for results

Optional:    --sbt              NCBI submission template file. (Recommended)
             -a, --annotations  Custom annotations (3 column tsv file)
             --eggnog           Eggnog-mapper annotations file (if NOT installed)
             --antismash        antiSMASH secondary metabolism results (GBK file from output)
             --iprscan          InterProScan5 XML file
             --phobius          Phobius pre-computed results (if phobius NOT installed)
             --isolate          Isolate name
             --strain           Strain name
             --rename           Rename GFF gene models with locus_tag from NCBI.
             --fix              Gene/Product names fixed (TSV: GeneID   Name    Product)
             --remove           Gene/Product names to remove (TSV: Gene Product)
             --busco_db         BUSCO models. Default: dikarya
             -t, --tbl2asn      Additional parameters for tbl2asn. Example: "-l paired-ends"
             -d, --database     Path to funannotate database. Default: $FUNANNOTATE_DB
             --force            Force over-write of output folder
             --cpus             Number of CPUs to use. Default: 2

This format is found in the Gene2Products.must-fix.txt output file. The Gene2Products.need-curating.txt file contains descriptions that funannotate converted into a very generic format, i.e. name=Eif3k; product=Eif3kp, just adding a 'p' at the end of the gene name - this technically passes NCBI tbl2asn rules, but would be better if it was improved.

I see though how this is confusing -- would it be better/make more sense to print out all of the genes in the Gene2Products.need-curating.txt file? It is currently doing it this way because of how the dictionaries are organized, but should be able to get the same type of output as in Gene2Products.must-fix.txt if that is what you are expecting.

gamcil commented 6 years ago

Ah, I was under the impression from https://github.com/nextgenusfs/gene2product/issues/2 you could curate the product descriptions and then directly feed them back into annotate. Is the intended use to curate, ensure they pass tbl2asn rules, run update-gene2product.py to add them to ncbi_cleaned_gene_products.txt and then re-run annotate from there? If that's the case, what's the best way to check they do pass tbl2asn?

Adding IDs to Gene2Products.need-curating.txt would be convenient for fixing up individual genomes directly, but if there's a clear way to do the above there's probably no point.

nextgenusfs commented 6 years ago

Yes my intent was to curate those gene names in the database which would then benefit more than just the current genome you are annotating, i.e. future annotations and other users. I don’t currently have a quick way of validating they pass tbl2asn other than to run it and find out. The scripts will flag some but probably not all that don’t pass - it’s challenging because there isn’t an easy way to look up which product names pass their tests.

gamcil commented 6 years ago

Ok cool I understand. Thanks for the clarification.

In case it's of any interest to anyone, I wrote a script to help out the curation process (https://github.com/gamcil/annotation_scripts/blob/master/grab_gene_descriptions.py). Takes a new line separated file of gene names (i.e. from need-curating.txt), scrapes NCBI for possible gene descriptions (I figure if they're already on NCBI, should probably pass tbl2asn..) and lets the user pick one. Output is a 2 column TSV that can be used with update-gene2products.py. Also edited to make automated version which just picks the most common description from the NCBI search (https://github.com/gamcil/annotation_scripts/blob/master/auto_gene_descriptions.py).

nextgenusfs commented 6 years ago

Nice, I'll give it a try. So basically I built the Gene2Products database by scrapping NCBI annotation for all fungal genomes -- what I found is that there are a crazy number of descriptions for each gene name and I think it is indicative of how the "rules" have changed over the years from NCBI as they have not (at least that I know of) adopted a standardized set of rules (there are of course some rules that funannotate tries to work around) but you end up with many different descriptions for the same gene. Part of why I thought Gene2Products would be a nice idea is for us to curate names/descriptions and have some consistency between annotations.

gamcil commented 6 years ago

I can see that - for most of the gene names I've tried there seems to be at least ~5 different descriptions that are roughly the same with slight differences (e.g. A instead of alpha, II instead of subunit 2 etc..). So it definitely still needs some manual intervention to clean up, and it would be nice to define some nomenclature standards to adhere to. However I'm finding the NCBI annotations are often still a better starting point - in the new-names-passed.txt file I'm getting a lot of lacking descriptions like )-reductase or aminotransferase, single letters or other weird crops of bigger descriptions. For this purpose I think my scripts approach is probably reasonable, i.e. query the gene database for the gene name, take a large pool of the results and determine the most common description (the vast majority of records seem to usually have the same one) before manual curation.

nextgenusfs commented 6 years ago

Yes some of the longer descriptions come from EggNog -- and some of the ones that look truncated come from the challenging of parsing UniProtKb/SwissProt definitions -- it would be best if they were cleaned up at their source (EggNog/UniProtkb) but I don't know if that is realistic or not or where those names/definitions were derived from. If this works well I can incorporate your scripts into perhaps the funannotate util menu as a way to look up gene names (if that is okay with you of course).

gamcil commented 6 years ago

I think it would be a monumental task to try and clean up each database - only thing we can really do is try and get Gene2Products off the ground so we have more and more mappings of names to reasonable descriptions for our annotations. Happy for you to do whatever you see fit with the scripts :+1:

hyphaltip commented 6 years ago

I’ve been fixing a subset of truncated that made their way from eggnog into gene2annotation but mainly when they got flagged in NCBI submission. I went in and fixed some inconsistency in chitin synthase numbering etc, Though in some cases inconsistency because there isn’t a HUGO style gene naming or product description guidelines.

Jason

Jason E Stajich, PhD Professor and Director, Microbiology Graduate Program Department of Microbiology and Plant Pathology University of California, Riverside http://lab.stajich.org @stajichlab @hyphaltip @zygolife +1 951.827.2363

On May 29, 2018, 7:18 AM -0700, Jon Palmer notifications@github.com, wrote:

Yes some of the longer descriptions come from EggNog -- and some of the ones that look truncated come from the challenging of parsing UniProtKb/SwissProt definitions -- it would be best if they were cleaned up at their source (EggNog/UniProtkb) but I don't know if that is realistic or not or where those names/definitions were derived from. If this works well I can incorporate your scripts into perhaps the funannotate util menu as a way to look up gene names (if that is okay with you of course). — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.