tseemann / prokka

:zap: :aquarius: Rapid prokaryotic genome annotation

Modify product #177

Closed haruosuz closed 7 years ago

haruosuz commented 8 years ago

DDBJ suggested modifying product names as follows:

?-D-glucose-1-phosphatase -> alpha-D-glucose-1-phosphatase

lipopolysaccharide kinase (Kdo/WaaP) family protein -> lipopolysaccharide kinase Kdo/WaaP family protein

exo-alpha-(1->6)-L-arabinopyranosidase -> exo-alpha-(1->6)-L-arabinofuranosidase

glutamine amidotransferases class-II -> glutamine amidotransferase class-II

possibl zinc metallo-peptidase -> possible zinc metallopeptidase

haruosuz commented 8 years ago

DDBJ suggested modifying product names as follows:

?-D-glucose-1-phosphatase -> alpha-D-glucose-1-phosphatase

poly(hydroxyalcanoate) granule associated protein -> poly(hydroxyalkanoate) granule associated protein

tseemann commented 8 years ago

[18:44:15] Modify product: Transcription termination factor Rho => hypothetical protein

[18:44:15] Modify product: Transcription termination/antitermination protein NusA => hypothetical protein

tseemann commented 7 years ago

@haruosuz Do you know which databases those annotations came from, e.g. SwissProt, Pfam, CDD, etc.?

haruosuz commented 7 years ago

?-D-glucose-1-phosphatase is from TrEMBL http://www.uniprot.org/uniprot/A0A0P1GEW1

A UniProtKB/Swiss-Prot biocurator stated the following: Yes, '?-D-glucose-1-phosphatase' should be 'Alpha-D-glucose-1-phosphatase'. As you may know, UniProtKB is comprised of two sections, Swiss-Prot (whose entries are manually biocurated) and TrEMBL (whose entries receive an automated annotation). Thus, the TrEMBL entry A0A0P1GEW1 had been annotated by automated means, based on data submitted to ENA (http://www.ebi.ac.uk/ena/data/view/CYSD01000037&display=text) where you can see the erroneous '?' in the product name.

tseemann commented 7 years ago

Yes, I understand that uniprot = swissprot(curated) + trembl(assimilated).

The examples you give are just bad annotations, spelling and terminology in protein names. Prokka comes with swissprot, which has been curated and does not have these problems.

If you want to use TrEMBL then I cannot make Prokka learn all the exceptions. You should curate TrEMBL before providing it to Prokka.
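
As an illustration of what "curate before providing it to Prokka" could look like, here is a minimal sketch that rewrites product names in FASTA headers using the corrections listed earlier in this thread. It assumes headers shaped like `>ID EC_number~~~gene~~~product`; that layout, and the names `CORRECTIONS`, `fix_header` and `curate_fasta`, are assumptions made for this example and are not part of Prokka.

```python
# Corrections copied from the product names discussed in this thread.
CORRECTIONS = {
    "lipopolysaccharide kinase (Kdo/WaaP) family protein":
        "lipopolysaccharide kinase Kdo/WaaP family protein",
    "exo-alpha-(1->6)-L-arabinopyranosidase":
        "exo-alpha-(1->6)-L-arabinofuranosidase",
    "glutamine amidotransferases class-II":
        "glutamine amidotransferase class-II",
    "possibl zinc metallo-peptidase": "possible zinc metallopeptidase",
    "?-D-glucose-1-phosphatase": "alpha-D-glucose-1-phosphatase",
    "poly(hydroxyalcanoate) granule associated protein":
        "poly(hydroxyalkanoate) granule associated protein",
}


def fix_header(line, corrections=CORRECTIONS):
    """Return a FASTA header with its product name corrected, if listed."""
    if not line.startswith(">"):
        return line  # sequence line, leave untouched
    ident, _, desc = line[1:].partition(" ")
    if not desc:
        return line  # header without a description
    fields = desc.split("~~~")
    if len(fields) >= 3:
        # Assumed layout: EC_number~~~gene~~~product
        fields[2] = corrections.get(fields[2], fields[2])
        return ">" + ident + " " + "~~~".join(fields)
    # Plain header: treat the whole description as the product name.
    return ">" + ident + " " + corrections.get(desc, desc)


def curate_fasta(lines, corrections=CORRECTIONS):
    """Apply fix_header to every line of a FASTA file (given as a list of lines)."""
    return [fix_header(line.rstrip("\n"), corrections) for line in lines]
```

An exact-match table like this only catches names someone has already reported, which is the limitation described above: the exceptions have to be collected by whoever maintains the database.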

haruosuz commented 7 years ago

I misunderstood. /product="?-D-glucose-1-phosphatase" came from /inference="protein motif:CLUSTERS:PRK09456"

find prokka-1.11/db/hmm -name "*.hmm" | xargs grep "PRK09456"
Binary file prokka-1.11/db/hmm/CLUSTERS.hmm matches
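
grep only reports that the file matches, possibly because the file contains bytes it does not treat as text. A minimal sketch for printing the matching record instead, assuming CLUSTERS.hmm follows the HMMER3 ASCII layout (header lines such as `NAME`, `ACC` and `DESC`, with each record terminated by `//`); the function name `find_hmm_records` is made up for this example. Non-ASCII bytes are shown as `\xNN` escapes so that a bad character in the description becomes visible.

```python
def find_hmm_records(path, needle):
    """Yield (NAME, ACC, DESC) for each HMM record whose header mentions needle."""
    key = needle.encode("ascii")
    wanted = (b"NAME", b"ACC", b"DESC")
    header = {}
    # Read as bytes so that stray non-text bytes cannot raise a decoding error.
    with open(path, "rb") as handle:
        for raw in handle:
            line = raw.rstrip(b"\r\n")
            if line == b"//":  # end of one HMM record
                if any(key in value for value in header.values()):
                    yield tuple(
                        header.get(tag, b"").decode("ascii", errors="backslashreplace")
                        for tag in wanted
                    )
                header = {}
                continue
            tag, _, value = line.partition(b" ")
            if tag in wanted:
                header[tag] = value.strip()


if __name__ == "__main__":
    for record in find_hmm_records("prokka-1.11/db/hmm/CLUSTERS.hmm", "PRK09456"):
        print(record)
```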

haruosuz commented 7 years ago

Here are comments by DDBJ curators:

The following EC_numbers are not valid.

1.14.13.-
5.99.1.-

The first letter of product names should be changed to lowercase except for person names and abbreviations.

Ankyrin repeat protein
Transposase

Descriptions in parentheses should be moved to the note

Pentapeptide repeats (8 copies)
Periplasmic protein TonB (fragment)
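
The three curator comments above could be checked mechanically. The sketch below is a heuristic written for this thread, not Prokka's own cleanup code: `FLAGGED_EC`, `lowercase_first` and `tidy_product` are hypothetical names, the EC list contains only the two values quoted above, and person names cannot be detected automatically, so they have to be passed in by the caller.

```python
import re

# Only the two EC numbers the curators listed; not a general validity check.
FLAGGED_EC = {"1.14.13.-", "5.99.1.-"}

# A parenthetical at the very end of the name, e.g. "... (fragment)".
# Internal ones such as "poly(hydroxyalkanoate)" are left alone.
TRAILING_PAREN = re.compile(r"^(.*\S)\s+\(([^()]*)\)$")


def lowercase_first(product, proper_nouns=frozenset()):
    """Lowercase the first word unless it looks like an abbreviation or a listed name."""
    first, sep, rest = product.partition(" ")
    # Only words shaped like "Transposase" are changed; "TonB", "DNA", "50S" are kept.
    if first not in proper_nouns and re.fullmatch(r"[A-Z][a-z]+", first):
        first = first.lower()
    return first + sep + rest


def tidy_product(product, proper_nouns=frozenset()):
    """Return (product, note): trailing parenthetical moved to note, first letter lowered."""
    note = None
    match = TRAILING_PAREN.match(product)
    if match:
        product, note = match.group(1), match.group(2)
    return lowercase_first(product, proper_nouns), note


if __name__ == "__main__":
    for name in ("Ankyrin repeat protein", "Transposase",
                 "Pentapeptide repeats (8 copies)",
                 "Periplasmic protein TonB (fragment)"):
        print(name, "->", tidy_product(name))
```

Restricting the lowercase rule to words with exactly one leading capital is a deliberate simplification: it leaves mixed-case gene names and acronyms alone, at the cost of needing an explicit `proper_nouns` set for names such as "Holliday".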