haruosuz closed this issue 7 years ago
DDBJ suggested modifying the product names as follows:
?-D-glucose-1-phosphatase -> alpha-D-glucose-1-phosphatase
poly(hydroxyalcanoate) granule associated protein -> poly(hydroxyalkanoate) granule associated protein
[18:44:15] Modify product: Transcription termination factor Rho => hypothetical protein
[18:44:15] Modify product: Transcription termination/antitermination protein NusA => hypothetical protein
@haruosuz Do you know which databases those annotations came from, e.g. SwissProt, Pfam, CDD, etc.?
?-D-glucose-1-phosphatase is from TrEMBL http://www.uniprot.org/uniprot/A0A0P1GEW1
A UniProtKB/Swiss-Prot biocurator stated as follows: Yes, '?-D-glucose-1-phosphatase' should be 'Alpha-D-glucose-1-phosphatase'. As you may know, UniProtKB is comprised of two sections, Swiss-Prot (whose entries are manually biocurated) and TrEMBL (whose entries receive an automated annotation). Thus, the TrEMBL entry A0A0P1GEW1 had been annotated by automated means, based on data submitted to ENA (http://www.ebi.ac.uk/ena/data/view/CYSD01000037&display=text) where you can see the erroneous '?' in the product name.
Yes, I understand that UniProt = Swiss-Prot (curated) + TrEMBL (assimilated).
The examples you give are just bad annotations: spelling and terminology errors in protein names. Prokka comes with Swiss-Prot, which has been curated and does not have these problems.
If you want to use TrEMBL, I cannot make Prokka learn all the exceptions. You should curate TrEMBL before providing it to Prokka.
I misunderstood. /product="?-D-glucose-1-phosphatase" came from /inference="protein motif:CLUSTERS:PRK09456"
find prokka-1.11/db/hmm -name "*.hmm" | xargs grep "PRK09456"
Binary file prokka-1.11/db/hmm/CLUSTERS.hmm matches
Here are comments by DDBJ curators:
The following EC_numbers are not valid.
1.14.13.-
5.99.1.-
The first letter of product names should be changed to lowercase except for person names and abbreviations.
Ankyrin repeat protein
Transposase
Descriptions in parentheses should be moved to the note.
Pentapeptide repeats (8 copies)
Periplasmic protein TonB (fragment)
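The two style rules above (lowercase first letter, parenthesised descriptions moved to the note) could be sketched roughly as follows. This is only an illustration, not something Prokka or DDBJ provides: the function name `tidy_product`, the `PERSON_NAMES` set, and the abbreviation heuristic are all assumptions made for the example, and only a trailing parenthesised description is moved, so names such as poly(hydroxyalkanoate) are left untouched.

```python
import re
from typing import Optional, Tuple

# Assumption: a hand-maintained set of first words that are person names
# and therefore keep their capital letter. "Holliday" is only an example.
PERSON_NAMES = {"Holliday"}

# A trailing "(...)" preceded by whitespace, e.g. "TonB (fragment)".
TRAILING_PAREN = re.compile(r"^(.*\S)\s+\(([^()]*)\)$")


def looks_like_abbreviation(word: str) -> bool:
    """Heuristic: an uppercase letter or digit after the first character."""
    return any(ch.isupper() or ch.isdigit() for ch in word[1:])


def tidy_product(product: str) -> Tuple[str, Optional[str]]:
    """Return (product, note) after applying the two style rules."""
    name = product.strip()
    note = None

    # Rule: move a trailing parenthesised description to the note.
    match = TRAILING_PAREN.match(name)
    if match:
        name, note = match.group(1), match.group(2)

    # Rule: lowercase the first letter, except for names and abbreviations.
    first_word = name.split(" ", 1)[0] if name else ""
    if (
        first_word
        and first_word not in PERSON_NAMES
        and not looks_like_abbreviation(first_word)
    ):
        name = name[0].lower() + name[1:]

    return name, note


if __name__ == "__main__":
    for example in (
        "Ankyrin repeat protein",
        "Transposase",
        "Pentapeptide repeats (8 copies)",
        "Periplasmic protein TonB (fragment)",
    ):
        print(example, "->", tidy_product(example))
```

The person-name exception cannot be detected reliably from the text alone, which is why the sketch falls back to an explicit list that would need manual curation.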
DDBJ suggested modifying product names as follows:
?-D-glucose-1-phosphatase -> alpha-D-glucose-1-phosphatase
lipopolysaccharide kinase (Kdo/WaaP) family protein -> lipopolysaccharide kinase Kdo/WaaP family protein
exo-alpha-(1->6)-L-arabinopyranosidase -> exo-alpha-(1->6)-L-arabinofuranosidase
glutamine amidotransferases class-II -> glutamine amidotransferase class-II
possibl zinc metallo-peptidase -> possible zinc metallopeptidase
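One way to apply corrections like these before submission is a plain lookup table keyed on the exact product string. The sketch below is an assumption about how a user might post-process annotations, not part of Prokka; `CORRECTIONS` and `correct_product` are hypothetical names, and the pairs are copied from the suggestions quoted in this thread.

```python
# Exact-match corrections taken from the DDBJ suggestions in this thread.
CORRECTIONS = {
    "?-D-glucose-1-phosphatase":
        "alpha-D-glucose-1-phosphatase",
    "poly(hydroxyalcanoate) granule associated protein":
        "poly(hydroxyalkanoate) granule associated protein",
    "lipopolysaccharide kinase (Kdo/WaaP) family protein":
        "lipopolysaccharide kinase Kdo/WaaP family protein",
    "exo-alpha-(1->6)-L-arabinopyranosidase":
        "exo-alpha-(1->6)-L-arabinofuranosidase",
    "glutamine amidotransferases class-II":
        "glutamine amidotransferase class-II",
    "possibl zinc metallo-peptidase":
        "possible zinc metallopeptidase",
}


def correct_product(product: str) -> str:
    """Return the corrected product name, or the input if no rule applies."""
    cleaned = product.strip()
    return CORRECTIONS.get(cleaned, cleaned)


if __name__ == "__main__":
    for name in ("possibl zinc metallo-peptidase", "transposase"):
        print(name, "->", correct_product(name))
```

An exact-match table only fixes names that are already known to be wrong; it does not generalise to new misspellings, which is the curation burden discussed earlier in the thread.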