uniprot / enzymeportal

The EBI Enzyme Portal
http://www.ebi.ac.uk/enzymeportal/
Apache License 2.0

CHEBI compounds #102

Closed rafael-alcantara closed 11 years ago

rafael-alcantara commented 11 years ago

Retrieve the appropriate ChEBI compounds and populate the mega mapper.

Instead of populating the mega-map from the EB-Eye XML file, we have to take the ChEBI-UniProt xrefs from the ChEBI database, which includes the location_in_ref column (i.e. it may give a defined relationship between the ChEBI compound and the UniProt protein, and some semantics to the xref).

rafael-alcantara commented 11 years ago

Author: ralcantara

We have to map the existing values of the location_in_ref column (i.e. the type of [http://web.expasy.org/docs/userman.html line] where the compound was found) to a Relationship enum value. These are the ones of interest to us:

'''CC - CATALYTIC ACTIVITY''' Description of the reaction(s) catalyzed by an enzyme. This is definitely interesting and never misleading. ''Use a new Relationship "is_substrate_or_product_of".''

'''CC - COFACTOR''' Description of any non-protein substance required by an enzyme for its catalytic activity. No mistake here either. ''Use existing "is_cofactor_of" Relationship.''

'''CC - ENZYME REGULATION''' Description of an enzyme regulatory mechanism. These cross references can be misleading. For example, in this case:

    CC   -!- ENZYME REGULATION: The activity of this enzyme is controlled by
    CC       adenylation under conditions of abundant glutamine. The fully
    CC       adenylated enzyme complex is inactive (By similarity).

glutamine does not regulate the enzyme; adenylation does. So we could not say that 'glutamine regulates foobar'. In this other case:

    CC   -!- ENZYME REGULATION: Completely inhibited by Hg(2+), partially
    CC       inhibited by Mn(2+), Cu(2+) and Pb(2+). Unaffected by Ca(2+),
    CC       Mg(2+) and EDTA.

neither Ca(2+) nor Mg(2+) nor EDTA regulates the enzyme. ''Use generic is_related_to relationship.''

'''CC - INDUCTION''' Description of the compound(s) or condition(s) that regulate gene expression. We may have an xref from a ChEBI compound which does not actually induce the expression:

    CC   -!- INDUCTION: By heat shock, salt stress, oxidative stress, glucose
    CC       limitation and oxygen limitation.

Here it is not glucose or oxygen that induces the expression, but the lack of them. So storing something like 'glucose induces_expression_of foobar' in our mega-map would be not just inaccurate and misleading, but wrong. ''Use generic is_related_to relationship.''

'''CC - PHARMACEUTICAL''' Description of the use of a protein as a pharmaceutical drug. As this refers to the protein as a drug, a compound mentioned here could play any role. ''Use generic is_related_to relationship.''

'''CC - PTM''' Description of any chemical alteration of a polypeptide (proteolytic cleavage, amino acid modifications including crosslinks). This topic complements information given in the feature table or indicates polypeptide modifications for which position-specific data is not available. From the examples in the [http://web.expasy.org/docs/userman.html#CC_line UniProt documentation] we can only say that this is a free-text line, where a compound name may have very different semantics. ''Use generic is_related_to relationship.''

'''CC - TOXIC DOSE''' Description of the lethal dose (LD), paralytic dose (PD) or effective dose of a protein. This refers to the protein, not to any compound (much like the PHARMACEUTICAL line). ''Use generic is_related_to relationship.''
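
The mapping listed above could be sketched in Java roughly as follows. This is only an illustration, not the actual Enzyme Portal code: the class name LocationInRefMapper, the method fromLocation and the upper-case enum constants are made up here, and it assumes the location_in_ref column holds strings such as "CC - COFACTOR" exactly as written above. Any value that is not recognised falls back to the generic relationship.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: maps a location_in_ref value to a Relationship.
class LocationInRefMapper {

    // Constants mirror the relationship names proposed above
    // (is_substrate_or_product_of, is_cofactor_of, is_related_to).
    enum Relationship {
        IS_SUBSTRATE_OR_PRODUCT_OF,
        IS_COFACTOR_OF,
        IS_RELATED_TO
    }

    // Only the two CC topics considered unambiguous get a specific relationship.
    // Assumption: the column values look like "CC - <TOPIC>".
    private static final Map<String, Relationship> SPECIFIC = Map.of(
            "CC - CATALYTIC ACTIVITY", Relationship.IS_SUBSTRATE_OR_PRODUCT_OF,
            "CC - COFACTOR", Relationship.IS_COFACTOR_OF);

    // Everything else (ENZYME REGULATION, INDUCTION, PHARMACEUTICAL, PTM,
    // TOXIC DOSE, unknown or null values) maps to the generic relationship.
    static Relationship fromLocation(String locationInRef) {
        if (locationInRef == null) {
            return Relationship.IS_RELATED_TO;
        }
        String key = locationInRef.trim().toUpperCase(Locale.ROOT);
        return SPECIFIC.getOrDefault(key, Relationship.IS_RELATED_TO);
    }

    public static void main(String[] args) {
        String[] samples = {
            "CC - CATALYTIC ACTIVITY",
            "CC - COFACTOR",
            "CC - ENZYME REGULATION",
            "CC - INDUCTION"
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + fromLocation(sample));
        }
    }
}
```

Keeping the specific cases in a small lookup and defaulting to is_related_to means a new, unreviewed line type never gets stronger semantics than it deserves.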

So I am afraid that the free-text nature of CC lines makes it very difficult for us to extract semantic information. At least we can distinguish cofactors, reactants and products. But we should contact UniProt to clarify whether it is possible to retrieve a list of activators, inhibitors or drugs from them.

In the search results filter, we should show for now cofactors, substrates, products (ChEBI compounds) and drugs (ChEMBL compounds).