glygener / glygen-issues

Repository for public GlyGen tickets
GNU General Public License v3.0

Changes to EC numbers on the protein pages #1183

Open katewarner opened 4 months ago

katewarner commented 4 months ago

I found an issue with the EC numbers on the protein pages and I have some suggestions for changes to how the EC numbers are displayed on the front end that we could discuss during the general meeting:

  1. On the protein pages, the EC numbers in the Names section (e.g. https://www.glygen.org/protein/Q9VJ81-1#Names) have broken links; in this case it is 2.4.1.17, shown in the picture below. image
  2. The EC number is a classification system for enzymes that describes the reaction the enzyme catalyses, so I think it would be more informative for the user if the EC link took you to Rhea (e.g. https://www.rhea-db.org/rhea?query=ec:2.4.1.17) or enzyme.expasy (e.g. https://enzyme.expasy.org/EC/2.4.1.17) rather than UniProt.
  3. I think it would be better if it was displayed on the front end as EC: 2.4.1.17 or UniProtKB: EC: 2.4.1.17 rather than UniProtKB: 2.4.1.17, as just the number linked to UniProt doesn't really make sense unless you're familiar with EC numbers, which not all users will be.
  4. The data is being pulled in as a name but I think it would be better if this information was displayed in the "General" or "Function" section of the protein page. This is because the EC number is not a synonym of the "Protein name", it's the recommended name and description of the reaction it catalyses/activates.
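
As a rough illustration of suggestions 2 and 3, a helper along these lines could build the display label and the outbound links. This is only a sketch: the function name `ec_links` and the simplified EC pattern are assumptions, not part of the GlyGen code base; the URL patterns are the ones quoted above.

```python
import re

# Simplified pattern (an assumption): four dot-separated fields, where the
# last three may be digits or "-" for partially specified classes ("2.4.1.-").
EC_PATTERN = re.compile(r"^\d+(\.(\d+|-)){3}$")


def ec_links(ec_number: str) -> dict:
    """Return a display label and outbound links for one EC number."""
    if not EC_PATTERN.match(ec_number):
        raise ValueError(f"not an EC number: {ec_number!r}")
    return {
        # Label format from suggestion 3.
        "label": f"EC: {ec_number}",
        # Link targets from suggestion 2.
        "rhea": f"https://www.rhea-db.org/rhea?query=ec:{ec_number}",
        "enzyme": f"https://enzyme.expasy.org/EC/{ec_number}",
    }


links = ec_links("2.4.1.17")
print(links["label"])
print(links["rhea"])
print(links["enzyme"])
```
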

ReneRanzinger commented 3 months ago

At the weekly meeting we decided that we will not use the EC numbers as names. Instead they will become part of the functional annotation section.

We need to solve one problem. We could make the text of the function "EC:[ec number] - [accepted name]". However, links to CAZy or Expasy Enzyme cannot be the evidence links because they did not provide this annotation; UniProt did. So UniProt is still the evidence badge. Question for @katewarner, @rykahsay and @sujeetvkulkarni: how can we integrate CAZy or Expasy Enzyme links?

ReneRanzinger commented 3 months ago

We could have a separate EC number array in the JSON. Each entry has:

The frontend would know that these need to be added to the function part and formatted as "EC:[ec number] - [accepted name] (see Expasy Enzyme link or CAZy link)", with the evidence badge taken from the evidence array.

This will require changing the JSON structure of the protein details.

rykahsay commented 2 months ago

@pkay47 --- why are some Brenda xrefs (EC numbers) not integrated into UniProt? For example, there is a Brenda xref connecting P22674 and "ec-3.2.2.27", as shown below:

$ cat downloads/ebi/current/uniprot-proteome-homo-sapiens.nt | grep "3\.2\.2\.27" | grep P22674

<http://purl.uniprot.org/uniprot/P22674> <http://www.w3.org/2000/01/rdf-schema#seeAlso> <http://purl.uniprot.org/brenda/3.2.2.27> .

But, as shown below, there is no "http://purl.uniprot.org/core/enzyme" predicate connecting "P22674" and "ec-3.2.2.27":

$ cat downloads/ebi/current/uniprot-proteome-homo-sapiens.nt | grep "3\.2\.2\.27"  | grep "<http://purl.uniprot.org/core/enzyme>"

<http://purl.uniprot.org/uniprot/P13051> <http://purl.uniprot.org/core/enzyme> <http://purl.uniprot.org/enzyme/3.2.2.27> .
<http://purl.uniprot.org/uniprot/A0A8V8TPS1> <http://purl.uniprot.org/core/enzyme> <http://purl.uniprot.org/enzyme/3.2.2.27> .
<http://purl.uniprot.org/uniprot/A0A8V8TQ66> <http://purl.uniprot.org/core/enzyme> <http://purl.uniprot.org/enzyme/3.2.2.27> .
<http://purl.uniprot.org/uniprot/A0A8V8TNE1> <http://purl.uniprot.org/core/enzyme> <http://purl.uniprot.org/enzyme/3.2.2.27> .
<http://purl.uniprot.org/uniprot/A0A8V8TNJ5> <http://purl.uniprot.org/core/enzyme> <http://purl.uniprot.org/enzyme/3.2.2.27> .
<http://purl.uniprot.org/uniprot/A0A8V8TNW2> <http://purl.uniprot.org/core/enzyme> <http://purl.uniprot.org/enzyme/3.2.2.27> .
<http://purl.uniprot.org/uniprot/F5GYA2> <http://purl.uniprot.org/core/enzyme> <http://purl.uniprot.org/enzyme/3.2.2.27> .

pkay47 commented 2 months ago

image

The http://purl.uniprot.org/core/enzyme predicate is present when the EC number is in the protein names. UniProt help: https://www.uniprot.org/help/protein_names

Xref identifiers can be anything: a UniProt accession, an xref_db_specific_id or an EC number, depending on the xref database. The Brenda xref_id contains the EC number.

P22674 has a Brenda xref but no EC number in its name, so it has no http://purl.uniprot.org/core/enzyme triple.

image

katewarner commented 2 months ago

It looks like all of the human entries are reviewed UniProtKB entries, which means a curator has looked at the entries and doesn't think there is enough evidence to support the EC numbers. Brenda (https://www.brenda-enzymes.org/advanced.php), on the other hand, is interested in enzyme families and EC numbers, and likely uses large-scale analyses to map the EC numbers to proteins. This is why the UniProtKB entries don't have an EC number in the entry but do have a Brenda EC xref.

So my suggestion, for enzymes in all organisms, would be to only display EC numbers in the Function section of GlyGen if they are in UniProtKB but keep the Brenda xrefs in the cross-references section of GlyGen, since future studies may determine that they are enzymes. But we can discuss this during the general meeting.
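
The rule suggested here could be sketched as follows: EC numbers attached with the core/enzyme predicate go to the Function section, while EC numbers that only appear as Brenda xrefs stay in the cross-references. The function name `split_ec_sources` and the regular expression are illustrative assumptions; the predicates and the two sample triples are the ones quoted earlier in this thread.

```python
import re

ENZYME_PREDICATE = "http://purl.uniprot.org/core/enzyme"
SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"
ENZYME_PREFIX = "http://purl.uniprot.org/enzyme/"
BRENDA_PREFIX = "http://purl.uniprot.org/brenda/"

# Only handles IRI-only triples like the ones quoted in this thread;
# lines with literals or blank nodes are skipped.
TRIPLE = re.compile(r"^<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.\s*$")


def split_ec_sources(lines):
    """Return (curated, brenda_only): dicts of accession -> set of EC numbers."""
    curated, brenda = {}, {}
    for line in lines:
        match = TRIPLE.match(line.strip())
        if not match:
            continue
        subj, pred, obj = match.groups()
        accession = subj.rsplit("/", 1)[-1]
        ec_number = obj.rsplit("/", 1)[-1]
        if pred == ENZYME_PREDICATE and obj.startswith(ENZYME_PREFIX):
            curated.setdefault(accession, set()).add(ec_number)
        elif pred == SEE_ALSO and obj.startswith(BRENDA_PREFIX):
            brenda.setdefault(accession, set()).add(ec_number)
    # Keep only Brenda EC numbers that UniProtKB did not also assert.
    brenda_only = {}
    for accession, ecs in brenda.items():
        extra = ecs - curated.get(accession, set())
        if extra:
            brenda_only[accession] = extra
    return curated, brenda_only


sample = [
    "<http://purl.uniprot.org/uniprot/P22674> <http://www.w3.org/2000/01/rdf-schema#seeAlso> <http://purl.uniprot.org/brenda/3.2.2.27> .",
    "<http://purl.uniprot.org/uniprot/P13051> <http://purl.uniprot.org/core/enzyme> <http://purl.uniprot.org/enzyme/3.2.2.27> .",
]
curated, brenda_only = split_ec_sources(sample)
print(curated)
print(brenda_only)
```

With these two triples, P13051 would end up in `curated` (Function section) and P22674 in `brenda_only` (cross-references only).
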

rykahsay commented 2 months ago

The nt files downloaded from EBI do not give a connection between Rhea reaction IDs and EC numbers. This means that for a given EC number such as "2.1.1.45", I cannot create the evidence URL https://www.rhea-db.org/rhea/?query=ec:2.1.1.45 unless I know Rhea has a reaction ID mapping to "2.1.1.45".

In the future, I want @pkay47 to add a predicate that connects Rhea/Reactome/... reaction IDs with EC numbers.

For now, I am creating a new dataset file as follows:

Input: downloads/rhea/current/rhea-ec-iubmb.tsv
Input_readme: downloads/rhea/current/README
Output: reviewed/protein_reaction2ec_rhea.csv
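
A minimal sketch of the check described above, assuming the mapping file is a tab-separated table with a header row. The column names `RHEA_ID` and `EC`, the placeholder reaction ID, and the function names are hypothetical; the real layout is described in the README listed above and may differ.

```python
import csv
import io


def ecs_with_rhea(tsv_file, ec_column="EC"):
    """Collect the EC numbers that appear in a Rhea-to-EC mapping table."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return {row[ec_column].strip() for row in reader if row.get(ec_column)}


def rhea_evidence(ec_number, known_ecs):
    """Return a Rhea evidence entry only if Rhea maps a reaction to this EC."""
    if ec_number not in known_ecs:
        return None
    return {
        "id": ec_number,
        "database": "Rhea",
        "url": f"https://www.rhea-db.org/rhea/?query=ec:{ec_number}",
    }


# In-memory stand-in for the TSV; the reaction ID is a placeholder, not real data.
sample = io.StringIO("RHEA_ID\tEC\n00000\t2.1.1.45\n")
known = ecs_with_rhea(sample)
print(rhea_evidence("2.1.1.45", known))
print(rhea_evidence("9.9.9.9", known))
```

For the real file, the same function could be called with `open("downloads/rhea/current/rhea-ec-iubmb.tsv", newline="")` once the column names are confirmed.
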

With this, the protein detail APIs will have a new property called "enzyme_annotation" (example for P04818 is shown below)

"enzyme_annotation":[
    {
        "ec_number": "2.1.1.45",
        "ec_name": "(6R)-5,10-methylene-5,6,7,8-tetrahydrofolate + dUMP = 7,8-dihydrofolate + dTMP.",
        "evidence": [
            {
                "id": "2.1.1.45",
                "database": "Rhea",
                "url": "https://www.rhea-db.org/rhea/?query=ec:2.1.1.45"
            },
            {
                "id": "2.1.1.45",
                "database": "BRENDA Enzymes",
                "url": "https://www.brenda-enzymes.org/enzyme.php?ecno=2.1.1.45"
            }
        ]
    }
]
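
For illustration, a consumer of this property could turn each entry into the "EC:[ec number] - [accepted name] (see ...)" text discussed earlier in the thread. This is a sketch, not the front-end implementation: the function name `function_lines` is hypothetical, and using the evidence URLs as the "see" links is an assumption.

```python
import json


def function_lines(protein_detail):
    """Format each enzyme_annotation entry as one line of function text."""
    lines = []
    for entry in protein_detail.get("enzyme_annotation", []):
        text = f'EC:{entry["ec_number"]} - {entry["ec_name"]}'
        # Assumption: the evidence URLs double as the "see ..." links.
        links = ", ".join(ev["url"] for ev in entry.get("evidence", []))
        lines.append(f"{text} (see {links})" if links else text)
    return lines


# Same structure as the P04818 example above.
detail = json.loads("""
{
  "enzyme_annotation": [
    {
      "ec_number": "2.1.1.45",
      "ec_name": "(6R)-5,10-methylene-5,6,7,8-tetrahydrofolate + dUMP = 7,8-dihydrofolate + dTMP.",
      "evidence": [
        {"id": "2.1.1.45", "database": "Rhea",
         "url": "https://www.rhea-db.org/rhea/?query=ec:2.1.1.45"},
        {"id": "2.1.1.45", "database": "BRENDA Enzymes",
         "url": "https://www.brenda-enzymes.org/enzyme.php?ecno=2.1.1.45"}
      ]
    }
  ]
}
""")

for line in function_lines(detail):
    print(line)
```
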

rykahsay commented 2 months ago

The API now has the "enzyme_annotation" section

image

pkay47 commented 2 months ago

@rykahsay could you please create a ticket for 'add a predicate that connects Rhea/Reactome/... reaction IDs with EC-numbers', with the predicate name and which databases are required? Also, please update the datamodel - https://docs.google.com/document/d/1MOtPk2wTVb2EL-u1DD2T86J-JX4nQTKPRdScakG-X08/edit#

rykahsay commented 1 month ago

Added:

image

rykahsay commented 2 weeks ago

@pkay47 ... I don't see the triples

$ cat downloads/ebi/current/uniprot-proteome-homo-sapiens.nt | grep  hasEnzyme

pkay47 commented 1 week ago

@rykahsay I was planning for the change to go into the 2024_04 datasets; the release is on 24-jul.

Do you want the current release to be updated? Only human or all datasets?