opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Annotate uniprot subcellular locations with SL ontology IDs #1710

Closed d0choa closed 3 years ago

d0choa commented 3 years ago

After investigating the subcellular location data extracted from UniProt and HPA, we concluded that it would be valuable to also extract the SL ontology identifiers available from UniProt.

Subcellular locations are currently part of the target object, each with a location and a source. The query response for BRAF:

{
  "data": {
    "target": {
      "id": "ENSG00000157764",
      "approvedSymbol": "BRAF",
      "subcellularLocations": [
        {
          "location": "Cell membrane",
          "source": "uniprot"
        },
        {
          "location": "Nucleus",
          "source": "uniprot"
        },
        {
          "location": "Cytoplasm",
          "source": "uniprot"
        },
        {
          "location": "Vesicles",
          "source": "HPA_main"
        },
        {
          "location": "Cytosol",
          "source": "HPA_additional"
        }
      ]
    }
  }
}

Each UniProt location is also captured as an ontology term. For example, BRAF is annotated with:

We would like to have the SL identifiers as an extra field in the subcellularLocations objects.

The data is, however, not very friendly. The identifiers are shown in the UniProt front end (FE) but are absent from the text files. An example of the subcellular location block in the txt file:

CC   -!- SUBCELLULAR LOCATION: Nucleus {ECO:0000250}. Cytoplasm
CC       {ECO:0000269|PubMed:19710016}. Cell membrane
CC       {ECO:0000269|PubMed:19710016}. Note=Colocalizes with RGS14 and RAF1 in
CC       both the cytoplasm and membranes. {ECO:0000250}.
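To illustrate what can be recovered from this block, here is a minimal Python sketch that extracts the location labels from the CC lines. The helper name `parse_subcellular_location` is hypothetical, and the parsing rules (evidence tags in curly braces, a trailing `Note=` section, statements ending in a full stop) are assumptions based only on the example above, not on the full UniProt flat-file specification.

```python
import re

# Example block copied from the BRAF entry shown above.
BRAF_CC = [
    "CC   -!- SUBCELLULAR LOCATION: Nucleus {ECO:0000250}. Cytoplasm",
    "CC       {ECO:0000269|PubMed:19710016}. Cell membrane",
    "CC       {ECO:0000269|PubMed:19710016}. Note=Colocalizes with RGS14 and RAF1 in",
    "CC       both the cytoplasm and membranes. {ECO:0000250}.",
]


def parse_subcellular_location(cc_lines):
    """Return the location labels found in a SUBCELLULAR LOCATION block.

    Hypothetical helper: it assumes the layout of the example above.
    """
    # Drop the "CC" prefix (and the "-!-" marker) and join continuation lines.
    text = " ".join(
        re.sub(r"^CC\s+(-!-\s+)?", "", line).strip() for line in cc_lines
    )
    # Remove the topic label.
    text = re.sub(r"^SUBCELLULAR LOCATION:\s*", "", text)
    # Cut off the free-text note, which is not a controlled term.
    text = text.split("Note=")[0]
    # Remove evidence tags such as {ECO:0000269|PubMed:19710016}.
    text = re.sub(r"\s*\{[^}]*\}", "", text)
    # Each remaining location statement ends with a full stop.
    return [part.strip() for part in text.split(".") if part.strip()]


print(parse_subcellular_location(BRAF_CC))
# ['Nucleus', 'Cytoplasm', 'Cell membrane']
```

The output contains labels only, which is the point made here: the SL identifiers have to come from somewhere else.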

So the IDs are not really available in the UniProt entries. Instead, a lookup table of all locations is available separately (page | downloadable file).

A preview of the top rows:

Subcellular location ID | Description | Category | Alias
SL-0476 | The appearance of the striated muscle is created by a pattern of alternating dark A bands and light I bands. A bands comprise thick filaments of myosin and proteins that bind myosin. They are bisected by the H zone, a paler region where the thick and the thin filaments do not overlap. The exact center of the A band is termed the M line. | Cellular component | A band
SL-0002 | The acidocalcisome is an electron-dense acidic organelle which contains a matrix of pyrophosphate and polyphosphates with bound calcium and other cations. Its limiting membrane possesses a number of pumps and exchangers for the uptake and release of these elements. The acidocalcisome does not belong to the endocytic pathway and may represent a branch of the secretory pathway in trypanosomatids and apicomplexan parasites. The acidocalcisome is possibly involved in polyphosphate and cation storage and in adaptation of these microoganisms to environmental stress. | Cellular component | Acidocalcisome
SL-0316 | The acidocalcisome compartment bounded by the acidocalcisomal membrane. | Cellular component | Acidocalcisome lumen
SL-0003 | The membrane of an acidocalcisome. | Cellular component | Acidocalcisome membrane
SL-0007 | The acrosome is a large lysosome-like vesicle overlying the sperm nucleus. This spermatid specific organelle, derived from the Golgi during spermatogenesis, contains both unique acrosomal enzymes and common enzymes associated with lysosomes in somatic cells. Only sperm that have undergone the acrosome reaction can fuse with egg plasma membrane. The acrosome reaction is characterized by multiple fusions of the outer acrosomal membrane with the sperm cell membrane. | Cellular component | acrosome
SL-0004 | The portion of the acrosomal membrane closely associated with the anterior region of the nuclear envelope. | Cellular component | acrosome inner membrane
SL-0005 | The lumen of the acrosome. | Cellular component | acrosome lumen
SL-0006 | The membrane of the acrosome. | Cellular component | acrosome membrane
SL-0447 | The portion of the acrosomal membrane just beneath the sperm cell membrane. | Cellular component | acrosome outer membrane
SL-0008 | The actin patch is a highly dynamic actin structure in fungi required primarily for endocytosis but possibly also coupled to exocytosis. Actin patches are highly motile, they first assemble at sites of polarized cell growth and then move slowly and nondirectionally along the cell cortex. | Cellular component | actin patch
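As a rough sketch of how this lookup could be loaded, the Python below builds an alias-to-ID dictionary. It assumes the downloadable file is tab-separated and uses the four column headers shown in the preview; `load_sl_lookup` is a hypothetical name. Aliases are lower-cased because the preview mixes capitalised and lower-case aliases.

```python
import csv
import io


def load_sl_lookup(handle):
    """Build a case-insensitive alias -> SL ID mapping.

    Hypothetical helper; assumes a tab-separated file with the
    'Subcellular location ID' and 'Alias' columns from the preview.
    """
    # QUOTE_NONE keeps quote characters in descriptions from confusing the parser.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {
        row["Alias"].strip().lower(): row["Subcellular location ID"].strip()
        for row in reader
    }


# Three short rows taken from the preview, joined with tabs for the demo.
SAMPLE = (
    "Subcellular location ID\tDescription\tCategory\tAlias\n"
    "SL-0003\tThe membrane of an acidocalcisome.\tCellular component\tAcidocalcisome membrane\n"
    "SL-0005\tThe lumen of the acrosome.\tCellular component\tacrosome lumen\n"
    "SL-0006\tThe membrane of the acrosome.\tCellular component\tacrosome membrane\n"
)

lookup = load_sl_lookup(io.StringIO(SAMPLE))
print(lookup["acidocalcisome membrane"])
# SL-0003
```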

What we need to do is map the alias to the ID and expose it as a new field named id (or something similar).

Several cases will require extra logic:

  1. Comma-separated locations. Example: Cytoplasm, cytoskeleton, microtubule organizing center, centrosome (ENSG00000008086|CDKL5). In these cases the annotation is really the last term, centrosome; the rest is a string representation of all its ancestors. My vote is to keep the last term, centrosome, as the location and map only that term to its ID.
  2. Semicolons. Example: Mitochondrion outer membrane; Single-pass type IV membrane protein; Cytoplasmic side (ENSG00000069535|MAOB). Here the terms refer to the 3 different Category classes an SL ontology annotation can belong to, in order: Cellular component, Topology and Orientation. We are only really interested in the Cellular component, so I would remove the other annotations when a semicolon is present. The first element of the semicolon-separated list can itself fall under case 1 (comma-separated).
  3. Square brackets (ENSG00000012048|BRCA1). Some labels, such as [Isoform 5]: Cytoplasm, specify the isoform. We want to keep this information in the location field. However, for the mapping, the isoform prefix needs to be ignored and only the Cytoplasm string used.
  4. HPA locations. To increase the value of the dataset, we could map HPA location annotations to the SL ontology as well. The full HPA list has 35 locations and will rarely change, so it is not a crazy effort. Between @ireneisdoomed and me, we can probably come up with a decent mapping. The current list of options:
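>>> from pyspark.sql import functions as F
>>> (
...     target
...     # One row per subcellular location; withColumn already names the column.
...     .withColumn("loc", F.explode("subcellularLocations"))
...     .select("id", "approvedSymbol", "loc.*")
...     .filter(F.col("source") == "HPA_main")
...     .select("source", "location")
...     .distinct()
...     .show(35, truncate=False)
... )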
+--------+-------------------------+
|source  |location                 |
+--------+-------------------------+
|HPA_main|Peroxisomes              |
|HPA_main|Nucleoli rim             |
|HPA_main|Cleavage furrow          |
|HPA_main|Focal adhesion sites     |
|HPA_main|Vesicles                 |
|HPA_main|Nucleoli                 |
|HPA_main|Midbody                  |
|HPA_main|Rods & Rings             |
|HPA_main|Plasma membrane          |
|HPA_main|Microtubules             |
|HPA_main|Nucleoplasm              |
|HPA_main|Midbody ring             |
|HPA_main|Cytoplasmic bodies       |
|HPA_main|Intermediate filaments   |
|HPA_main|Nuclear membrane         |
|HPA_main|Mitotic spindle          |
|HPA_main|Aggresome                |
|HPA_main|Endoplasmic reticulum    |
|HPA_main|Endosomes                |
|HPA_main|Mitochondria             |
|HPA_main|Centrosome               |
|HPA_main|Cytosol                  |
|HPA_main|Nuclear speckles         |
|HPA_main|Lysosomes                |
|HPA_main|Lipid droplets           |
|HPA_main|Golgi apparatus          |
|HPA_main|Nucleoli fibrillar center|
|HPA_main|Mitotic chromosome       |
|HPA_main|Actin filaments          |
|HPA_main|Kinetochore              |
|HPA_main|Centriolar satellite     |
|HPA_main|Cell Junctions           |
|HPA_main|Nuclear bodies           |
|HPA_main|Cytokinetic bridge       |
|HPA_main|Microtubule ends         |
+--------+-------------------------+
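Cases 1 to 3 above can be sketched as a small normalisation step applied to each UniProt location string before the lookup. This is only an illustration of the rules proposed in the list, not the pipeline's implementation. The names `mappable_term` and `sl_id` are hypothetical, and `lookup` is assumed to be a dictionary of lower-cased aliases to SL IDs.

```python
import re


def mappable_term(location):
    """Reduce a raw UniProt location string to the term used for the SL lookup.

    Hypothetical helper implementing cases 1 to 3 as described above.
    """
    # Case 3: ignore an isoform prefix such as "[Isoform 5]: ".
    term = re.sub(r"^\[[^\]]*\]:\s*", "", location.strip())
    # Case 2: keep only the first element (the Cellular component).
    term = term.split(";")[0]
    # Case 1: keep only the last, most specific term of the ancestor chain.
    term = term.split(",")[-1]
    return term.strip()


def sl_id(location, lookup):
    """Return the SL ID for a location, or None when the alias is unknown."""
    return lookup.get(mappable_term(location).lower())


print(mappable_term("Cytoplasm, cytoskeleton, microtubule organizing center, centrosome"))
# centrosome
print(mappable_term("Mitochondrion outer membrane; Single-pass type IV membrane protein; Cytoplasmic side"))
# Mitochondrion outer membrane
print(mappable_term("[Isoform 5]: Cytoplasm"))
# Cytoplasm
```

The isoform prefix is dropped only for the lookup key; the original string can still be stored in the location field, as suggested in case 3.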

The end game here is to be able to use a visualisation like this.

ireneisdoomed commented 3 years ago

Update on point 4

32 out of 35 HPA locations have been mapped to the SL ontology by @d0choa: https://docs.google.com/spreadsheets/d/1strBIqGn9pFIJlOU8CAw8PD8ihBziBVzcbjIbjWyEA4/edit?usp=sharing

A static file has been generated from the spreadsheet and uploaded to a new bucket:

gs://otar001-core/subcellularLocations/HPA_subcellular_locations_SL-2021-08-19.tsv

PIS will have to pick the latest file matching the filename pattern: HPA_subcellular_locations_SL-YYYY-MM-DD.tsv
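One possible way to select the latest file is to parse the date out of each filename, sketched below in Python. How PIS actually lists the bucket and picks files is not described here, so the function name `latest_mapping_file` and the idea of passing in a plain list of paths are assumptions. The second path in the demo is a made-up older file used only for illustration.

```python
import re
from datetime import date

# Matches the naming convention HPA_subcellular_locations_SL-YYYY-MM-DD.tsv
PATTERN = re.compile(r"HPA_subcellular_locations_SL-(\d{4})-(\d{2})-(\d{2})\.tsv$")


def latest_mapping_file(paths):
    """Return the path whose filename carries the most recent date, or None.

    Hypothetical helper; paths that do not match the pattern are ignored.
    """
    dated = []
    for path in paths:
        match = PATTERN.search(path)
        if match:
            year, month, day = map(int, match.groups())
            dated.append((date(year, month, day), path))
    return max(dated)[1] if dated else None


paths = [
    "gs://otar001-core/subcellularLocations/HPA_subcellular_locations_SL-2021-08-19.tsv",
    # Hypothetical older file, for illustration only.
    "gs://otar001-core/subcellularLocations/HPA_subcellular_locations_SL-2021-07-01.tsv",
]
print(latest_mapping_file(paths))
# gs://otar001-core/subcellularLocations/HPA_subcellular_locations_SL-2021-08-19.tsv
```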

Let me know if another format is preferred.