Closed d0choa closed 3 years ago
32 out of 35 HPA locations have been mapped to the SL ontology by @d0choa: https://docs.google.com/spreadsheets/d/1strBIqGn9pFIJlOU8CAw8PD8ihBziBVzcbjIbjWyEA4/edit?usp=sharing
HPA_location is the label from HPA.
termSL is the mapped SL ID that will be added to the subcellularLocations object. Null if no mapping is present.
labelSL is the label of the mapped SL ID. This term won't be used to replace HPA_location. Null if no mapping is present.

A static file has been generated from the spreadsheet and uploaded to a new bucket:
gs://otar001-core/subcellularLocations/HPA_subcellular_locations_SL-2021-08-19.tsv
PIS will have to pick the latest file with the filename: HPA_subcellular_locations_SL-YYYY-MM-DD.tsv
Let me know if another format is preferred.
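
Picking the latest file could be sketched roughly as below. This is only an illustration: it assumes the caller already has a list of object paths from the bucket (the listing itself is not shown), and `latest_mapping_file` is a hypothetical helper name, not an existing PIS function. Only the filename pattern comes from the text above.

```python
import re
from datetime import date
from typing import Iterable, Optional

# Filename convention from this issue: HPA_subcellular_locations_SL-YYYY-MM-DD.tsv
PATTERN = re.compile(r"HPA_subcellular_locations_SL-(\d{4})-(\d{2})-(\d{2})\.tsv$")


def latest_mapping_file(paths: Iterable[str]) -> Optional[str]:
    """Return the path whose filename carries the most recent date, or None.

    Hypothetical helper: paths that do not follow the naming convention
    are ignored rather than treated as errors.
    """
    dated = []
    for path in paths:
        match = PATTERN.search(path)
        if match is None:
            continue  # not a mapping file
        try:
            file_date = date(*map(int, match.groups()))
        except ValueError:
            continue  # digits present but not a valid calendar date
        dated.append((file_date, path))
    # Tuples sort by date first, so max() yields the newest file.
    return max(dated)[1] if dated else None
```

Parsing the date rather than sorting filenames as strings keeps the selection correct even if other files end up in the same folder.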
After investigation of the subcellular location data extracted from Uniprot and HPA, we observed it could be of additional value to extract the SL ontology identifiers available from Uniprot.
The subcellular locations are currently included in the target object, together with the location and the source. BRAF query response:

Each Uniprot location is also captured as an ontology term. For example, BRAF is annotated with:
We would like to have the SL identifiers as an extra field in the subcellularLocations objects.

The data, however, is not very friendly. The information is available through the Uniprot FE but absent from the text files. An example of the subcellular location in the txt file:
So the IDs are not really available in the Uniprot entries. Instead, a lookup of all locations is available at the following location (page | downloadable file).
A preview of the top rows looks like this:
What we would need to do is map the alias to the ID and display it as a new field named id (or something similar).

Several cases will require extra logic:
1. Cytoplasm, cytoskeleton, microtubule organizing center, centrosome (ENSG00000008086|CDKL5). In these cases the annotation is really the last term, centrosome. The rest is a string representation of all the ancestors. My vote would go to keeping the last term, centrosome, as the location and mapping only that term to the respective ID.
2. Mitochondrion outer membrane; Single-pass type IV membrane protein; Cytoplasmic side (ENSG00000069535|MAOB). In this case, the terms refer to the 3 different Category classes an SL ontology annotation can belong to. In order: Cellular component, Topology and Orientation. We are only really interested in the Cellular component, so I would remove the other annotations when a semicolon is present. The first element in the semicolon-separated list can also belong to case 1 (comma-separated).
3. [Isoform 5]: Cytoplasm. Annotations like this specify the isoform. We want to keep this information in the location field. However, to get it mapped, the isoform needs to be ignored and only the Cytoplasm string used.

The end game here is to be able to use a visualisation like this.
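
For reference, the three special cases could be handled by a small normalisation step before the lookup, sketched below. This is an illustration under assumptions, not the agreed implementation: `split_location` and `annotate` are hypothetical names, the alias lookup is assumed to be a plain dictionary keyed by lower-cased alias, and the IDs in the usage example are placeholders rather than real SL identifiers.

```python
import re
from typing import Dict, Optional, Tuple

# Optional isoform prefix such as "[Isoform 5]: " at the start of a location.
ISOFORM_PREFIX = re.compile(r"\[[^\]]+\]:\s*")


def split_location(raw: str) -> Tuple[str, str]:
    """Split a raw Uniprot location into (isoform prefix, term to map)."""
    raw = raw.strip()
    match = ISOFORM_PREFIX.match(raw)
    isoform = match.group(0) if match else ""   # case 3: set the isoform aside
    rest = raw[len(isoform):]
    component = rest.split(";")[0]              # case 2: keep the cellular component only
    term = component.split(",")[-1].strip()     # case 1: keep the last (most specific) term
    return isoform, term


def annotate(raw: str, alias_to_id: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Build a location entry with an extra id field (None when unmapped)."""
    isoform, term = split_location(raw)
    return {
        "location": f"{isoform}{term}",          # isoform stays visible in location
        "id": alias_to_id.get(term.lower()),     # assumed case-insensitive lookup
    }
```

With a lookup such as `{"centrosome": "PLACEHOLDER_ID"}`, calling `annotate("Cytoplasm, cytoskeleton, microtubule organizing center, centrosome", lookup)` would keep `centrosome` as the location and attach the placeholder ID. Matching on the lower-cased alias is a design assumption that would need checking against the actual lookup file.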