opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Add xrefs to disease/phenotype profile page #1371

Closed d0choa closed 3 years ago

d0choa commented 3 years ago

Following on from the great work on drug references, we have also included disease references in the API (example query).

These are all the prefixes that the API can currently return:

>>> disease.select(explode("dbXRefs").alias("dbXRefs")).withColumn("dbXRefs", split("dbXRefs", ":").getItem(0)).distinct().show(100, truncate = False)
+----------------------+
|dbXRefs               |
+----------------------+
|OMIMPS                |
|PMID                  |
|NCIm                  |
|SNOMEDCT_2010_1_31    |
|UMLS                  |
|OAE                   |
|MFOMD                 |
|MedlinePlus           |
|NCIT                  |
|COHD                  |
|Orphanet              |
|ONCOTREE              |
|MEDDRA                |
|ORDO                  |
|DC                    |
|GTR                   |
|MTH                   |
|PERSON                |
|ICD9                  |
|GO                    |
|EV                    |
|UMLS_CUI              |
|MeSH                  |
|NCI                   |
|HP                    |
|SNOMEDCT_US           |
|SCTID_2010_1_31       |
|ATC_code              |
|NCI_Thesaurus         |
|OMIM                  |
|http                  |
|KEGG                  |
|ISBN-13               |
|ICD9CM                |
|MO                    |
|Wikipedia             |
|ICD10                 |
|ICD10CM               |
|SYMP                  |
|UniProt               |
|REACTOME              |
|Reactome              |
|NDFRT                 |
|SCDO                  |
|EFO                   |
|MONDO                 |
|NCiT                  |
|MedGen                |
|CSP                   |
|OMIT                  |
|SNOMEDCT              |
|SNOMED                |
|Wikidata              |
|MEDGEN                |
|url                   |
|NIFSTD                |
|NCIt                  |
|ISBN                  |
|ICD11                 |
|DOID                  |
|MSH                   |
|ISBN-10               |
|OBI                   |
|MedDRA                |
|MetaCyc               |
|ICD-10                |
|DERMO                 |
|modelled on http      |
|SCTID                 |
|MESH                  |
|CMO                   |
|IDO                   |
|EPCC                  |
|Fyler                 |
|GARD                  |
|MP                    |
|HGNC                  |
|ICDO                  |
|SNOMEDCT_US_2018_03_01|
|MeDRA                 |
|RESID                 |
|https                 |
+----------------------+

Diseases with the largest number of distinct prefixes:

>>> disease.select("id", explode("dbXRefs").alias("dbXRefs")).withColumn("dbXRefs", split("dbXRefs", ":").getItem(0)).distinct().groupBy("id").count().sort(col("count").desc()).show(10)
+-----------+-----+
|         id|count|
+-----------+-----+
|EFO_0000253|   20|
|EFO_0000221|   19|
|EFO_0000407|   18|
|EFO_0000574|   18|
|EFO_0000309|   18|
|EFO_0001376|   18|
|EFO_0000248|   18|
|EFO_0000333|   18|
|EFO_0000479|   18|
|EFO_0000538|   18|
+-----------+-----+
only showing top 10 rows

Happy to help prioritise prefixes.

d0choa commented 3 years ago

A shortlist of potentially interesting ones: MONDO, MeSH, NCIt, MEDDRA, UMLS, Orphanet

andrewhercules commented 3 years ago

The GraphQL API now provides various database IDs for diseases/phenotypes, and we can use these IDs to construct new cross-reference links. This will benefit users who are familiar with other identifiers, and it will also help our search engine optimisation and our domain and page authority rankings.

For example, the GraphQL API returns the following data based on a query for rheumatoid arthritis:

{
  "data": {
    "disease": {
      "dbXRefs": [
        "SNOMEDCT:69896004",
        "NCIT:C2884",
        "ICD10:M05",
        "KEGG:05323",
        "ICD10:M06.9",
        "MSH:D001172",
        "MONDO:0008383",
        "COHD:80809",
        "ICD9:714.0",
        "NCIt:C2884",
        "UMLS:C0003873",
        "ICD10:M06",
        "OMIM:180300",
        "SCTID:69896004",
        "OMIM:604302",
        "MESH:D001172",
        "DOID:7148"
      ]
    }
  }
}

As mentioned by @d0choa, we will display cross-reference links to MONDO, MeSH, NCIt, MEDDRA, UMLS, Orphanet, ICD10, and OMIM.

The layout will be the same as for the drug profile page cross-references noted in #1356: the name of the database, followed by the ID, which acts as a link to the database. In each dbXRefs entry, the database name is the text before the colon (:) and the ID is the text after it.
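
A minimal sketch of that split, written in Python to match the queries earlier in this thread. The function name `split_xref` is hypothetical and not part of the Platform codebase. It splits on the first colon only, which is an assumption on top of the spec: the `http`, `https` and `url` prefixes in the listing above suggest that some values contain further colons.

```python
from typing import Tuple


def split_xref(xref: str) -> Tuple[str, str]:
    """Split a dbXRefs entry such as "MONDO:0008383" into (source, xRefId)."""
    # str.partition splits on the first colon only, so any later colons
    # stay inside the ID part.
    source, sep, xref_id = xref.partition(":")
    if not sep:
        # Assumption: an entry without a colon is treated as malformed.
        raise ValueError(f"no colon in cross-reference: {xref!r}")
    return source, xref_id


print(split_xref("MONDO:0008383"))
print(split_xref("ICD10:M06.9"))
```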

Using the table below, please implement cross-reference links on the disease/phenotype profile page.

Note: for the purposes of the spec, the ID is noted as xRefId.

| Source (from API response) | URL structure | Example |
|---|---|---|
| MONDO | http://purl.obolibrary.org/obo/MONDO_ + xRefId | MONDO: 0008383 |
| MeSH | https://identifiers.org/mesh: + xRefId | MeSH: D001172 |
| NCIt | https://identifiers.org/ncit: + xRefId | NCIt: C2884 |
| MedDRA | https://identifiers.org/meddra: + xRefId | MedDRA: 10002026 |
| UMLS | https://identifiers.org/umls: + xRefId | UMLS: C0021390 |
| Orphanet | https://identifiers.org/orphanet: + xRefId | Orphanet: 85163 |
| ICD10 | https://identifiers.org/icd: + xRefId | ICD10: I42.1 |

Please note that not every disease will have every cross-reference (e.g. rheumatoid arthritis does not have an Orphanet entry).

d0choa commented 3 years ago

The most frequent resources across diseases, after normalising the prefixes to lowercase:

>>> disease.select("id", F.explode("dbXRefs").alias("dbXRefs")).withColumn("dbXRefs", F.lower(F.split("dbXRefs", ":").getItem(0))).distinct().groupBy("dbXRefs").count().sort(F.col("count").desc()).show(50)
+-----------+-----+
|    dbXRefs|count|
+-----------+-----+
|       umls|10282|
|      mondo| 8145|
|      icd10| 7429|
|      sctid| 6463|
|       mesh| 5997|
|       ncit| 5486|
|       doid| 5435|
|       omim| 5225|
|       gard| 3817|
|       icd9| 2841|
|     meddra| 2611|
|   orphanet| 1606|
|        efo| 1482|
|       pmid| 1181|
|   snomedct| 1059|
|       cohd|  985|
|snomedct_us|  724|
|  wikipedia|  702|
|        fma|  514|
|      emapa|  502|
|       icdo|  482|
|     omimps|  474|
|        msh|  473|
|        zfa|  467|
|         ma|  434|
|         hp|  427|
|   oncotree|  425|
|        bto|  395|
|        tao|  339|
|       vhog|  309|
|       gaid|  280|
|     caloha|  276|
|     ehdaa2|  239|
|        aao|  230|
|    opencyc|  225|
|        xao|  216|
|        mat|  205|
|         ev|  197|
|       fbbt|  181|
|       miaa|  171|
|      ehdaa|  170|
|      galen|  168|
|         dc|  138|
|       http|  118|
|      https|   87|
|    birnlex|   82|
|       bams|   77|
|       dhba|   67|
|       ordo|   54|
|        hba|   45|
+-----------+-----+
only showing top 50 rows

andrewhercules commented 3 years ago

As noted by @d0choa, there is data duplication caused by differences in spelling and capitalisation. For example, the following response contains two NCIt entries that have the same ID — one "NCIT", the other "NCIt".

{
  "data": {
    "disease": {
      "dbXRefs": [
        "SNOMEDCT:69896004",
        "NCIT:C2884",
        "ICD10:M05",
        "KEGG:05323",
        "ICD10:M06.9",
        "MSH:D001172",
        "MONDO:0008383",
        "COHD:80809",
        "ICD9:714.0",
        "NCIt:C2884",
        "UMLS:C0003873",
        "ICD10:M06",
        "OMIM:180300",
        "SCTID:69896004",
        "OMIM:604302",
        "MESH:D001172",
        "DOID:7148"
      ]
    }
  }
}
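
To illustrate the duplication, here is a small sketch in plain Python that groups a handful of raw prefixes from the listing earlier in this thread by their lowercase form. The variable names are illustrative only, and the prefix list is a hand-picked subset, not the full API output.

```python
from collections import defaultdict

# A few raw prefixes copied from the prefix listing earlier in this thread.
raw_prefixes = [
    "NCIT", "NCIt", "NCiT",
    "MeSH", "MESH",
    "MedDRA", "MEDDRA",
    "MONDO", "UMLS",
]

# Group every raw spelling under its lowercase form.
variants = defaultdict(set)
for prefix in raw_prefixes:
    variants[prefix.lower()].add(prefix)

# Keep only the normalised prefixes that appear with more than one spelling.
duplicated = {key: sorted(names) for key, names in variants.items() if len(names) > 1}
for key, names in sorted(duplicated.items()):
    print(key, names)
```

Lowercasing only merges differences in capitalisation. Prefixes that are spelled differently, such as MSH versus MESH, remain separate after this step.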

Before constructing the cross-reference links, can we please take the source value — the content before the colon : — and normalise by transforming to lowercase? Then, we can take the first instance where the normalised source string is one of "mondo", "mesh", "ncit", "meddra", "umls", "orphanet", "icd10", or "omim", and use the ID in the web interface and to construct the relevant link.

| Source | Normalised source string | URL structure | Example |
|---|---|---|---|
| MONDO | mondo | http://purl.obolibrary.org/obo/MONDO_ + xRefId | MONDO: 0008383 |
| MeSH | mesh | https://identifiers.org/mesh: + xRefId | MeSH: D001172 |
| NCIt | ncit | https://identifiers.org/ncit: + xRefId | NCIt: C2884 |
| MedDRA | meddra | https://identifiers.org/meddra: + xRefId | MedDRA: 10002026 |
| UMLS | umls | https://identifiers.org/umls: + xRefId | UMLS: C0021390 |
| Orphanet | orphanet | https://identifiers.org/orphanet: + xRefId | Orphanet: 85163 |
| ICD10 | icd10 | https://identifiers.org/icd: + xRefId | ICD10: I42.1 |
| OMIM | omim | https://www.omim.org/entry/ + xRefId | OMIM: 180300 |
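
The steps above could be sketched as follows, in Python for consistency with the queries in this thread rather than as the actual web app code. `XREF_SOURCES` and `build_xref_links` are hypothetical names. The sketch reads "first instance" as the first entry per normalised source, which is one interpretation of the request and may need confirming.

```python
from typing import Dict, List

# Normalised source -> (display name, URL prefix), copied from the table above.
XREF_SOURCES = {
    "mondo": ("MONDO", "http://purl.obolibrary.org/obo/MONDO_"),
    "mesh": ("MeSH", "https://identifiers.org/mesh:"),
    "ncit": ("NCIt", "https://identifiers.org/ncit:"),
    "meddra": ("MedDRA", "https://identifiers.org/meddra:"),
    "umls": ("UMLS", "https://identifiers.org/umls:"),
    "orphanet": ("Orphanet", "https://identifiers.org/orphanet:"),
    "icd10": ("ICD10", "https://identifiers.org/icd:"),
    "omim": ("OMIM", "https://www.omim.org/entry/"),
}


def build_xref_links(db_xrefs: List[str]) -> Dict[str, Dict[str, str]]:
    """Return one link per supported source, keyed by normalised source."""
    links: Dict[str, Dict[str, str]] = {}
    for xref in db_xrefs:
        source, sep, xref_id = xref.partition(":")
        key = source.lower()  # normalise the source to lowercase
        # Skip malformed entries, unsupported sources, and later duplicates
        # (assumption: only the first entry per source is displayed).
        if not sep or key not in XREF_SOURCES or key in links:
            continue
        name, url_prefix = XREF_SOURCES[key]
        links[key] = {"name": name, "id": xref_id, "url": url_prefix + xref_id}
    return links


# dbXRefs for rheumatoid arthritis, as returned in the response above.
rheumatoid_arthritis = [
    "SNOMEDCT:69896004", "NCIT:C2884", "ICD10:M05", "KEGG:05323",
    "ICD10:M06.9", "MSH:D001172", "MONDO:0008383", "COHD:80809",
    "ICD9:714.0", "NCIt:C2884", "UMLS:C0003873", "ICD10:M06",
    "OMIM:180300", "SCTID:69896004", "OMIM:604302", "MESH:D001172",
    "DOID:7148",
]

links = build_xref_links(rheumatoid_arthritis)
for link in links.values():
    print(f'{link["name"]}: {link["id"]} -> {link["url"]}')
```

Under this reading, only the first ICD10 entry (M05) and the first OMIM entry (180300) are kept, and the duplicate NCIt entry is dropped. MSH:D001172 is ignored because "msh" is not in the list of normalised sources; the MeSH link comes from the later MESH:D001172 entry.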