opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Search issue in platform development #2637

Closed d0choa closed 2 years ago

d0choa commented 2 years ago

When running the following query, the API returns no hits, even though total is non-zero:

query SearchPageQuery {
  search(queryString: "surgical") {
    total
    hits {
      id
      highlights
      object {
        ... on Target {
          id
        }
        ... on Disease {
          id
        }
        ... on Drug {
          id
        }
      }
    }
  }
}
{
  "data": {
    "search": {
      "total": 189,
      "hits": []
    }
  }
}
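
The symptom in this response, a non-zero total alongside an empty hits list, is easy to detect automatically. A minimal sketch in Python, assuming the response has already been decoded into a dictionary shaped like the one above and that the first page of results was requested; the function name is hypothetical:

```python
def has_dropped_hits(response: dict) -> bool:
    """Return True when the search reports matches but returns no hits."""
    search = response["data"]["search"]
    return search["total"] > 0 and len(search["hits"]) == 0


# The response shown above: 189 matches reported, none returned.
broken = {"data": {"search": {"total": 189, "hits": []}}}
print(has_dropped_hits(broken))
```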

This does not happen with other search terms:

query SearchPageQuery {
  search(queryString: "braf") {
    total
    hits {
      id
      highlights
      object {
        ... on Target {
          id
        }
        ... on Disease {
          id
        }
        ... on Drug {
          id
        }
      }
    }
  }
}

With my crystal ball, I suspect the problem lies with the medical procedures that were added to the disease index in this release. Something in the disease dataset, or alternatively in the search ETL, must be interfering with this.

JarrodBaker commented 2 years ago

The log output from that query is:

[error] m.ElasticRetriever - ((5)/_source/category,List(JsonValidationError(List(error.path.missing),List()))) | ((8)/_source/category,List(JsonValidationError(List(error.path.missing),List()))) 

This indicates a problem in SearchResults.scala when parsing the JSON coming from Elasticsearch. We expect every entry to have a non-null category field, which presumably isn't the case at present.
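
To illustrate the failure mode, here is a minimal Python sketch of a strict parser in which every hit must carry a category field and a single invalid hit discards the whole page. The all-or-nothing behaviour is an assumption inferred from the log line and the empty hits list, not the actual code in SearchResults.scala; all names and sample IDs are hypothetical.

```python
def parse_hits(hits: list[dict]) -> tuple[list[dict], list[str]]:
    """Parse Elasticsearch hits strictly; return (parsed, errors)."""
    parsed, errors = [], []
    for position, hit in enumerate(hits):
        source = hit.get("_source", {})
        if "category" not in source:
            # Same shape as the log entries: (index)/_source/category
            errors.append(f"({position})/_source/category: error.path.missing")
            continue
        parsed.append({"id": source.get("id"), "category": source["category"]})
    # Assumed all-or-nothing behaviour: one invalid hit discards the page.
    return ([], errors) if errors else (parsed, errors)


hits = [
    {"_source": {"id": "EXAMPLE_1", "category": ["some area"]}},
    {"_source": {"id": "EXAMPLE_2"}},  # no category field
]
parsed, errors = parse_hits(hits)
print(parsed, errors)
```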

The query

GET /search_drug,search_target,search_disease/_search
{
"query": {
    "bool": {
      "must_not": {
        "exists": {
          "field": "category"
        }
      }
    }
  }
}

indicates that there are 62 entries with no category field.

Of these:

search_disease = 55; these are all missing the field entirely.

search_target = 0.

search_drug = 7; these all have empty arrays.
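
Both kinds of entry match the must_not/exists clause because the Elasticsearch exists query treats a field whose value is null or an empty array as non-existent. A minimal Python sketch of that rule, using made-up documents; the helper name is hypothetical:

```python
from collections import Counter


def field_exists(source: dict, field: str) -> bool:
    """Mimic the basic rule of the `exists` query: missing, null and [] do not count."""
    value = source.get(field)
    return value is not None and value != []


docs = [
    ("search_disease", {"id": "d1"}),                          # field missing entirely
    ("search_drug", {"id": "c1", "category": []}),             # empty array
    ("search_target", {"id": "t1", "category": ["example"]}),  # populated
]
missing = Counter(index for index, src in docs if not field_exists(src, "category"))
print(dict(missing))
```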

Checking the disease index

GET /disease/_search
{
  "query": {
     "bool": {
      "must_not": {
        "exists": {
          "field": "therapeuticAreas"
        }
      }
    }
  }
}

shows that there are indeed 55 entries missing the therapeuticAreas field, the same number as the search_disease entries with no category.

The problem is presumably in the Disease step of the ETL.

d0choa commented 2 years ago

For context, this is the change that we have made that might have had unexpected consequences https://github.com/opentargets/issues/issues/2588

I didn't count them manually, but the number of offending disease terms is probably in that range.

d0choa commented 2 years ago

Voilà

>>> diseases.withColumn("TAs", F.explode_outer("therapeuticAreas")).filter(F.col("TAs").isNull()).select("name", "TAs").show(100, truncate = False)
+----------------------------------------------+----+
|name                                          |TAs |
+----------------------------------------------+----+
|medical procedure                             |null|
|heart valve prosthesis                        |null|
|hysterectomy                                  |null|
|percutaneous transluminal coronary angioplasty|null|
|coronary artery bypass                        |null|
|circumcision                                  |null|
|Xenograft                                     |null|
|pancreatectomy                                |null|
|dissection                                    |null|
|invasive mechanical ventilation               |null|
|cardioverter defibrillator                    |null|
|response to intravenous immunoglobulin therapy|null|
|Therapeutic Procedure                         |null|
|prophylactic surgery                          |null|
|ileostomy                                     |null|
|renal dialysis                                |null|
|Pharmacotherapy                               |null|
|contraception                                 |null|
|blood transfusion                             |null|
|blood vessel replacement                      |null|
|cardiac transplant                            |null|
|dentures                                      |null|
|sedation                                      |null|
|total knee arthroplasty                       |null|
|radical prostatectomy                         |null|
|cardiac ablation                              |null|
|total hip arthroplasty                        |null|
|ophthalmic procedure                          |null|
|digestive system surgery                      |null|
|continuous positive airway pressure           |null|
|mastectomy                                    |null|
|artificial cardiac pacemaker                  |null|
|appendectomy                                  |null|
|rehabilitation                                |null|
|liver transplant                              |null|
|bone marrow transplantation                   |null|
|amputation                                    |null|
|checkup                                       |null|
|cesarean section                              |null|
|cornea transplantation                        |null|
|follow-up                                     |null|
|bilateral oophorectomy                        |null|
|revision of total knee arthroplasty           |null|
|dependence on enabling machines and devices   |null|
|revision of total hip arthroplasty            |null|
|cochlear implant                              |null|
|revision of total joint arthroplasty          |null|
|pregnancy test                                |null|
|total joint arthroplasty                      |null|
|cadaver dissection                            |null|
|gastric bypass                                |null|
|orthopedic nursing                            |null|
|surgery on leg artery                         |null|
|lung transplantation                          |null|
|kidney transplant                             |null|
+----------------------------------------------+----+
d0choa commented 2 years ago

EFO input looks similar to other root terms (e.g. phenotype). So everything points to the ETL

❯ gsutil cat  gs://open-targets-pre-data-releases/22.06.1/input/ontology-inputs/diseases_efo.jsonl | jq 'select(.id == "EFO_0002571")'
{
  "id": "EFO_0002571",
  "parentIds": [],
  "name": "medical procedure"
}

❯ gsutil cat  gs://open-targets-pre-data-releases/22.06.1/input/ontology-inputs/diseases_efo.jsonl | jq 'select(.id == "EFO_0000651")'
{
  "id": "EFO_0000651",
  "parentIds": [],
  "name": "phenotype"
}
d0choa commented 2 years ago

For an unknown reason, medical procedure in the ETL output has "isTherapeuticArea": false when it should be true.

Full medical procedure record:

❯ gsutil cat 'gs://open-targets-pre-data-releases/22.06.1/output/etl/json/diseases/*.json' | jq 'select(.id == "EFO_0002571")'
{
  "id": "EFO_0002571",
  "code": "http://www.ebi.ac.uk/efo/EFO_0002571",
  "dbXRefs": [
    "NCIt:C25218",
    "SNOMEDCT:50731006",
    "ICD10:Z41",
    "NCIt:C79751"
  ],
  "description": "An activity that produces an effect, or that is intended to alter the course of a disease in a patient or population. This is a general term that encompasses the medical, social, behavioral, and environmental acts that can have preventive, therapeutic, or palliative effects.",
  "name": "medical procedure",
  "parents": [],
  "synonyms": {
    "hasExactSynonym": [
      "Procedure",
      "Intervention Strategies",
      "interventionDescription",
      "Interventional",
      "Intervention",
      "SURGICAL AND MEDICAL PROCEDURES",
      "Intervention or Procedure"
    ]
  },
  "ancestors": [],
  "descendants": [
    "EFO_0009577", "EFO_0010682", "EFO_0010722", "EFO_0600086", "EFO_0010721", "EFO_0010720",
    "EFO_0010726", "EFO_0002581", "EFO_0020684", "EFO_0003942", "EFO_0009728", "EFO_0003906",
    "EFO_0009807", "EFO_0009806", "EFO_0009729", "EFO_0010690", "EFO_0009520", "EFO_0010134",
    "EFO_0009643", "EFO_0009642", "EFO_0005244", "EFO_0003856", "EFO_0020973", "EFO_0020974",
    "EFO_0003776", "EFO_0020972", "EFO_0009517", "EFO_0020975", "EFO_0009717", "EFO_0010065",
    "EFO_0009719", "EFO_0010064", "EFO_0020979", "EFO_0010063", "EFO_0009632", "EFO_0009636",
    "EFO_0020981", "EFO_0020988", "EFO_0020989", "EFO_0009868", "EFO_0020987", "EFO_0010674",
    "EFO_0010673", "EFO_0010672", "EFO_0010078", "EFO_0010676", "EFO_0009580", "EFO_0003881",
    "EFO_0010719", "EFO_0009581", "EFO_0600009", "EFO_0003953", "EFO_0003951", "EFO_0010681"
  ],
  "children": [
    "EFO_0002581", "EFO_0003776", "EFO_0003856", "EFO_0003881", "EFO_0003906", "EFO_0003942",
    "EFO_0003951", "EFO_0003953", "EFO_0009517", "EFO_0009520", "EFO_0009577", "EFO_0009580",
    "EFO_0009581", "EFO_0009632", "EFO_0009636", "EFO_0009642", "EFO_0009643", "EFO_0009717",
    "EFO_0009719", "EFO_0009728", "EFO_0009729", "EFO_0009806", "EFO_0009868", "EFO_0010063",
    "EFO_0010078", "EFO_0010134", "EFO_0010672", "EFO_0010673", "EFO_0010674", "EFO_0010676",
    "EFO_0010682", "EFO_0010690", "EFO_0010719", "EFO_0010720", "EFO_0010721", "EFO_0010722",
    "EFO_0010726", "EFO_0020684", "EFO_0020975", "EFO_0020979", "EFO_0020981", "EFO_0020987",
    "EFO_0020988", "EFO_0020989", "EFO_0600009"
  ],
  "therapeuticAreas": [],
  "ontology": {
    "isTherapeuticArea": false,
    "leaf": false,
    "sources": {
      "url": "http://www.ebi.ac.uk/efo/EFO_0002571",
      "name": "EFO_0002571"
    }
  }
}
JarrodBaker commented 2 years ago

The disease step is a little convoluted as it still has much of the logic in PIS, rather than having everything in the ETL. The EFO owl file is downloaded and converted to JSON using Riot as specified in the configuration variable disease.etl.efo.owl_jq.

The results of this conversion are saved under staging/ontology-inputs/efo_otar_slim.json.

This file is further manipulated to produce ontology-efo-v3.42.0.jsonl, which is used as the input to the ETL. It is here that the value of isTherapeuticArea is set. The code checks whether a disease includes the field oboInOwl:inSubset, and if so sets the isTherapeuticArea flag. The schema of the output file is:

root
 |-- code: string (nullable = true)
 |-- dbXRefs: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- definition: string (nullable = true)
 |-- definition_alternatives: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- id: string (nullable = true)
 |-- isTherapeuticArea: boolean (nullable = true)
 |-- label: string (nullable = true)
 |-- locationIds: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- obsoleteTerms: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- parents: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- sko: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- synonyms: struct (nullable = true)
 |    |-- hasBroadSynonym: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- hasExactSynonym: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- hasNarrowSynonym: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- hasRelatedSynonym: array (nullable = true)
 |    |    |-- element: string (containsNull = true)

Looking at the entry for EFO_0009517:


    <!-- http://www.ebi.ac.uk/efo/EFO_0009517 -->

    <Class rdf:about="http://www.ebi.ac.uk/efo/EFO_0009517">
        <rdfs:subClassOf rdf:resource="http://www.ebi.ac.uk/efo/EFO_0002571"/>
        <obo:IAO_0000115>A general examination or inspection, especially one carried out by a doctor or dentist. [ NCI ]</obo:IAO_0000115>
        <oboInOwl:hasDbXref>ICD10:Z00</oboInOwl:hasDbXref>
        <oboInOwl:hasDbXref>ICD10:Z01</oboInOwl:hasDbXref>
        <oboInOwl:hasDbXref>ICD10:Z03</oboInOwl:hasDbXref>
        <oboInOwl:hasDbXref>ICD10:Z04</oboInOwl:hasDbXref>
        <oboInOwl:hasDbXref>ICD10:Z10</oboInOwl:hasDbXref>
        <oboInOwl:hasDbXref>ICD10:Z11</oboInOwl:hasDbXref>
        <oboInOwl:hasDbXref>ICD10:Z12</oboInOwl:hasDbXref>
        <oboInOwl:hasDbXref>ICD10:Z13</oboInOwl:hasDbXref>
        <oboInOwl:hasDbXref>NCIt:C41383</oboInOwl:hasDbXref>
        <oboInOwl:hasExactSynonym>check up</oboInOwl:hasExactSynonym>
        <oboInOwl:hasExactSynonym>check-up</oboInOwl:hasExactSynonym>
        <oboInOwl:hasExactSynonym>medical examination</oboInOwl:hasExactSynonym>
        <rdfs:label>checkup</rdfs:label>
    </Class>

There is no field <oboInOwl:inSubset> so PIS marks the entry as isTherapeuticArea = false.
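
The rule described above can be sketched as follows. This is an illustration only, not the PIS code: it assumes each EFO term has been converted to a dictionary keyed by the OWL property names, and the function name and the subset value are hypothetical.

```python
def is_therapeutic_area(term: dict) -> bool:
    """Flag a term as a therapeutic area only if it carries oboInOwl:inSubset."""
    return "oboInOwl:inSubset" in term


# Trimmed-down version of the EFO_0009517 entry shown above (no inSubset).
checkup = {
    "rdfs:label": "checkup",
    "oboInOwl:hasExactSynonym": ["check up", "check-up", "medical examination"],
}
# Hypothetical term that does carry the annotation.
annotated = {"rdfs:label": "example area", "oboInOwl:inSubset": "example_subset"}

print(is_therapeutic_area(checkup), is_therapeutic_area(annotated))
```

Under this rule, any root term without the annotation ends up with isTherapeuticArea = false, regardless of where it sits in the hierarchy.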

d0choa commented 2 years ago

The consequence of the above is that medical procedure is not exposed to the ETL as a therapeutic area, because PIS finds no <oboInOwl:inSubset> tag on it:

❯ gsutil cat gs://open-targets-pre-data-releases/22.06.1/input/ontology-inputs/ontology-efo-v3.42.0.jsonl | jq 'select(.id == "EFO_0002571")'
{
  "id": "EFO_0002571",
  "code": "http://www.ebi.ac.uk/efo/EFO_0002571",
  "label": "medical procedure",
  "definition": "An activity that produces an effect, or that is intended to alter the course of a disease in a patient or population. This is a general term that encompasses the medical, social, behavioral, and environmental acts that can have preventive, therapeutic, or palliative effects.",
  "isTherapeuticArea": false,
  "synonyms": {
    "hasExactSynonym": [
      "Procedure",
      "Intervention Strategies",
      "interventionDescription",
      "Interventional",
      "Intervention",
      "SURGICAL AND MEDICAL PROCEDURES",
      "Intervention or Procedure"
    ]
  },
  "dbXRefs": [
    "NCIt:C25218",
    "SNOMEDCT:50731006",
    "ICD10:Z41",
    "NCIt:C79751"
  ],
  "parents": []
}
JarrodBaker commented 2 years ago

The 'patch' for this release will be to update the file created by PIS with the following step:

jq -c '( select(.id == "EFO_0002571") ).isTherapeuticArea |= true' ontology-efo-v3.42.0.jsonl > ontology-efo-200622.jsonl
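
For reference, the same patch can be sketched in Python. This is meant as an equivalent of the jq one-liner, not something that was run for the release; the function name is hypothetical, the sample records are trimmed down, and reading and writing the files is left to the caller.

```python
import json


def patch_therapeutic_area(lines, term_id="EFO_0002571"):
    """Yield compact JSON lines; only term_id gets isTherapeuticArea = true."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        if record.get("id") == term_id:
            record["isTherapeuticArea"] = True
        yield json.dumps(record, separators=(",", ":"), ensure_ascii=False)


sample = [
    '{"id": "EFO_0002571", "isTherapeuticArea": false, "parents": []}',
    '{"id": "EFO_0009517", "isTherapeuticArea": false}',
]
patched = [json.loads(line) for line in patch_therapeutic_area(sample)]
print(patched)
```

Every other record passes through with its values unchanged, which matches the behaviour of the jq update on records that the select does not match.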

Running the ETL locally with this updated file shows that no diseases are left without therapeutic areas:

df.filter('therapeuticAreas.isNull).count 
res6: Long = 0L
d0choa commented 2 years ago

Long-term solution to be addressed in https://github.com/EBISPOT/efo/issues/1636

JarrodBaker commented 2 years ago

The patched ontology file worked as expected when processed, and the bug is resolved. The output from David's original query:

{
  "data": {
    "search": {
      "total": 244,
      "hits": [
        {
          "id": "EFO_0009951",
          "highlights": [
            "response to <em>surgical</em> intervention",
            "activity of a cell or an organism as a result of <em>surgical</em> intervention."
          ],
          "object": {
            "id": "EFO_0009951"
          }
        },
        {
          "id": "ENSG00000271949",
          "highlights": [
            "<em>SURGICAL</em> AND MEDICAL PROCEDURES"
          ],
          "object": {
            "id": "ENSG00000271949"
          }
        },
...

We'll have to monitor this for the next release to make sure that either SPOT has a fix in place, or we run the patching process again.
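
One way to monitor this is a check on the ETL disease output that lists records whose therapeuticAreas is missing, null or empty. A minimal sketch, assuming the JSON-lines layout of the disease records shown earlier in this thread; the function name is hypothetical, and treating an empty list as a failure is an assumption:

```python
import json


def diseases_without_tas(lines):
    """Return (id, name) for records with missing, null or empty therapeuticAreas."""
    offenders = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        if not record.get("therapeuticAreas"):
            offenders.append((record.get("id"), record.get("name")))
    return offenders


# Made-up sample: two offending records and one healthy record.
sample = [
    '{"id": "EFO_0002571", "name": "medical procedure", "therapeuticAreas": []}',
    '{"id": "EFO_0009517", "name": "checkup"}',
    '{"id": "EXAMPLE_1", "name": "example", "therapeuticAreas": ["EXAMPLE_TA"]}',
]
print(diseases_without_tas(sample))
```

A non-empty result on a new release would suggest that the patch needs to be applied again.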