Closed by d0choa 2 years ago
The log output from that query is:
[error] m.ElasticRetriever - ((5)/_source/category,List(JsonValidationError(List(error.path.missing),List()))) | ((8)/_source/category,List(JsonValidationError(List(error.path.missing),List())))
This indicates a problem in SearchResults.scala when parsing the JSON coming from Elasticsearch. We expect all entries to have a non-null category field, which presumably isn't the case at present.
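The error paths in the log, such as (5)/_source/category, name the position of the hit and the path that could not be read. As a rough illustration only, the same check can be sketched in Python over hits shaped like a plain Elasticsearch response. The function name and the sample hits are hypothetical; the real parsing happens in Scala in SearchResults.scala.

```python
def missing_category_paths(hits):
    """Return one '(i)/_source/category' path per hit that lacks a category key."""
    paths = []
    for i, hit in enumerate(hits):
        source = hit.get("_source", {})
        if "category" not in source:
            paths.append(f"({i})/_source/category")
    return paths


# Hypothetical hits: the second one has no category field at all.
hits = [
    {"_id": "a", "_source": {"name": "some disease", "category": ["some area"]}},
    {"_id": "b", "_source": {"name": "medical procedure"}},
]
print(missing_category_paths(hits))
```

A non-empty result here corresponds to the kind of hit that the API fails to parse.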
The following query looks for entries with no category field:
GET /search_drug,search_target,search_disease/_search
{
"query": {
"bool": {
"must_not": {
"exists": {
"field": "category"
}
}
}
}
}
It indicates that there are 62 entries with no category field.
Of these:
search_disease = 55, all missing the field entirely.
search_target = 0.
search_drug = 7, all with empty arrays.
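Both cases are counted together because the Elasticsearch exists query treats a field as non-existent when its source value is null or an empty array. A small sketch for telling the two cases apart once the documents are in hand; the function name and the sample documents are made up for illustration.

```python
from collections import Counter


def classify_category(doc):
    """Classify a document's category field as 'missing', 'empty' or 'present'."""
    if "category" not in doc or doc["category"] is None:
        return "missing"
    if doc["category"] == []:
        return "empty"
    return "present"


# Hypothetical documents covering the three cases.
docs = [
    {"id": "doc_without_field", "name": "medical procedure"},
    {"id": "doc_with_empty_array", "category": []},
    {"id": "doc_with_value", "category": ["some area"]},
]
counts = Counter(classify_category(d) for d in docs)
print(dict(counts))
```

Only the "missing" case produces a path-missing error in the API; an empty array is still a readable value.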
Checking the disease index
GET /disease/_search
{
"query": {
"bool": {
"must_not": {
"exists": {
"field": "therapeuticAreas"
}
}
}
}
}
Shows that there are indeed 55 entries missing the field in question.
The problem is presumably in the Disease step of the ETL.
For context, this is the change we made that might have had unexpected consequences: https://github.com/opentargets/issues/issues/2588
I didn't count them manually, but the number is probably in the range of the offending disease terms.
Voilà
>>> diseases.withColumn("TAs", F.explode_outer("therapeuticAreas")).filter(F.col("TAs").isNull()).select("name", "TAs").show(100, truncate = False)
+----------------------------------------------+----+
|name |TAs |
+----------------------------------------------+----+
|medical procedure |null|
|heart valve prosthesis |null|
|hysterectomy |null|
|percutaneous transluminal coronary angioplasty|null|
|coronary artery bypass |null|
|circumcision |null|
|Xenograft |null|
|pancreatectomy |null|
|dissection |null|
|invasive mechanical ventilation |null|
|cardioverter defibrillator |null|
|response to intravenous immunoglobulin therapy|null|
|Therapeutic Procedure |null|
|prophylactic surgery |null|
|ileostomy |null|
|renal dialysis |null|
|Pharmacotherapy |null|
|contraception |null|
|blood transfusion |null|
|blood vessel replacement |null|
|cardiac transplant |null|
|dentures |null|
|sedation |null|
|total knee arthroplasty |null|
|radical prostatectomy |null|
|cardiac ablation |null|
|total hip arthroplasty |null|
|ophthalmic procedure |null|
|digestive system surgery |null|
|continuous positive airway pressure |null|
|mastectomy |null|
|artificial cardiac pacemaker |null|
|appendectomy |null|
|rehabilitation |null|
|liver transplant |null|
|bone marrow transplantation |null|
|amputation |null|
|checkup |null|
|cesarean section |null|
|cornea transplantation |null|
|follow-up |null|
|bilateral oophorectomy |null|
|revision of total knee arthroplasty |null|
|dependence on enabling machines and devices |null|
|revision of total hip arthroplasty |null|
|cochlear implant |null|
|revision of total joint arthroplasty |null|
|pregnancy test |null|
|total joint arthroplasty |null|
|cadaver dissection |null|
|gastric bypass |null|
|orthopedic nursing |null|
|surgery on leg artery |null|
|lung transplantation |null|
|kidney transplant |null|
+----------------------------------------------+----+
The EFO input looks similar to that of other root terms (e.g. phenotype), so everything points to the ETL:
❯ gsutil cat gs://open-targets-pre-data-releases/22.06.1/input/ontology-inputs/diseases_efo.jsonl | jq 'select(.id == "EFO_0002571")'
{
"id": "EFO_0002571",
"parentIds": [],
"name": "medical procedure"
}
❯ gsutil cat gs://open-targets-pre-data-releases/22.06.1/input/ontology-inputs/diseases_efo.jsonl | jq 'select(.id == "EFO_0000651")'
{
"id": "EFO_0000651",
"parentIds": [],
"name": "phenotype"
}
For an unknown reason, medical procedure in the ETL outputs has "isTherapeuticArea": false when it should be true.
The disease step is a little convoluted, as much of its logic still lives in PIS rather than in the ETL. The EFO owl file is downloaded and converted to JSON using Riot, as specified in the configuration variable disease.etl.efo.owl_jq.
The results of this conversion are saved under staging/ontology-inputs/efo_otar_slim.json.
This file is further manipulated to produce ontology-efo-v3.42.0.jsonl, which is used as the input to the ETL. It is here that the value of isTherapeuticArea is set. The code checks whether a disease includes the field oboInOwl:inSubset, and if so sets the isTherapeuticArea flag. The schema of the output file is:
root
|-- code: string (nullable = true)
|-- dbXRefs: array (nullable = true)
| |-- element: string (containsNull = true)
|-- definition: string (nullable = true)
|-- definition_alternatives: array (nullable = true)
| |-- element: string (containsNull = true)
|-- id: string (nullable = true)
|-- isTherapeuticArea: boolean (nullable = true)
|-- label: string (nullable = true)
|-- locationIds: array (nullable = true)
| |-- element: string (containsNull = true)
|-- obsoleteTerms: array (nullable = true)
| |-- element: string (containsNull = true)
|-- parents: array (nullable = true)
| |-- element: string (containsNull = true)
|-- sko: array (nullable = true)
| |-- element: string (containsNull = true)
|-- synonyms: struct (nullable = true)
| |-- hasBroadSynonym: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- hasExactSynonym: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- hasNarrowSynonym: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- hasRelatedSynonym: array (nullable = true)
| | |-- element: string (containsNull = true)
Looking at the entry for EFO_0009517:
<!-- http://www.ebi.ac.uk/efo/EFO_0009517 -->
<Class rdf:about="http://www.ebi.ac.uk/efo/EFO_0009517">
<rdfs:subClassOf rdf:resource="http://www.ebi.ac.uk/efo/EFO_0002571"/>
<obo:IAO_0000115>A general examination or inspection, especially one carried out by a doctor or dentist. [ NCI ]</obo:IAO_0000115>
<oboInOwl:hasDbXref>ICD10:Z00</oboInOwl:hasDbXref>
<oboInOwl:hasDbXref>ICD10:Z01</oboInOwl:hasDbXref>
<oboInOwl:hasDbXref>ICD10:Z03</oboInOwl:hasDbXref>
<oboInOwl:hasDbXref>ICD10:Z04</oboInOwl:hasDbXref>
<oboInOwl:hasDbXref>ICD10:Z10</oboInOwl:hasDbXref>
<oboInOwl:hasDbXref>ICD10:Z11</oboInOwl:hasDbXref>
<oboInOwl:hasDbXref>ICD10:Z12</oboInOwl:hasDbXref>
<oboInOwl:hasDbXref>ICD10:Z13</oboInOwl:hasDbXref>
<oboInOwl:hasDbXref>NCIt:C41383</oboInOwl:hasDbXref>
<oboInOwl:hasExactSynonym>check up</oboInOwl:hasExactSynonym>
<oboInOwl:hasExactSynonym>check-up</oboInOwl:hasExactSynonym>
<oboInOwl:hasExactSynonym>medical examination</oboInOwl:hasExactSynonym>
<rdfs:label>checkup</rdfs:label>
</Class>
There is no <oboInOwl:inSubset> field, so PIS marks the entry as isTherapeuticArea = false.
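The rule described above can be sketched as follows. This is not the PIS code, only an illustration of the behaviour: the key name used in the converted JSON, the subset value, the first sample term and both function names are assumptions.

```python
def is_therapeutic_area(term):
    """True only when the converted term includes an oboInOwl:inSubset field."""
    return "oboInOwl:inSubset" in term


def to_etl_record(term):
    """Build a reduced ETL input record from a converted ontology term."""
    return {
        "id": term["id"],
        "label": term.get("label"),
        "isTherapeuticArea": is_therapeutic_area(term),
    }


# Hypothetical converted terms: only the first carries a subset annotation.
tagged = {"id": "TERM_A", "label": "tagged term", "oboInOwl:inSubset": "some_subset"}
untagged = {"id": "EFO_0009517", "label": "checkup"}

print(to_etl_record(tagged)["isTherapeuticArea"])
print(to_etl_record(untagged)["isTherapeuticArea"])
```

Under this rule any term without the annotation ends up with the flag set to false, whether or not it is meant to be a therapeutic area.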
The consequence of the above is that medical procedure is not exposed to the ETL as a therapeutic area, because PIS finds no <oboInOwl:inSubset> tag on it:
(base) base ❯ gsutil cat gs://open-targets-pre-data-releases/22.06.1/input/ontology-inputs/ontology-efo-v3.42.0.jsonl | jq 'select(.id == "EFO_0002571")'
{
"id": "EFO_0002571",
"code": "http://www.ebi.ac.uk/efo/EFO_0002571",
"label": "medical procedure",
"definition": "An activity that produces an effect, or that is intended to alter the course of a disease in a patient or population. This is a general term that encompasses the medical, social, behavioral, and environmental acts that can have preventive, therapeutic, or palliative effects.",
"isTherapeuticArea": false,
"synonyms": {
"hasExactSynonym": [
"Procedure",
"Intervention Strategies",
"interventionDescription",
"Interventional",
"Intervention",
"SURGICAL AND MEDICAL PROCEDURES",
"Intervention or Procedure"
]
},
"dbXRefs": [
"NCIt:C25218",
"SNOMEDCT:50731006",
"ICD10:Z41",
"NCIt:C79751"
],
"parents": []
}
The 'patch' for this release will be updating the file created by PIS with the following step:
jq -c '( select(.id == "EFO_0002571") ).isTherapeuticArea |= true' ontology-efo-v3.42.0.jsonl > ontology-efo-200622.jsonl
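For anyone who prefers to script the patch, a rough Python equivalent of the jq step is sketched below. It is an illustration rather than what was run for the release; the function name is hypothetical and the sample records are cut down to three fields.

```python
import json


def patch_therapeutic_area(lines, term_id="EFO_0002571"):
    """Re-emit JSONL records, setting isTherapeuticArea to true on term_id only."""
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        if record.get("id") == term_id:
            record["isTherapeuticArea"] = True
        # Compact separators mimic the one-record-per-line output of jq -c.
        yield json.dumps(record, ensure_ascii=False, separators=(",", ":"))


# Cut-down sample records; the real file has many more fields per line.
sample = [
    '{"id": "EFO_0002571", "label": "medical procedure", "isTherapeuticArea": false}',
    '{"id": "EFO_0009517", "label": "checkup", "isTherapeuticArea": false}',
]
patched = [json.loads(line) for line in patch_therapeutic_area(sample)]
print(patched)
```

As with the jq command, every other record passes through with its values unchanged.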
Running the ETL locally with this updated file shows that no diseases are left without therapeutic areas:
df.filter('therapeuticAreas.isNull).count
res6: Long = 0L
Long-term solution to be addressed in https://github.com/EBISPOT/efo/issues/1636
The patched ontology file worked as expected when processed, and the bug is resolved. The output from David's original query:
{
"data": {
"search": {
"total": 244,
"hits": [
{
"id": "EFO_0009951",
"highlights": [
"response to <em>surgical</em> intervention",
"activity of a cell or an organism as a result of <em>surgical</em> intervention."
],
"object": {
"id": "EFO_0009951"
}
},
{
"id": "ENSG00000271949",
"highlights": [
"<em>SURGICAL</em> AND MEDICAL PROCEDURES"
],
"object": {
"id": "ENSG00000271949"
}
},
...
We'll have to monitor this for the next release to make sure that either SPOT has a fix in place, or we run the patching process again.
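One way to make that check repeatable is to look at the flag in the next ontology file before running the ETL. A minimal sketch, assuming the file keeps the same JSONL layout; the function name is hypothetical.

```python
import json


def needs_patch(lines, term_id="EFO_0002571"):
    """Return True if term_id is present but not flagged as a therapeutic area."""
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        if record.get("id") == term_id:
            return record.get("isTherapeuticArea") is not True
    return False  # term not found, so there is nothing to patch


unfixed = ['{"id": "EFO_0002571", "isTherapeuticArea": false}']
fixed = ['{"id": "EFO_0002571", "isTherapeuticArea": true}']
print(needs_patch(unfixed), needs_patch(fixed))
```

If this returns True for the new release file, the patching step has to be run again.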
When performing the next query, the API produces no results:
This is not the same for other sources:
With my magic ball, I suspect there is a problem with the medical procedures that were added to the disease index in this release. There must be something in the disease dataset, or alternatively in the search ETL, that is messing with this.