opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Obsoleted disease terms lost in diseaseToPhenotype dataset #2778

Open d0choa opened 1 year ago

d0choa commented 1 year ago

As reported in a recent community post, some diseases that used to have disease2phenotype relationships (ETL code) no longer have them.

The 22.09 hpo-phenotypes.jsonl input file contains entries linked to the obsoleted ID that the user reported ("ORPHA:217607").

> gsutil cat gs://open-targets-data-releases/22.09/input/ontology-inputs/hpo-phenotypes.jsonl | grep '"Familial dilated cardiomyopathy"'

{"databaseId": "ORPHA:217607", "diseaseName": "Familial dilated cardiomyopathy", "qualifier": null, "HPOId": "HP:0001712", "references": ["ORPHA:217607"], "evidenceType": "TAS", "onset": null, "frequency": "HP_0040281", "sex": null, "modifiers": null, "aspect": "P", "biocuration": "ORPHA:orphadata[2022-06-11]", "resource": "HPO"}
{"databaseId": "ORPHA:217607", "diseaseName": "Familial dilated cardiomyopathy", "qualifier": null, "HPOId": "HP:0025169", "references": ["ORPHA:217607"], "evidenceType": "TAS", "onset": null, "frequency": "HP_0040281", "sex": null, "modifiers": null, "aspect": "P", "biocuration": "ORPHA:orphadata[2022-06-11]", "resource": "HPO"}
...

Because the ORPHA:217607 ID has been obsoleted in EFO, my impression is that we are dropping all of these records.

Using the disease index, we can rescue the obsoleted IDs in the same way that we rescue them for evidence. For example, the relevant ID appears in our disease index as an obsolete term of MONDO_0016333 (with the annoying ORPHA == Orphanet conversion):

❯ gsutil cat 'gs://open-targets-data-releases/22.09/output/etl/json/diseases/*.json' | jq 'select(.id == "MONDO_0016333") | [{id:.id, obsoleteTerms:.obsoleteTerms}]'

[
  {
    "id": "MONDO_0016333",
    "obsoleteTerms": [
      "Orphanet_217607"
    ]
  }
]
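A minimal sketch of that rescue in Python, for illustration only; it is not the ETL's actual code. It assumes disease index records carry `id` and `obsoleteTerms` as in the output above, and that the phenotype input uses `databaseId` values of the form `PREFIX:number`. The function names and the prefix map are hypothetical.

```python
# Hypothetical prefix conversions between the phenotype input and the disease index.
PREFIX_MAP = {"ORPHA": "Orphanet"}


def normalise_id(database_id):
    """Convert an input-style id such as 'ORPHA:217607' to 'Orphanet_217607'."""
    prefix, sep, local = database_id.partition(":")
    if not sep:
        # No colon found: assume the id is already in index style.
        return database_id
    return f"{PREFIX_MAP.get(prefix, prefix)}_{local}"


def build_obsolete_lookup(diseases):
    """Map every obsolete term to the id of the current disease that lists it."""
    lookup = {}
    for disease in diseases:
        for term in disease.get("obsoleteTerms") or []:
            lookup[term] = disease["id"]
    return lookup


def resolve_disease_id(database_id, current_ids, obsolete_lookup):
    """Return the current disease id, or None if the id cannot be resolved."""
    candidate = normalise_id(database_id)
    if candidate in current_ids:
        return candidate
    return obsolete_lookup.get(candidate)


# Disease index record taken from the jq output above.
diseases = [{"id": "MONDO_0016333", "obsoleteTerms": ["Orphanet_217607"]}]
current_ids = {d["id"] for d in diseases}
lookup = build_obsolete_lookup(diseases)

print(resolve_disease_id("ORPHA:217607", current_ids, lookup))  # MONDO_0016333
```

With a lookup like this, the two hpo-phenotypes.jsonl records shown earlier would be attached to MONDO_0016333 instead of being dropped.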

Half-baked feature, bug or enhancement, depending on how you see it ;)

DSuveges commented 1 year ago

PIS ingests the human phenotype ontology from obolibrary and structures it without much transformation.

I would suggest adding a resolveDiseases step to the disease object generation, similar to the one used when ingesting disease/target evidence, to check whether the provided disease identifier is an obsoleted ID for an existing term. I know I'm proposing to check against a dataset that is only just being generated, so that adds a bit of complexity.

I'm wondering if such logic could be abstracted so that a "validation" step is executed any time a disease is ingested by the ETL (regardless of the data type or source it comes from). It would prevent introducing discrepancies.
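One way such a shared step could look, sketched in Python under stated assumptions: the function name `validate_diseases`, its parameters, and the placeholder ID `Orphanet_EXAMPLE` are all hypothetical, and the IDs are assumed to be already normalised to the disease index style. The only dataset-specific input is the name of the field that holds the disease ID.

```python
def validate_diseases(records, id_field, current_ids, obsolete_lookup):
    """Split records into (resolved, unresolved) by their disease id.

    Records pointing at an obsoleted id are rewritten to the current id;
    records that match nothing are returned separately so they can be
    reported instead of silently dropped.
    """
    resolved, unresolved = [], []
    for record in records:
        disease_id = record.get(id_field)
        if disease_id in current_ids:
            resolved.append(record)
        elif disease_id in obsolete_lookup:
            # Copy the record so the caller's input is left untouched.
            resolved.append({**record, id_field: obsolete_lookup[disease_id]})
        else:
            unresolved.append(record)
    return resolved, unresolved


# Illustrative inputs: one real obsolete mapping from this issue,
# plus a made-up placeholder id that resolves to nothing.
current_ids = {"MONDO_0016333"}
obsolete_lookup = {"Orphanet_217607": "MONDO_0016333"}
phenotypes = [
    {"databaseId": "Orphanet_217607", "HPOId": "HP:0001712"},
    {"databaseId": "Orphanet_EXAMPLE", "HPOId": "HP:0001712"},
]

kept, dropped = validate_diseases(
    phenotypes, "databaseId", current_ids, obsolete_lookup
)
print(len(kept), len(dropped))  # 1 1
```

Returning the unresolved records rather than discarding them would make losses like the one reported here visible at ingestion time.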

d0choa commented 1 year ago

@JarrodBaker could you help us scope the task? Would it require much work in the ETL?

mbdebian commented 1 year ago

@d0choa , I wonder whether we can close this issue, as it's related to a release from last year.

mbdebian commented 1 year ago

I'll close it, but if you think it's still relevant, please feel free to reopen it.

prashantuniyal02 commented 1 year ago

https://community.opentargets.org/t/duplicated-diseases-in-22-06/702/1

DSuveges commented 1 year ago

This newly opened bug report is related to this issue: https://github.com/opentargets/issues/issues/2929

In practice, the ETL needs to resolve diseases in all datasets where we get disease information, not only in evidence.

mbdebian commented 5 months ago

@remo87, would you mind getting together with @tskir to find out whether this is still relevant and what the next steps are? Thanks!