opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Pharmacogenomics widget backend #3127

Closed jdhayhurst closed 9 months ago

jdhayhurst commented 11 months ago

Ticket outlining the backend work required for the Pharmacogenomics widget.

Background

See ticket for widget above. Schema

Tasks

Acceptance tests

How do we know the task is complete?

  1. After running the pipeline, the Pharmacogenomics data can be consumed from the API. It should be queryable by 'targetFromSourceId' and 'drugId'.

DSuveges commented 11 months ago

@jdhayhurst, there's one bit that just occurred to me:

ETL - merge ChEMBL ids from drug index using the CHEBI id as the key. Drop unmerged rows.

Everywhere in the OT datasets, drugId refers to the ChEMBL drug identifier, so storing a ChEBI identifier there is wrong. The ChEBI field should be renamed to drugFromSourceId. That would be consistent with evidence sourced from data providers, where the various disease identifiers are stored under diseaseFromSourceId. I'm passing this on to EVA, so the field will be renamed for the next iteration.

Also tagging @ireneisdoomed, who is preparing a patched dataset for this release.
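
The merge step quoted above could be sketched roughly as follows. This is only an illustration in plain Python, not the ETL's actual code: the `chebi_to_chembl` dictionary is a hypothetical stand-in for the drug index, the identifiers are placeholders, and the sketch assumes the ChEBI value already lives in `drugFromSourceId`.

```python
def merge_chembl_ids(records, chebi_to_chembl, drop_unmerged=True):
    """Attach a ChEMBL drugId to each record, keyed on its ChEBI ID.

    records: iterable of dicts carrying the ChEBI ID in "drugFromSourceId".
    chebi_to_chembl: hypothetical lookup standing in for the drug index.
    drop_unmerged: when True, rows without a ChEMBL match are discarded.
    """
    merged = []
    for record in records:
        chembl_id = chebi_to_chembl.get(record.get("drugFromSourceId"))
        if chembl_id is None and drop_unmerged:
            continue  # no ChEMBL match: drop the row
        merged.append({**record, "drugId": chembl_id})
    return merged


# Placeholder identifiers, not real ChEBI/ChEMBL IDs.
chebi_to_chembl = {"CHEBI:A": "CHEMBL_X"}
records = [
    {"drugFromSourceId": "CHEBI:A", "variantRsId": "rs_example_1"},
    {"drugFromSourceId": "CHEBI:B", "variantRsId": "rs_example_2"},
]

merged = merge_chembl_ids(records, chebi_to_chembl)
kept = merge_chembl_ids(records, chebi_to_chembl, drop_unmerged=False)
```

With `drop_unmerged=False` the unmatched row is kept with a null `drugId`, which is one way a dataset could end up with records that have a source identifier but no ChEMBL drug.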

ireneisdoomed commented 11 months ago

This is the transformed dataset that PIS needs to pick up: gs://otar012-eva/pharmacogenomics/cttv012-2023-10-23_pgkb.json.gz

Data schema:

root
 |-- drugId: string (nullable = true)
 |-- drugFromSourceId: string (nullable = true)
 |-- datasourceId: string (nullable = true)
 |-- datasourceVersion: string (nullable = true)
 |-- datatypeId: string (nullable = true)
 |-- drugFromSource: string (nullable = true)
 |-- evidenceLevel: string (nullable = true)
 |-- genotype: string (nullable = true)
 |-- genotypeAnnotationText: string (nullable = true)
 |-- genotypeId: string (nullable = true)
 |-- literature: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- pgxCategory: string (nullable = true)
 |-- phenotypeFromSourceId: string (nullable = true)
 |-- phenotypeText: string (nullable = true)
 |-- studyId: string (nullable = true)
 |-- targetFromSourceId: string (nullable = true)
 |-- variantFunctionalConsequenceId: string (nullable = true)
 |-- variantRsId: string (nullable = true)
 |-- isDirectTarget: boolean (nullable = false)
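
As a rough way to check a local copy of the file against the schema above, something like the following could be used. It is a sketch under assumptions: the file is assumed to be gzipped, newline-delimited JSON, and `read_records` and `validate_record` are hypothetical helper names, not part of PIS or the ETL.

```python
import gzip
import json

# Field names and types copied from the schema above.
EXPECTED_FIELDS = {
    "drugId": str,
    "drugFromSourceId": str,
    "datasourceId": str,
    "datasourceVersion": str,
    "datatypeId": str,
    "drugFromSource": str,
    "evidenceLevel": str,
    "genotype": str,
    "genotypeAnnotationText": str,
    "genotypeId": str,
    "literature": list,
    "pgxCategory": str,
    "phenotypeFromSourceId": str,
    "phenotypeText": str,
    "studyId": str,
    "targetFromSourceId": str,
    "variantFunctionalConsequenceId": str,
    "variantRsId": str,
    "isDirectTarget": bool,
}


def read_records(path):
    """Yield one dict per non-empty line (assumes JSON lines, gzipped)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def validate_record(record):
    """Return a list of human-readable problems; empty means the record fits."""
    problems = [f"unexpected field: {name}" for name in record
                if name not in EXPECTED_FIELDS]
    for name, expected_type in EXPECTED_FIELDS.items():
        value = record.get(name)
        if value is None:
            # Only isDirectTarget is declared nullable = false.
            if name == "isDirectTarget":
                problems.append("isDirectTarget must not be null")
            continue
        if not isinstance(value, expected_type):
            problems.append(
                f"{name}: expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
    literature = record.get("literature")
    if isinstance(literature, list):
        if any(item is not None and not isinstance(item, str) for item in literature):
            problems.append("literature: elements must be strings or null")
    return problems
```

For example, `validate_record(next(read_records("local_copy.json.gz")))` would return an empty list for a conforming first record; the file name here is a placeholder for a downloaded copy.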

The specification of the changes is here https://github.com/opentargets/issues/issues/3128

jdhayhurst commented 11 months ago

All changes complete:

  • PIS: https://github.com/opentargets/platform-input-support/pull/114
  • ETL: https://github.com/opentargets/platform-etl-backend/pull/316
  • POS: https://github.com/opentargets/platform-output-support/pull/36
  • API: https://github.com/opentargets/platform-api/pull/151

ireneisdoomed commented 11 months ago

ETL has digested the new data: gs://open-targets-pre-data-releases/23.12/output/etl/parquet/pharmacogenomics

Some QC

Same schema as the raw data. Same content as well. ✅

  • 1692 records, each representing the combination of a drug, a phenotype, a variant genotype, the genotype description, and the target where that variant is located
  • 407 unique variants/targets/phenotypes
  • 1055 unique variants/targets/phenotypes/drugs
  • 1665 unique variants/genotypes/targets/phenotypes/drugs
  • We only represent higher-confidence records (evidence levels 1 and 2 denote at least a moderate level of evidence supporting the variant-drug combinations)
  • Distribution per pgxCategory:
    +-------------+-----+
    |  pgxCategory|count|
    +-------------+-----+
    |        other|   45|
    |       dosage|   51|
    |metabolism/pk|   54|
    |     efficacy|  291|
    |     toxicity| 1251|
    +-------------+-----+
  • 580 records with a null target -> these variants will only show up in the drug widget
  • 185 records with a null drug (due to 7 ChEBI IDs) -> these variants will only show up in the target widget
  • 123 records with a null phenotype (example)
  • 612 records with a null mapped phenotype
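
Counts like the ones above could be reproduced along these lines. This is a plain-Python sketch rather than the Spark code that was presumably used for the QC; which columns stand for "variant" and "phenotype" is an assumption (`variantRsId` and `phenotypeText` here), and the toy records use placeholder values.

```python
from collections import Counter


def count_distinct(records, fields):
    """Number of distinct value combinations across the given fields."""
    return len({tuple(record.get(f) for f in fields) for record in records})


def count_nulls(records, field):
    """Number of records where the field is missing or null."""
    return sum(1 for record in records if record.get(field) is None)


def category_distribution(records):
    """Record count per pgxCategory."""
    return Counter(record.get("pgxCategory") for record in records)


# Toy records with placeholder values, for illustration only.
records = [
    {"variantRsId": "rs_a", "targetFromSourceId": "gene_1",
     "phenotypeText": "p1", "drugId": "drug_1", "pgxCategory": "toxicity"},
    {"variantRsId": "rs_a", "targetFromSourceId": "gene_1",
     "phenotypeText": "p1", "drugId": "drug_2", "pgxCategory": "toxicity"},
    {"variantRsId": "rs_b", "targetFromSourceId": None,
     "phenotypeText": None, "drugId": "drug_1", "pgxCategory": "efficacy"},
]

variant_keys = ["variantRsId", "targetFromSourceId", "phenotypeText"]
n_variant_combos = count_distinct(records, variant_keys)
n_with_drug = count_distinct(records, variant_keys + ["drugId"])
n_null_target = count_nulls(records, "targetFromSourceId")
distribution = category_distribution(records)
```

A quick sanity check on real data is that the per-category counts sum to the total number of records, as they do in the table above (45 + 51 + 54 + 291 + 1251 = 1692).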