Pharmacogenomics widget backend

jdhayhurst commented 11 months ago

Ticket outlining the backend work required for the Pharmacogenomics widget.

Background

See ticket for widget above. Schema

Tasks

[x] Spec API - check schema with FE
[x] PIS - fetch new datasource and stage it for ETL
[x] ~~ETL - merge ChEMBL ids from drug index using the CHEBI id as the key. Drop unmerged rows.~~ _Edit: the ETL step is out of scope for the '23_12' release but will be added in '2402'. The data team is generating the ETL output, so for this release create the necessary structure in the ETL to handle the future changes, but for now only copy the input to the output._
[x] POS - add dataset to Elastic(?)
[x] API - implement in the API - more details to follow.
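
For reference, the struck-through ETL step (merge ChEMBL ids using the ChEBI id as the key, drop unmerged rows) amounts to an inner join. Below is a minimal sketch in plain Python, not the actual ETL code. It assumes the ChEBI id sits in a field called `drugFromSourceId` (the name used in the schema posted later in this thread) and that the drug index can be reduced to a ChEBI-to-ChEMBL lookup. The function name, the lookup shape, and all identifiers in the example are hypothetical placeholders.

```python
from typing import Dict, Iterable, List, Optional

Record = Dict[str, Optional[str]]


def merge_chembl_ids(
    records: Iterable[Record], chebi_to_chembl: Dict[str, str]
) -> List[Record]:
    """Attach a ChEMBL id to each record via its ChEBI id; drop unmerged rows."""
    merged: List[Record] = []
    for record in records:
        chebi_id = record.get("drugFromSourceId")  # assumed field name
        chembl_id = chebi_to_chembl.get(chebi_id) if chebi_id is not None else None
        if chembl_id is None:
            continue  # no match in the lookup: the row is dropped
        # Copy the record so the input is left untouched.
        merged.append({**record, "drugId": chembl_id})
    return merged


# Placeholder identifiers, not real ChEBI/ChEMBL ids.
example_records = [
    {"drugFromSourceId": "CHEBI_EXAMPLE_1", "variantRsId": "rs_example_1"},
    {"drugFromSourceId": "CHEBI_EXAMPLE_2", "variantRsId": "rs_example_2"},
]
example_lookup = {"CHEBI_EXAMPLE_1": "CHEMBL_EXAMPLE_1"}
merged_example = merge_chembl_ids(example_records, example_lookup)
```

Only the first example record survives, because the second ChEBI id has no entry in the lookup.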

Acceptance tests

How do we know the task is complete?

After running the pipeline, the Pharmacogenomics data can be consumed from the API. It should be queryable by 'targetFromSourceId' and 'drugId'.
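
As an illustration of what "queryable by" means for the acceptance test, here is a small in-memory sketch that indexes records by either key. It is not the API implementation. `build_index` and the placeholder values are hypothetical, and records with a null key are simply left out of that index.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def build_index(records: Iterable[Row], field: str) -> Dict[str, List[Row]]:
    """Group records by the value of `field`, skipping records where it is null."""
    index: Dict[str, List[Row]] = defaultdict(list)
    for record in records:
        key = record.get(field)
        if key is not None:
            index[key].append(record)
    return dict(index)


# Placeholder values, not real identifiers.
rows = [
    {"drugId": "DRUG_A", "targetFromSourceId": "TARGET_X"},
    {"drugId": "DRUG_A", "targetFromSourceId": None},
    {"drugId": None, "targetFromSourceId": "TARGET_X"},
]
by_drug = build_index(rows, "drugId")
by_target = build_index(rows, "targetFromSourceId")
```

A record with a null target is still reachable through the drug index, and the reverse holds for a record with a null drug.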

DSuveges commented 11 months ago

@jdhayhurst, there's one bit which just occurred to me:

ETL - merge ChEMBL ids from drug index using the CHEBI id as the key. Drop unmerged rows.

Everywhere in the OT datasets, drugId refers to the ChEMBL drug identifier, so storing a ChEBI id there is wrong. The ChEBI field should be renamed to drugFromSourceId, which would be consistent with evidence sourced from data providers, where the various disease identifiers are stored under diseaseFromSourceId. I'm letting EVA know, so the field will be renamed in the next iteration.

Also tagging @ireneisdoomed, who is preparing a patched dataset for this release.

ireneisdoomed commented 11 months ago

This is the transformed dataset that PIS needs to pick up: gs://otar012-eva/pharmacogenomics/cttv012-2023-10-23_pgkb.json.gz

Data schema:

root
 |-- drugId: string (nullable = true)
 |-- drugFromSourceId: string (nullable = true)
 |-- datasourceId: string (nullable = true)
 |-- datasourceVersion: string (nullable = true)
 |-- datatypeId: string (nullable = true)
 |-- drugFromSource: string (nullable = true)
 |-- evidenceLevel: string (nullable = true)
 |-- genotype: string (nullable = true)
 |-- genotypeAnnotationText: string (nullable = true)
 |-- genotypeId: string (nullable = true)
 |-- literature: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- pgxCategory: string (nullable = true)
 |-- phenotypeFromSourceId: string (nullable = true)
 |-- phenotypeText: string (nullable = true)
 |-- studyId: string (nullable = true)
 |-- targetFromSourceId: string (nullable = true)
 |-- variantFunctionalConsequenceId: string (nullable = true)
 |-- variantRsId: string (nullable = true)
 |-- isDirectTarget: boolean (nullable = false)
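
A quick way to sanity-check records against this schema is sketched below in plain Python. It is only an illustration: `SCHEMA` and `validate_record` are hypothetical names, and the check covers types and nullability as printed above, nothing more.

```python
from typing import Any, Dict, List, Tuple

# Nullable string fields, as listed in the schema above.
STRING_FIELDS = [
    "drugId", "drugFromSourceId", "datasourceId", "datasourceVersion",
    "datatypeId", "drugFromSource", "evidenceLevel", "genotype",
    "genotypeAnnotationText", "genotypeId", "pgxCategory",
    "phenotypeFromSourceId", "phenotypeText", "studyId",
    "targetFromSourceId", "variantFunctionalConsequenceId", "variantRsId",
]

# field name -> (expected Python type, nullable)
SCHEMA: Dict[str, Tuple[type, bool]] = {name: (str, True) for name in STRING_FIELDS}
SCHEMA["literature"] = (list, True)       # array of strings, elements may be null
SCHEMA["isDirectTarget"] = (bool, False)  # the only non-nullable field


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the record fits the schema."""
    problems: List[str] = []
    for name, (expected_type, nullable) in SCHEMA.items():
        value = record.get(name)
        if value is None:
            if not nullable:
                problems.append(f"{name}: must not be null")
        elif not isinstance(value, expected_type):
            problems.append(
                f"{name}: expected {expected_type.__name__}, got {type(value).__name__}"
            )
    literature = record.get("literature")
    if isinstance(literature, list):
        if any(item is not None and not isinstance(item, str) for item in literature):
            problems.append("literature: elements must be strings or null")
    return problems
```

For example, `validate_record({"isDirectTarget": True})` returns an empty list, since every other field is nullable.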

The specification of the changes is here: https://github.com/opentargets/issues/issues/3128

jdhayhurst commented 11 months ago

All changes complete:

PIS: https://github.com/opentargets/platform-input-support/pull/114
ETL: https://github.com/opentargets/platform-etl-backend/pull/316
POS: https://github.com/opentargets/platform-output-support/pull/36
API: https://github.com/opentargets/platform-api/pull/151

ireneisdoomed commented 11 months ago

ETL has digested the new data: gs://open-targets-pre-data-releases/23.12/output/etl/parquet/pharmacogenomics

Some QC

Same schema as the raw data. Same content as well. ✅
1692 records, each representing the combination of a drug, a phenotype, a variant genotype, the genotype description, and the target where that variant is located

407 unique variants/targets/phenotypes

1055 unique variants/targets/phenotypes/drugs

1665 unique variants/genotypes/targets/phenotypes/drugs

We only represent higher-confidence records (evidence levels 1 and 2 denote at least a moderate level of evidence supporting the variant-drug combinations).
Distribution per pgxCategory:
+-------------+-----+
|  pgxCategory|count|
+-------------+-----+
|        other|   45|
|       dosage|   51|
|metabolism/pk|   54|
|     efficacy|  291|
|     toxicity| 1251|
+-------------+-----+
580 records with a null target -> these variants will only show up in the drug widget

185 records with a null drug (due to 7 CHEBIs) -> these variants will only show up in the target widget

123 records with a null phenotype (example)

612 records with a null mapped phenotype
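
The QC figures above are distinct-combination counts, null counts, and a per-category tally. A minimal sketch of those three checks in plain Python follows; it is not the code that produced the numbers. The mapping of "variant", "target", "phenotype" and "drug" to `variantRsId`, `targetFromSourceId`, `phenotypeText` and `drugId` is an assumption, the helper names are hypothetical, and the toy rows are placeholders. The last lines only check that the reported pgxCategory counts add up to the reported record total.

```python
from collections import Counter
from typing import Any, Dict, List, Sequence

Row = Dict[str, Any]


def count_unique(records: Sequence[Row], fields: Sequence[str]) -> int:
    """Number of distinct value combinations across `fields`."""
    return len({tuple(record.get(f) for f in fields) for record in records})


def count_nulls(records: Sequence[Row], field: str) -> int:
    """Number of records where `field` is missing or null."""
    return sum(1 for record in records if record.get(field) is None)


def category_distribution(records: Sequence[Row]) -> Counter:
    """Record count per pgxCategory."""
    return Counter(record.get("pgxCategory") for record in records)


# Toy rows with placeholder values.
toy: List[Row] = [
    {"variantRsId": "v1", "targetFromSourceId": "t1", "phenotypeText": "p1",
     "drugId": "d1", "pgxCategory": "toxicity"},
    {"variantRsId": "v1", "targetFromSourceId": "t1", "phenotypeText": "p1",
     "drugId": "d2", "pgxCategory": "toxicity"},
    {"variantRsId": "v2", "targetFromSourceId": None, "phenotypeText": None,
     "drugId": None, "pgxCategory": "efficacy"},
]
base_fields = ["variantRsId", "targetFromSourceId", "phenotypeText"]
unique_base = count_unique(toy, base_fields)
unique_with_drug = count_unique(toy, base_fields + ["drugId"])
null_targets = count_nulls(toy, "targetFromSourceId")
toy_distribution = category_distribution(toy)

# Consistency check on the numbers reported in this comment.
reported_counts = {"other": 45, "dosage": 51, "metabolism/pk": 54,
                   "efficacy": 291, "toxicity": 1251}
reported_total = sum(reported_counts.values())
```

The reported category counts sum to 1692, which matches the record total given above.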
