Closed jdhayhurst closed 9 months ago
@jdhayhurst, there's one bit that just occurred to me:
ETL - merge ChEMBL ids from drug index using the CHEBI id as the key. Drop unmerged rows.
Everywhere in the OT datasets, drugId refers to ChEMBL drug identifiers, so storing a ChEBI id there is wrong. The ChEBI field should be renamed to drugFromSourceId; that would be consistent with evidence sourced from data providers, where the various disease identifiers are stored under diseaseFromSourceId. I'm telling this to EVA, so for the next iteration we will have this field renamed.
Also tagging @ireneisdoomed , as he's preparing a patched dataset for this release.
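For illustration, the rename proposed above could look roughly like the sketch below when applied to one JSON record. The helper name `rename_chebi_field` and the example record are hypothetical; the only assumption is that the incoming record carries the ChEBI id under drugId.

```python
def rename_chebi_field(record: dict) -> dict:
    """Return a copy of the record with the source (ChEBI) id moved
    from drugId to drugFromSourceId.

    Hypothetical helper: assumes the incoming drugId holds a ChEBI id.
    drugId is left unset so that a later step can fill in the ChEMBL id.
    """
    out = dict(record)  # shallow copy, the input record is not mutated
    if "drugId" in out:
        out["drugFromSourceId"] = out.pop("drugId")
    return out


# Invented example record, not taken from the real dataset.
example = {"drugId": "CHEBI_0000", "pgxCategory": "toxicity"}
renamed = rename_chebi_field(example)
```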
This is the transformed dataset that PIS needs to pick up: gs://otar012-eva/pharmacogenomics/cttv012-2023-10-23_pgkb.json.gz
Data schema:
root
|-- drugId: string (nullable = true)
|-- drugFromSourceId: string (nullable = true)
|-- datasourceId: string (nullable = true)
|-- datasourceVersion: string (nullable = true)
|-- datatypeId: string (nullable = true)
|-- drugFromSource: string (nullable = true)
|-- evidenceLevel: string (nullable = true)
|-- genotype: string (nullable = true)
|-- genotypeAnnotationText: string (nullable = true)
|-- genotypeId: string (nullable = true)
|-- literature: array (nullable = true)
| |-- element: string (containsNull = true)
|-- pgxCategory: string (nullable = true)
|-- phenotypeFromSourceId: string (nullable = true)
|-- phenotypeText: string (nullable = true)
|-- studyId: string (nullable = true)
|-- targetFromSourceId: string (nullable = true)
|-- variantFunctionalConsequenceId: string (nullable = true)
|-- variantRsId: string (nullable = true)
|-- isDirectTarget: boolean (nullable = false)
The specification of the changes is here https://github.com/opentargets/issues/issues/3128
ETL has digested the new data gs://open-targets-pre-data-releases/23.12/output/etl/parquet/pharmacogenomics
Same schema as the raw data. Same content as well. ✅
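A check like "same schema, same content" can be sketched in plain Python as follows. This is only an approximation of whatever was actually run: it assumes both datasets have already been loaded as lists of dicts, compares field names only (not types), and the function name `same_schema_and_content` is made up for this example.

```python
import json
from collections import Counter


def same_schema_and_content(raw: list, etl: list) -> bool:
    """True if both record lists expose the same field names and contain
    the same records, ignoring record order and key order."""
    raw_fields = set().union(*(r.keys() for r in raw))
    etl_fields = set().union(*(r.keys() for r in etl))
    if raw_fields != etl_fields:
        return False

    def canon(record: dict) -> str:
        # Canonical JSON string so records can be counted as a multiset.
        return json.dumps(record, sort_keys=True)

    return Counter(map(canon, raw)) == Counter(map(canon, etl))


# Invented toy records.
raw_rows = [
    {"drugId": "CHEMBL1", "literature": ["1", "2"]},
    {"drugId": None, "literature": None},
]
etl_rows = [
    {"literature": None, "drugId": None},
    {"literature": ["1", "2"], "drugId": "CHEMBL1"},
]
```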
- 1692 records, each representing the combination of a drug, a phenotype, a variant genotype, the genotype description, and the target where that variant is located
- 407 unique variants/targets/phenotypes
- 1055 unique variants/targets/phenotypes/drugs
- 1665 unique variants/genotypes/targets/phenotypes/drugs
- We only represent higher-confidence records (evidence levels 1 and 2 denote at least a moderate level of evidence supporting the variant-drug combinations)
- Distribution per pgxCategory:
+-------------+-----+
|  pgxCategory|count|
+-------------+-----+
|        other|   45|
|       dosage|   51|
|metabolism/pk|   54|
|     efficacy|  291|
|     toxicity| 1251|
+-------------+-----+
- 580 records with a null target -> these variants will only show up in the drug widget
- 185 records with a null drug (due to 7 CHEBIs) -> these variants will only show up in the target widget
- 123 records with a null phenotype (example)
- 612 records with a null mapped phenotype
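The kind of summary listed above (distinct combinations, per-category distribution, null counts) can be reproduced with a few small helpers. This is a sketch over invented toy records, not the query that produced the numbers. Which columns stand for "variant", "target" and "phenotype" is an assumption here (variantRsId, targetFromSourceId, phenotypeText).

```python
from collections import Counter


def count_distinct(records, fields):
    """Number of distinct value combinations across the given fields."""
    return len({tuple(r.get(f) for f in fields) for r in records})


def count_null(records, field):
    """Number of records where the field is missing or null."""
    return sum(1 for r in records if r.get(field) is None)


def category_distribution(records):
    """Record count per pgxCategory."""
    return Counter(r.get("pgxCategory") for r in records)


# Invented toy records using field names from the schema above.
records = [
    {"variantRsId": "rs1", "targetFromSourceId": "ENSG_A", "phenotypeText": "p1",
     "drugId": "CHEMBL1", "genotype": "AA", "pgxCategory": "toxicity"},
    {"variantRsId": "rs1", "targetFromSourceId": "ENSG_A", "phenotypeText": "p1",
     "drugId": "CHEMBL1", "genotype": "AG", "pgxCategory": "toxicity"},
    {"variantRsId": "rs2", "targetFromSourceId": None, "phenotypeText": "p2",
     "drugId": None, "genotype": "CC", "pgxCategory": "efficacy"},
]

variant_keys = ["variantRsId", "targetFromSourceId", "phenotypeText"]
n_variant_combos = count_distinct(records, variant_keys)
n_genotype_combos = count_distinct(records, variant_keys + ["genotype", "drugId"])
```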
Ticket outlining the backend work required for the Pharmacogenomics widget.
Background
See ticket for widget above.
Schema
Tasks
ETL - merge ChEMBL ids from drug index using the CHEBI id as the key. Drop unmerged rows.
Edit: ETL step is out of scope for '23_12' release but will be added in '2402'. Data team are generating the ETL output. So, for this release, create the necessary structure in the ETL to handle the future changes, but right now only copy the input to the output.
Acceptance tests
How do we know the task is complete?
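As a rough illustration of the merge task described under Tasks (planned for '2402'), the join could look like the plain-Python sketch below. The drug index layout (`id` and `chebiIds` keys) and the function name `merge_chembl_ids` are hypothetical and used only to show the join-and-drop logic. The real ETL step and the real drug index structure may differ.

```python
def merge_chembl_ids(evidence, drug_index):
    """Inner join of evidence rows against a drug index on the ChEBI id.

    Assumptions (hypothetical layout): each drug index entry has an "id"
    (ChEMBL id) and a "chebiIds" list; each evidence row carries its ChEBI
    id under drugFromSourceId. Rows without a match are dropped.
    """
    chebi_to_chembl = {}
    for drug in drug_index:
        for chebi_id in drug.get("chebiIds", []):
            chebi_to_chembl[chebi_id] = drug["id"]

    merged = []
    for row in evidence:
        chembl_id = chebi_to_chembl.get(row.get("drugFromSourceId"))
        if chembl_id is None:
            continue  # unmerged row: dropped, as the task specifies
        merged.append({**row, "drugId": chembl_id})
    return merged


# Invented toy inputs.
evidence_rows = [
    {"drugFromSourceId": "CHEBI_1", "variantRsId": "rs1"},
    {"drugFromSourceId": "CHEBI_9", "variantRsId": "rs2"},
    {"drugFromSourceId": None, "variantRsId": "rs3"},
]
drug_index_rows = [{"id": "CHEMBL1", "chebiIds": ["CHEBI_1"]}]
merged_rows = merge_chembl_ids(evidence_rows, drug_index_rows)
```

One possible acceptance check along these lines: every output row has a non-null drugId, and the number of dropped rows equals the number of input rows whose ChEBI id is absent from the drug index.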