opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Process PharmGKB data based on latest discussions #3128

Closed ireneisdoomed closed 7 months ago

ireneisdoomed commented 8 months ago

As the data team, we want to process the pharmacogenomics data provided by EVA because it is still being iterated on.

Background

EVA has been working on processing pharmacogenomics data from PharmGKB. We've had many discussions about how we should account for more granular scenarios like:

After reviewing the data and widget specifications (https://github.com/opentargets/issues/issues/3113), we want to process the data provided by EVA so that the ETL ingests it in the expected shape and format.

Tasks

Based on conversations with @buniello, Ellie, EVA, and the data team, and as a one-off solution, for the 23.12 release we want to:

ireneisdoomed commented 8 months ago

New PharmGKB data has been added to the bucket: gs://otar012-eva/pharmacogenomics/cttv012-2023-10-23_pgkb.json.gz

This is the code used to produce it: https://gist.github.com/ireneisdoomed/acb02331fc4866eece41aaea2fd9f7d3. This logic is expected to be integrated into the ETL for 24.02, @prashantuniyal02.

ireneisdoomed commented 7 months ago

Ellie reported a bug in the flag we add to the data that says whether the target is indirect or not. Looking at ivacaftor/CFTR, the flag incorrectly said that CFTR was not the drug target.


There was a bug in my logic (now fixed): when building the lookup table between each drug and its targets, I was filtering out drugs with a single target as MoA (f.size("linkedTargets.rows") = 1 instead of f.size("linkedTargets.rows") >= 1).

This is now fixed and available in the new file: gs://otar012-eva/pharmacogenomics/cttv012-2023-11-22_pgkb.json.gz
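For reference, here is a minimal sketch of the lookup logic described above. It is plain Python, not the code from the gist, and it only reproduces the symptom as described (drugs with a single target missing from the lookup). The record layout, the `min_targets` parameter, the function names, and the "DRUG_X" and "GENE_*" entries are assumptions for illustration; only the ivacaftor/CFTR pair and the `linkedTargets.rows` field name come from this thread.

```python
from typing import Dict, List, Set

# Illustrative drug records. Only ivacaftor/CFTR is taken from the thread;
# "DRUG_X", "DRUG_Y" and the "GENE_*" targets are hypothetical placeholders.
DRUGS: List[dict] = [
    {"id": "ivacaftor", "linkedTargets": {"rows": ["CFTR"]}},
    {"id": "DRUG_X", "linkedTargets": {"rows": ["GENE_A", "GENE_B"]}},
    {"id": "DRUG_Y", "linkedTargets": {"rows": []}},
]


def build_drug_target_lookup(drugs: List[dict], min_targets: int = 1) -> Dict[str, Set[str]]:
    """Map each drug to its set of targets, keeping drugs with at least
    `min_targets` linked targets (hypothetical parameter)."""
    lookup: Dict[str, Set[str]] = {}
    for drug in drugs:
        targets = drug["linkedTargets"]["rows"]
        if len(targets) >= min_targets:
            lookup[drug["id"]] = set(targets)
    return lookup


def is_direct_target(drug_id: str, gene_id: str, lookup: Dict[str, Set[str]]) -> bool:
    """True when the gene is one of the drug's linked targets."""
    return gene_id in lookup.get(drug_id, set())


# Keeping drugs with one or more targets: ivacaftor/CFTR is flagged as direct.
fixed_lookup = build_drug_target_lookup(DRUGS, min_targets=1)
fixed_flag = is_direct_target("ivacaftor", "CFTR", fixed_lookup)

# Dropping single-target drugs reproduces the reported symptom:
# ivacaftor is missing from the lookup, so CFTR is flagged as not direct.
buggy_lookup = build_drug_target_lookup(DRUGS, min_targets=2)
buggy_flag = is_direct_target("ivacaftor", "CFTR", buggy_lookup)
```

With `min_targets=1` the ivacaftor/CFTR pair is flagged as a direct target; with `min_targets=2` it is not, which matches the behaviour Ellie reported.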

QC

The implications are important, although the number of affected records is small: most of the variants curated by PharmGKB refer to a gene other than the drug target.

# Distribution before the bugfix
+--------------+-----+                                                          
|isDirectTarget|count|
+--------------+-----+
|         false| 1690|
|          true|    2|
+--------------+-----+

# Distribution after the bugfix
+--------------+-----+                                                          
|isDirectTarget|count|
+--------------+-----+
|         false| 1587|
|          true|  105|
+--------------+-----+
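To put the two distributions side by side, a quick back-of-the-envelope calculation using only the counts in the tables above. The variable names are illustrative, and the "net change" assumes no record moved from true to false, which the tables alone cannot confirm.

```python
# Counts copied from the two isDirectTarget distributions above.
before = {False: 1690, True: 2}
after = {False: 1587, True: 105}

total_before = sum(before.values())
total_after = sum(after.values())

# Net number of records whose flag changed to true after the bugfix
# (assumes no record moved in the opposite direction).
net_change = after[True] - before[True]

# Share of all records affected by the fix, as a percentage.
share_changed_pct = round(100 * net_change / total_after, 1)

print(f"total records: {total_before} before, {total_after} after")
print(f"net change in direct targets: {net_change} ({share_changed_pct}% of records)")
```

The totals match (1692 records in both runs), so the fix only relabels records: a net 103 of them, about 6.1% of the dataset.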

I've checked other examples, and they look alright:

(The UI for the last example is incorrect as of today, but it'll show the target is direct once the data is updated.)

ireneisdoomed commented 7 months ago

This is done. @prashantuniyal02, we have to remember to come back to this for the integration of the logic into the ETL.