opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0
12 stars 2 forks source link

Handle drug combinations in pharmacogenetics #3205

Closed ireneisdoomed closed 3 months ago

ireneisdoomed commented 8 months ago

As the data team, I want to handle pharmacogenetic evidence involving drug combinations where currently no drugId is mapped because this will ensure a more accurate representation of all pharmacogenetic evidence.

Background

Our current pharmacogenetics data model is structured to handle evidence involving a single variant/drug interaction. We have encountered a limitation where evidence involving drug combinations is not mapped to any drug IDs. CFTR w/ ivacaftor/lumacaftor is an example.

image

Important to note that this issue is really limited, I've only found 6 offending cases, all of them related with CFTR variation.

+--------------------------+----------------------+------+
|genotypeId                |drugFromSource        |drugId|
+--------------------------+----------------------+------+
|7_117559591_TCTT_TCTT,TCTT|ivacaftor / lumacaftor|null  |
|7_117559591_TCTT_T,T      |ivacaftor / tezacaftor|null  |
|7_117559591_TCTT_T,TCTT   |ivacaftor / tezacaftor|null  |
|7_117559591_TCTT_T,TCTT   |ivacaftor / lumacaftor|null  |
|7_117559591_TCTT_T,T      |ivacaftor / lumacaftor|null  |
|7_117559591_TCTT_TCTT,TCTT|ivacaftor / tezacaftor|null  |
+--------------------------+----------------------+------+

Proposed Solutions

Two potential solutions are considered:

  1. Expanding drugFromSource and drugId fields to allow lists. This would be more future-proof, accommodating drug combinations but requires significant alterations to the data model.
  2. Exploding evidenceinto separate records for each drug. This approach kind of duplicates the data, but it maintains its structure.

Given the rarity of the drug combinations, I think that exploding is the most practical solution. The text that accompanies the genotype clearly specifies that the response is observed for the combination, so this field should be taken into consideration as context.

Tasks

This is not meant to be done for 24.03.

Acceptance tests

  1. When I go to the CFTR PGx widget, all evidence have a mapped drugId.
DSuveges commented 7 months ago

Even if the evidence is exploded to multiple drugId-s, I think it is possible to keep the drugFromSource as it is, so indicating the evidence is a combination in nature.

ireneisdoomed commented 7 months ago

We actually use the string drugFromSource in the drug mapping step to assign a drugId, I think it is more practical to explode.

DSuveges commented 7 months ago

Yes, we should explode, however it would be quite informative to know that these annotations are drug combination, there should be some provenance for that.

ireneisdoomed commented 5 months ago

After a discussion with @DSuveges and the EVA team, we have decided that we want to keep an accurate definition of the drugs involved in the PGx evidence. This means that we will capture the scenario where a genotype affects the response to a combination of drugs (as opposed to only allow for single drugs)

This means that we are changing the schema:

The changes in the schema have downstream implications:

@opentargets/be-team

I'm very happy to sit next week with someone of the team to get this done. It should be straightforward

@opentargets/fe-team

cc @prashantuniyal02

mbdebian commented 4 months ago

I'm looking at the new pharmacogenomics dataset as collected by PIS:

root
 |-- datasourceId: string (nullable = true)
 |-- datasourceVersion: string (nullable = true)
 |-- datatypeId: string (nullable = true)
 |-- directionality: string (nullable = true)
 |-- drugs: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- drugFromSource: string (nullable = true)
 |-- evidenceLevel: string (nullable = true)
 |-- genotype: string (nullable = true)
 |-- genotypeAnnotationText: string (nullable = true)
 |-- genotypeId: string (nullable = true)
 |-- haplotypeFromSourceId: string (nullable = true)
 |-- haplotypeId: string (nullable = true)
 |-- literature: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- pgxCategory: string (nullable = true)
 |-- phenotypeFromSourceId: string (nullable = true)
 |-- phenotypeText: string (nullable = true)
 |-- studyId: string (nullable = true)
 |-- targetFromSourceId: string (nullable = true)
 |-- variantFunctionalConsequenceId: string (nullable = true)
 |-- variantRsId: string (nullable = true)

I'll review the ETL step pointed out by @ireneisdoomed

ireneisdoomed commented 4 months ago

@mbdebian After looking at the outputs after the changes in the drug mapping part of the ETL, I think they need a revision.

I observed missing rows between the PGx input (31,618) and the ETL output (24,573). This discrepancy is due to the way drug information is handled here:

This approach merges evidence that should be kept separate. Let me explain more graphically: evidenceId response drugs
1 trait_a a
2 trait_a b
3 trait_a c, d
The current transformation loses this distinction, resulting in a single evidence record that lists all drugs associated with a given response, independently of whether they are individual or combinatorial: evidenceId response drugs
1 trait_a a, b, c, d

For this reason, I'd suggest revising this step and keeping the drug array structure and use another strategy like transform to add the drug ID or adding an ID just before exploding to uniquely identify each evidence. For QC, I suggest you check the sizes of the drugs array, that's how I identified the issue:

# Post ETL 
+-----------+-----+
|size(drugs)|count|
+-----------+-----+
|          1|20600|
|          6|   79|
|          3|  849|
|          4|  268|
|          8|    3|
|          5|  126|
|          2| 2453|
|          7|  165|
|         10|   27|
|         13|    3|
+-----------+-----+

# Pre ETL
+-----------+-----+
|size(drugs)|count|
+-----------+-----+
|          1|31516|
|          2|  100|
|          3|    2|
+-----------+-----+
mbdebian commented 4 months ago

I have changed the code to use an operational row ID, and after running the pharmacogenomics step, I can confirm that the drug count stays the same:

Output

scala> outPharma.withColumn("drugsSize", size(col("drugs"))).groupBy("drugsSize").agg(count("*").as("count")).orderBy(col("count").desc).show
+---------+-----+
|drugsSize|count|
+---------+-----+
|        1|42358|
|        2|  164|
|        3|    2|
+---------+-----+

Original pharmacogenomics input dataset

scala> original.withColumn("drugsSize", size(col("drugs"))).groupBy("drugsSize").agg(count("*").as("count")).orderBy(col("count").desc).show
+---------+-----+
|drugsSize|count|
+---------+-----+
|        1|42358|
|        2|  164|
|        3|    2|
+---------+-----+

The output for this run just for pharmacognomics can be found at gs://ot-team/mbernal/data/issue_3205/output

mbdebian commented 4 months ago

I just corrected the bug in the grouping operation, results are in the same place gs://ot-team/mbernal/data/issue_3205/output, the schema is now correct and the cardinality preserved.

And I have also run ETL just for pharmacogenomics, so the data at gs://open-targets-pre-data-releases/24.06/output/etl/**/pharmacogenomics is up to date.

mbdebian commented 4 months ago

ETL changes completed, now it needs to be addressed at the API

jdhayhurst commented 4 months ago

Hi, the only issue I can see in implementing this into the API is that currently the API resolves a drug field, which is a Drug type object, by doing a lookup on drugId. Now that we have multiple drugId's, we break the resolver which is going from one drugId to one drug. Do we need to keep drug in the pharmacogenomics object (I don't think it's actually being used by the frontend)? If so, how should we handle multiple drug objects (I'd have thought a separate list of drugs would be sufficient)? Thanks!

ireneisdoomed commented 4 months ago

Based on a conversation offline with @jdhayhurst we have set that:

Given this, we propose a new type to represent drugs DrugWithIdentifiers { drug: Drug, drugFromSource: String, drugId: String }. This object is adaptable to both scenarios:

  1. I have a drugFromSource that is mapped therefore drug can be resolved
  2. I have a drugFromSource that is not mapped therefore i can just show drugFromSource

Actions:

jdhayhurst commented 3 months ago

The data model has now been updated:


type Pharmacogenomics {
  datasourceId: String
  datatypeId: String
  evidenceLevel: String
  genotype: String
  genotypeAnnotationText: String
  genotypeId: String
  haplotypeFromSourceId: String
  haplotypeId: String
  literature: [String!]
  pgxCategory: String
  phenotypeFromSourceId: String
  phenotypeText: String
  studyId: String
  targetFromSourceId: String
  variantFunctionalConsequenceId: String
  variantRsId: String
  isDirectTarget: Boolean!
  variantFunctionalConsequence: SequenceOntologyTerm

  # Target entity
  target: Target

  # Drug List
  drugs: [DrugWithIdentifiers!]!
}

# Drug with drug identifiers
type DrugWithIdentifiers {
  drugId: String
  drugFromSource: String

  # Drug entity
  drug: Drug
}
buniello commented 3 months ago

I can see the FE change in dev and looking good

ireneisdoomed commented 3 months ago

Looking great! The genes with the maximum number of drugs in combination are ENSG00000244122 or ENSG00000167165, everything looks nice.

@gjmcn Take into account this will have an impact in the variant page work (comment)

prashantuniyal02 commented 3 months ago

released in Platform 24.06

Image