Handle drug combinations in pharmacogenetics

ireneisdoomed commented 8 months ago

As the data team, I want to handle pharmacogenetic evidence involving drug combinations where currently no drugId is mapped because this will ensure a more accurate representation of all pharmacogenetic evidence.

Background

Our current pharmacogenetics data model is structured to handle evidence involving a single variant/drug interaction. We have encountered a limitation where evidence involving drug combinations is not mapped to any drug IDs. CFTR w/ ivacaftor/lumacaftor is an example.

Important to note that this issue is really limited, I've only found 6 offending cases, all of them related with CFTR variation.

+--------------------------+----------------------+------+
|genotypeId                |drugFromSource        |drugId|
+--------------------------+----------------------+------+
|7_117559591_TCTT_TCTT,TCTT|ivacaftor / lumacaftor|null  |
|7_117559591_TCTT_T,T      |ivacaftor / tezacaftor|null  |
|7_117559591_TCTT_T,TCTT   |ivacaftor / tezacaftor|null  |
|7_117559591_TCTT_T,TCTT   |ivacaftor / lumacaftor|null  |
|7_117559591_TCTT_T,T      |ivacaftor / lumacaftor|null  |
|7_117559591_TCTT_TCTT,TCTT|ivacaftor / tezacaftor|null  |
+--------------------------+----------------------+------+

Proposed Solutions

Two potential solutions are considered:

Expanding drugFromSource and drugId fields to allow lists. This would be more future-proof, accommodating drug combinations but requires significant alterations to the data model.
Exploding evidenceinto separate records for each drug. This approach kind of duplicates the data, but it maintains its structure.

Given the rarity of the drug combinations, I think that exploding is the most practical solution. The text that accompanies the genotype clearly specifies that the response is observed for the combination, so this field should be taken into consideration as context.

Tasks

This is not meant to be done for 24.03.

[ ] If we all agree on this solution, we can talk to EVA in our next meeting for them to explode the evidence for 24.06.

Acceptance tests

When I go to the CFTR PGx widget, all evidence have a mapped drugId.

DSuveges commented 7 months ago

Even if the evidence is exploded to multiple drugId-s, I think it is possible to keep the drugFromSource as it is, so indicating the evidence is a combination in nature.

ireneisdoomed commented 7 months ago

We actually use the string drugFromSource in the drug mapping step to assign a drugId, I think it is more practical to explode.

DSuveges commented 7 months ago

Yes, we should explode, however it would be quite informative to know that these annotations are drug combination, there should be some provenance for that.

ireneisdoomed commented 5 months ago

After a discussion with @DSuveges and the EVA team, we have decided that we want to keep an accurate definition of the drugs involved in the PGx evidence. This means that we will capture the scenario where a genotype affects the response to a combination of drugs (as opposed to only allow for single drugs)

This means that we are changing the schema:

drugFromSourceId is dropped. CHEBIs won't be submitted anymore (see https://github.com/opentargets/json_schema/pull/194 for context)
drugId and drugFromSource will now be encapsulated as a struct. So these fields are removed
drugs is added. This is a list of objects with 2 keys:
- drugFromSource: name of the drug at source
- drugId: CHEMBL ID that will be populated by us later on in the drug mapping step

The changes in the schema have downstream implications:

@opentargets/be-team

[ ] [Pharmacogenomics.scala ](https://github.com/opentargets/platform-etl-backend/blob/master/src/main/scala/io/opentargets/etl/backend/pharmacogenomics/Pharmacogenomics.scala is affected. You need to transform the df before extracting the mapping, and to maintain the struct
[ ] We need to propagate the schema changes to the API endpoint

I'm very happy to sit next week with someone of the team to get this done. It should be straightforward

@opentargets/fe-team

[ ] Accommodate the API changes in the UI. Take CFTR as an example, where we will have evidence with 2 drugs: ivacaftor and lumacaftor. The expectation is that I can click on both drug names, linking to the respective drug page.

cc @prashantuniyal02

mbdebian commented 4 months ago

I'm looking at the new pharmacogenomics dataset as collected by PIS:

root
 |-- datasourceId: string (nullable = true)
 |-- datasourceVersion: string (nullable = true)
 |-- datatypeId: string (nullable = true)
 |-- directionality: string (nullable = true)
 |-- drugs: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- drugFromSource: string (nullable = true)
 |-- evidenceLevel: string (nullable = true)
 |-- genotype: string (nullable = true)
 |-- genotypeAnnotationText: string (nullable = true)
 |-- genotypeId: string (nullable = true)
 |-- haplotypeFromSourceId: string (nullable = true)
 |-- haplotypeId: string (nullable = true)
 |-- literature: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- pgxCategory: string (nullable = true)
 |-- phenotypeFromSourceId: string (nullable = true)
 |-- phenotypeText: string (nullable = true)
 |-- studyId: string (nullable = true)
 |-- targetFromSourceId: string (nullable = true)
 |-- variantFunctionalConsequenceId: string (nullable = true)
 |-- variantRsId: string (nullable = true)

I'll review the ETL step pointed out by @ireneisdoomed

ireneisdoomed commented 4 months ago

@mbdebian After looking at the outputs after the changes in the drug mapping part of the ETL, I think they need a revision.

I observed missing rows between the PGx input (31,618) and the ETL output (24,573). This discrepancy is due to the way drug information is handled here:

Exploding Drug Arrays: The ETL process explodes arrays of drugs associated with each evidence record, creating separate rows for each drug. This explosion should be very small, given that the majority of evidence inform about a single drug.
Grouping and Aggregation: The data is then grouped based on all columns except the drug information, and the drug values are aggregated back into a list.

This approach merges evidence that should be kept separate. Let me explain more graphically:	evidenceId	response
1	trait_a	a
2	trait_a	b
3	trait_a	c, d

1, 2, 3 inform about the same effect given the same variant, the only thing that changes is the exposure (the drug) after which the response is observed
1, 2 inform about a response that is observed when you take drug a OR drug b
3 informa about the response that is observed when you take drug c AND drug d in combination

The current transformation loses this distinction, resulting in a single evidence record that lists all drugs associated with a given response, independently of whether they are individual or combinatorial:	evidenceId	response	drugs
1	trait_a	a, b, c, d

For this reason, I'd suggest revising this step and keeping the drug array structure and use another strategy like transform to add the drug ID or adding an ID just before exploding to uniquely identify each evidence. For QC, I suggest you check the sizes of the drugs array, that's how I identified the issue:

# Post ETL 
+-----------+-----+
|size(drugs)|count|
+-----------+-----+
|          1|20600|
|          6|   79|
|          3|  849|
|          4|  268|
|          8|    3|
|          5|  126|
|          2| 2453|
|          7|  165|
|         10|   27|
|         13|    3|
+-----------+-----+

# Pre ETL
+-----------+-----+
|size(drugs)|count|
+-----------+-----+
|          1|31516|
|          2|  100|
|          3|    2|
+-----------+-----+

mbdebian commented 4 months ago

I have changed the code to use an operational row ID, and after running the pharmacogenomics step, I can confirm that the drug count stays the same:

Output

scala> outPharma.withColumn("drugsSize", size(col("drugs"))).groupBy("drugsSize").agg(count("*").as("count")).orderBy(col("count").desc).show
+---------+-----+
|drugsSize|count|
+---------+-----+
|        1|42358|
|        2|  164|
|        3|    2|
+---------+-----+

Original pharmacogenomics input dataset

scala> original.withColumn("drugsSize", size(col("drugs"))).groupBy("drugsSize").agg(count("*").as("count")).orderBy(col("count").desc).show
+---------+-----+
|drugsSize|count|
+---------+-----+
|        1|42358|
|        2|  164|
|        3|    2|
+---------+-----+

The output for this run just for pharmacognomics can be found at gs://ot-team/mbernal/data/issue_3205/output

mbdebian commented 4 months ago

I just corrected the bug in the grouping operation, results are in the same place gs://ot-team/mbernal/data/issue_3205/output, the schema is now correct and the cardinality preserved.

And I have also run ETL just for pharmacogenomics, so the data at gs://open-targets-pre-data-releases/24.06/output/etl/**/pharmacogenomics is up to date.

mbdebian commented 4 months ago

ETL changes completed, now it needs to be addressed at the API

jdhayhurst commented 4 months ago

Hi, the only issue I can see in implementing this into the API is that currently the API resolves a drug field, which is a Drug type object, by doing a lookup on drugId. Now that we have multiple drugId's, we break the resolver which is going from one drugId to one drug. Do we need to keep drug in the pharmacogenomics object (I don't think it's actually being used by the frontend)? If so, how should we handle multiple drug objects (I'd have thought a separate list of drugs would be sufficient)? Thanks!

ireneisdoomed commented 4 months ago

Based on a conversation offline with @jdhayhurst we have set that:

The reason the PGx endpoint has drugId, drugFromSource and drug is to account for the cases where we couldn't map the drug to any ID
The front end is not using drug at all, it is just displaying drugFromSource and linking to the respective drug page with drugId
We want to use the drug object every time we want to represent a drug

Given this, we propose a new type to represent drugs DrugWithIdentifiers { drug: Drug, drugFromSource: String, drugId: String }. This object is adaptable to both scenarios:

I have a drugFromSource that is mapped therefore drug can be resolved
I have a drugFromSource that is not mapped therefore i can just show drugFromSource

Actions:

[x] Change the representation of the drugs in the Pharmacogenomics endpoint to drugs: Seq[DrugWithIdentifiers]
[ ] (Not directly related to this issue, but the situation is practically the same) Change the drug representation in the ChemicalProbes endpoint to id: DrugWithIdentifiers and delete drugId

jdhayhurst commented 3 months ago

The data model has now been updated:


type Pharmacogenomics {
  datasourceId: String
  datatypeId: String
  evidenceLevel: String
  genotype: String
  genotypeAnnotationText: String
  genotypeId: String
  haplotypeFromSourceId: String
  haplotypeId: String
  literature: [String!]
  pgxCategory: String
  phenotypeFromSourceId: String
  phenotypeText: String
  studyId: String
  targetFromSourceId: String
  variantFunctionalConsequenceId: String
  variantRsId: String
  isDirectTarget: Boolean!
  variantFunctionalConsequence: SequenceOntologyTerm

  # Target entity
  target: Target

  # Drug List
  drugs: [DrugWithIdentifiers!]!
}

# Drug with drug identifiers
type DrugWithIdentifiers {
  drugId: String
  drugFromSource: String

  # Drug entity
  drug: Drug
}

buniello commented 3 months ago

I can see the FE change in dev and looking good

ireneisdoomed commented 3 months ago

Looking great! The genes with the maximum number of drugs in combination are ENSG00000244122 or ENSG00000167165, everything looks nice.

@gjmcn Take into account this will have an impact in the variant page work (comment)

prashantuniyal02 commented 3 months ago

released in Platform 24.06

opentargets / issues