Closed ireneisdoomed closed 3 months ago
Even if the evidence is exploded to multiple drugId
-s, I think it is possible to keep the drugFromSource
as it is, so indicating the evidence is a combination in nature.
We actually use the string drugFromSource
in the drug mapping step to assign a drugId
, I think it is more practical to explode.
Yes, we should explode, however it would be quite informative to know that these annotations are drug combination, there should be some provenance for that.
After a discussion with @DSuveges and the EVA team, we have decided that we want to keep an accurate definition of the drugs involved in the PGx evidence. This means that we will capture the scenario where a genotype affects the response to a combination of drugs (as opposed to only allow for single drugs)
This means that we are changing the schema:
drugFromSourceId
is dropped. CHEBIs won't be submitted anymore (see https://github.com/opentargets/json_schema/pull/194 for context)drugId
and drugFromSource
will now be encapsulated as a struct. So these fields are removeddrugs
is added. This is a list of objects with 2 keys:
drugFromSource
: name of the drug at sourcedrugId
: CHEMBL ID that will be populated by us later on in the drug mapping step The changes in the schema have downstream implications:
@opentargets/be-team
I'm very happy to sit next week with someone of the team to get this done. It should be straightforward
@opentargets/fe-team
cc @prashantuniyal02
I'm looking at the new pharmacogenomics dataset as collected by PIS:
root
|-- datasourceId: string (nullable = true)
|-- datasourceVersion: string (nullable = true)
|-- datatypeId: string (nullable = true)
|-- directionality: string (nullable = true)
|-- drugs: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- drugFromSource: string (nullable = true)
|-- evidenceLevel: string (nullable = true)
|-- genotype: string (nullable = true)
|-- genotypeAnnotationText: string (nullable = true)
|-- genotypeId: string (nullable = true)
|-- haplotypeFromSourceId: string (nullable = true)
|-- haplotypeId: string (nullable = true)
|-- literature: array (nullable = true)
| |-- element: string (containsNull = true)
|-- pgxCategory: string (nullable = true)
|-- phenotypeFromSourceId: string (nullable = true)
|-- phenotypeText: string (nullable = true)
|-- studyId: string (nullable = true)
|-- targetFromSourceId: string (nullable = true)
|-- variantFunctionalConsequenceId: string (nullable = true)
|-- variantRsId: string (nullable = true)
I'll review the ETL step pointed out by @ireneisdoomed
@mbdebian After looking at the outputs after the changes in the drug mapping part of the ETL, I think they need a revision.
I observed missing rows between the PGx input (31,618) and the ETL output (24,573). This discrepancy is due to the way drug information is handled here:
This approach merges evidence that should be kept separate. Let me explain more graphically: | evidenceId | response | drugs |
---|---|---|---|
1 | trait_a | a | |
2 | trait_a | b | |
3 | trait_a | c, d |
The current transformation loses this distinction, resulting in a single evidence record that lists all drugs associated with a given response, independently of whether they are individual or combinatorial: | evidenceId | response | drugs |
---|---|---|---|
1 | trait_a | a, b, c, d |
For this reason, I'd suggest revising this step and keeping the drug array structure and use another strategy like transform
to add the drug ID or adding an ID just before exploding to uniquely identify each evidence.
For QC, I suggest you check the sizes of the drugs
array, that's how I identified the issue:
# Post ETL
+-----------+-----+
|size(drugs)|count|
+-----------+-----+
| 1|20600|
| 6| 79|
| 3| 849|
| 4| 268|
| 8| 3|
| 5| 126|
| 2| 2453|
| 7| 165|
| 10| 27|
| 13| 3|
+-----------+-----+
# Pre ETL
+-----------+-----+
|size(drugs)|count|
+-----------+-----+
| 1|31516|
| 2| 100|
| 3| 2|
+-----------+-----+
I have changed the code to use an operational row ID, and after running the pharmacogenomics step, I can confirm that the drug count stays the same:
Output
scala> outPharma.withColumn("drugsSize", size(col("drugs"))).groupBy("drugsSize").agg(count("*").as("count")).orderBy(col("count").desc).show
+---------+-----+
|drugsSize|count|
+---------+-----+
| 1|42358|
| 2| 164|
| 3| 2|
+---------+-----+
Original pharmacogenomics input dataset
scala> original.withColumn("drugsSize", size(col("drugs"))).groupBy("drugsSize").agg(count("*").as("count")).orderBy(col("count").desc).show
+---------+-----+
|drugsSize|count|
+---------+-----+
| 1|42358|
| 2| 164|
| 3| 2|
+---------+-----+
The output for this run just for pharmacognomics can be found at gs://ot-team/mbernal/data/issue_3205/output
I just corrected the bug in the grouping operation, results are in the same place
gs://ot-team/mbernal/data/issue_3205/output
, the schema is now correct and the cardinality preserved.
And I have also run ETL just for pharmacogenomics, so the data at gs://open-targets-pre-data-releases/24.06/output/etl/**/pharmacogenomics
is up to date.
ETL changes completed, now it needs to be addressed at the API
Hi, the only issue I can see in implementing this into the API is that currently the API resolves a drug
field, which is a Drug type object, by doing a lookup on drugId
. Now that we have multiple drugId's, we break the resolver which is going from one drugId
to one drug
. Do we need to keep drug
in the pharmacogenomics object (I don't think it's actually being used by the frontend)? If so, how should we handle multiple drug objects (I'd have thought a separate list of drugs would be sufficient)? Thanks!
Based on a conversation offline with @jdhayhurst we have set that:
drugId
, drugFromSource
and drug
is to account for the cases where we couldn't map the drug to any IDdrug
at all, it is just displaying drugFromSource
and linking to the respective drug page with drugId
drug
object every time we want to represent a drugGiven this, we propose a new type to represent drugs DrugWithIdentifiers { drug: Drug, drugFromSource: String, drugId: String }
. This object is adaptable to both scenarios:
drug
can be resolveddrugFromSource
Actions:
Pharmacogenomics
endpoint to drugs: Seq[DrugWithIdentifiers]
ChemicalProbes
endpoint to id: DrugWithIdentifiers
and delete drugId
The data model has now been updated:
type Pharmacogenomics {
datasourceId: String
datatypeId: String
evidenceLevel: String
genotype: String
genotypeAnnotationText: String
genotypeId: String
haplotypeFromSourceId: String
haplotypeId: String
literature: [String!]
pgxCategory: String
phenotypeFromSourceId: String
phenotypeText: String
studyId: String
targetFromSourceId: String
variantFunctionalConsequenceId: String
variantRsId: String
isDirectTarget: Boolean!
variantFunctionalConsequence: SequenceOntologyTerm
# Target entity
target: Target
# Drug List
drugs: [DrugWithIdentifiers!]!
}
# Drug with drug identifiers
type DrugWithIdentifiers {
drugId: String
drugFromSource: String
# Drug entity
drug: Drug
}
Looking great! The genes with the maximum number of drugs in combination are ENSG00000244122 or ENSG00000167165, everything looks nice.
@gjmcn Take into account this will have an impact in the variant page work (comment)
released in Platform 24.06
As the data team, I want to handle pharmacogenetic evidence involving drug combinations where currently no drugId is mapped because this will ensure a more accurate representation of all pharmacogenetic evidence.
Background
Our current pharmacogenetics data model is structured to handle evidence involving a single variant/drug interaction. We have encountered a limitation where evidence involving drug combinations is not mapped to any drug IDs. CFTR w/ ivacaftor/lumacaftor is an example.
Important to note that this issue is really limited, I've only found 6 offending cases, all of them related with CFTR variation.
Proposed Solutions
Two potential solutions are considered:
drugFromSource
anddrugId
fields to allow lists. This would be more future-proof, accommodating drug combinations but requires significant alterations to the data model.Given the rarity of the drug combinations, I think that exploding is the most practical solution. The text that accompanies the genotype clearly specifies that the response is observed for the combination, so this field should be taken into consideration as context.
Tasks
This is not meant to be done for 24.03.
Acceptance tests