Closed xyg123 closed 3 months ago
Loading and parsing the ChEMBL evidence dataset:
From the platform's target-disease evidence file, we load 624,491 EFO-target-drugId-clinicalPhase evidence strings.
Since we are only interested in the maximum clinical phase reached for each EFO-target pair, regardless of the drugId used, we can collapse this into 120,426 EFO-target-maximumClinicalPhase entries.
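The collapse step can be sketched in plain Python. This is only an illustration on toy data: the identifiers, the tuple layout, and the helper name `collapse_to_max_phase` are all made up here and are not taken from the actual pipeline.

```python
# Toy evidence rows: (efo_id, target_id, drug_id, clinical_phase).
# All identifiers are hypothetical placeholders.
evidence = [
    ("EFO_A", "ENSG_1", "CHEMBL_X", 2),
    ("EFO_A", "ENSG_1", "CHEMBL_Y", 4),
    ("EFO_A", "ENSG_2", "CHEMBL_X", 1),
    ("EFO_B", "ENSG_1", "CHEMBL_Z", 3),
]


def collapse_to_max_phase(rows):
    """Keep the highest clinical phase per (EFO, target), ignoring the drug."""
    max_phase = {}
    for efo_id, target_id, _drug_id, phase in rows:
        key = (efo_id, target_id)
        max_phase[key] = max(phase, max_phase.get(key, phase))
    return max_phase


collapsed = collapse_to_max_phase(evidence)
for (efo_id, target_id), phase in sorted(collapsed.items()):
    print(efo_id, target_id, phase)
```

The two `EFO_A`/`ENSG_1` rows collapse into a single entry at phase 4, which is the same reduction that takes the 624,491 evidence strings down to 120,426 entries.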
Next, so that the ChEMBL evidence can be matched to the EFOs found in the L2G predictions, we propagate the ChEMBL EFOs through the ontology tree. We keep the original EFO that each propagated term came from, so that matches can be mapped back to the original 120,426 entries without overpopulating the background.
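A minimal sketch of the propagation idea follows. It assumes propagation goes from each term up to its ancestors, which the description above does not state explicitly, and it uses a hypothetical child-to-parents map in place of the real EFO ontology. The field names (`originalEfo`, `propagatedEfo`) are illustrative only.

```python
# Hypothetical child -> parents map standing in for the EFO ontology.
parents = {
    "EFO_CHILD": ["EFO_MID"],
    "EFO_MID": ["EFO_ROOT"],
    "EFO_ROOT": [],
}


def ancestors_and_self(efo_id, parents):
    """Return the term itself plus every ancestor reachable through `parents`."""
    seen = set()
    stack = [efo_id]
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(parents.get(term, []))
    return seen


def propagate(collapsed, parents):
    """Expand each (EFO, target) entry to its ancestors, keeping the original EFO."""
    rows = []
    for (efo_id, target_id), phase in collapsed.items():
        for term in sorted(ancestors_and_self(efo_id, parents)):
            rows.append(
                {
                    "originalEfo": efo_id,
                    "propagatedEfo": term,
                    "target": target_id,
                    "maxPhase": phase,
                }
            )
    return rows


propagated = propagate({("EFO_CHILD", "ENSG_1"): 3}, parents)
```

Carrying `originalEfo` on every propagated row is what allows a later deduplication back to one row per original entry, so the background is not inflated by the expansion.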
Now we load the L2G predictions, join the studylocus to the study index, and obtain the associated EFO terms (exploding when more than one EFO is assigned). In total, there are 9,449,465 EFO-target-l2gScore evidence strings from the 2406 release. Joining the L2G predictions to the ChEMBL dataset gives 59,271 EFO-target-l2gScore-maximumClinicalPhase entries.
Then we do a right join back to the original ChEMBL dataset, leaving us with 107,431 EFO-target-l2gScore-maximumClinicalPhase entries, of which 59,271 contain an L2G score.
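The right join can be illustrated on toy data as below. This is a sketch under assumptions: the identifiers are invented, and taking the maximum L2G score when several exist for the same EFO-target pair is a choice made for this example, not something stated above.

```python
# Toy ChEMBL side: (efo, target) -> maximum clinical phase.
chembl = {
    ("EFO_A", "ENSG_1"): 4,
    ("EFO_A", "ENSG_2"): 1,
    ("EFO_B", "ENSG_1"): 3,
}

# Toy L2G side: (efo, target, l2g_score); may contain pairs absent from ChEMBL.
l2g = [
    ("EFO_A", "ENSG_1", 0.8),
    ("EFO_A", "ENSG_1", 0.3),
    ("EFO_C", "ENSG_9", 0.9),
]


def right_join_on_chembl(l2g_rows, chembl):
    """Keep every ChEMBL entry; attach the best L2G score, or None if absent."""
    best_score = {}
    for efo, target, score in l2g_rows:
        key = (efo, target)
        best_score[key] = max(score, best_score.get(key, score))
    return [
        {
            "efo": efo,
            "target": target,
            "maxPhase": phase,
            "l2gScore": best_score.get((efo, target)),
        }
        for (efo, target), phase in chembl.items()
    ]


joined = right_join_on_chembl(l2g, chembl)
with_score = [row for row in joined if row["l2gScore"] is not None]
print(len(joined), len(with_score))
```

Every ChEMBL entry survives the join, and entries without a matching prediction carry `None` as their score, which is how they end up in the "no genetic support" background.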
From here, it remains to define the L2G score cutoff that qualifies as sufficient genetic evidence, construct a 2x2 contingency table, and perform Fisher's exact test.
The contingency table is constructed as shown below, to test if there is enrichment in reaching higher clinical trial phases when there is genetic evidence:
Genetic evidence | Successful trial | Unsuccessful trial |
---|---|---|
Genetic support | Successful trials with genetic support | Unsuccessful trials with genetic support |
No genetic support | Successful trials without genetic support | Unsuccessful trials without genetic support |
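As a sketch of how such a table could be counted from the joined entries: the function below is hypothetical, and it assumes that "genetic support" means a score at or above the threshold and that "successful" means the maximum clinical phase is at or above a chosen phase. Whether the real analysis uses `>=` or `>` is not stated here.

```python
def contingency_table(rows, l2g_threshold, min_phase):
    """Count rows into [[support+success, support+fail], [none+success, none+fail]]."""
    table = [[0, 0], [0, 0]]
    for row in rows:
        score = row["l2gScore"]
        has_genetics = score is not None and score >= l2g_threshold
        successful = row["maxPhase"] >= min_phase
        table[0 if has_genetics else 1][0 if successful else 1] += 1
    return table


# Toy joined entries; a missing L2G score is represented as None.
rows = [
    {"l2gScore": 0.8, "maxPhase": 4},
    {"l2gScore": 0.6, "maxPhase": 1},
    {"l2gScore": 0.2, "maxPhase": 3},
    {"l2gScore": None, "maxPhase": 2},
    {"l2gScore": None, "maxPhase": 1},
]

table = contingency_table(rows, l2g_threshold=0.5, min_phase=2)
print(table)
```

Entries with a score below the threshold are counted in the same row as entries with no score at all, since neither qualifies as genetic support.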
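For the data processed above, with an L2G score threshold of 0.5, we obtain the following contingency tables for clinical phases 2+, 3+, and 4+, in that order (rows: genetic support, no genetic support; columns: successful, unsuccessful), followed by the test results: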
[[6327, 397], [84450, 16257]]
[[5018, 1706], [50224, 50483]]
[[4039, 2685], [33488, 67219]]
clinicalPhase | odds_ratio | p_value | ci_low | ci_high |
---|---|---|---|---|
2+ | 3.067949 | 6.561228e-138 | 2.768269 | 3.400070 |
3+ | 2.956552 | 0.000000e+00 | 2.794693 | 3.127785 |
4+ | 3.019482 | 0.000000e+00 | 2.870687 | 3.175989 |
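The odds ratios can be recomputed directly from the three contingency tables with the standard library. The confidence interval below is a Wald interval on the log odds ratio; the reported `ci_low` and `ci_high` values appear consistent with that method, but this is an inference from the numbers rather than a documented choice. The p-values are not reproduced, since they come from Fisher's exact test.

```python
import math


def odds_ratio_with_ci(table, z=1.959964):
    """Return (odds_ratio, ci_low, ci_high) for a 2x2 table [[a, b], [c, d]].

    Uses a 95% Wald interval on the log odds ratio (assumed method).
    """
    (a, b), (c, d) = table
    odds_ratio = (a * d) / (b * c)
    standard_error = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = math.log(odds_ratio)
    return (
        odds_ratio,
        math.exp(log_or - z * standard_error),
        math.exp(log_or + z * standard_error),
    )


# Contingency tables reported above, keyed by clinical phase threshold.
tables = {
    "2+": [[6327, 397], [84450, 16257]],
    "3+": [[5018, 1706], [50224, 50483]],
    "4+": [[4039, 2685], [33488, 67219]],
}

results = {phase: odds_ratio_with_ci(table) for phase, table in tables.items()}
for phase, (odds_ratio, ci_low, ci_high) in results.items():
    print(f"{phase}: OR={odds_ratio:.6f} CI=({ci_low:.6f}, {ci_high:.6f})")
```

Each table sums to 107,431 entries, matching the size of the joined dataset, which is a quick consistency check worth keeping when the analysis is rerun on a new release.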
As a user, I want to perform a drug target enrichment analysis using genetic evidence, replicating the analysis in the Nature Genetics 2021 paper (Extended Figure 7), because this validation step is crucial for demonstrating the value of the L2G pipeline and should be repeated after updates to confirm that they are improvements.
Background
It is important to validate whether drug targets are significantly enriched with genetic evidence compared to non-drug targets. This analysis follows the methodology used in the Nature Genetics 2021 paper, specifically Extended Figure 7. However, the exact input for this was not documented. To ensure reproducibility, it is essential to record how the input datasets are filtered and organised for this analysis.
The enrichment test is an important validation step for demonstrating the value of the L2G pipeline. Repeating it after updates will help confirm that pipeline improvements are reflected in the results.
Tasks
Acceptance tests
How do we know the task is complete?