Enrichment of L2G genetics evidence in approved drug targets

xyg123 commented 5 months ago

As a user, I want to perform a drug target enrichment analysis using genetic evidence to replicate the analysis done in the paper Nature Genetics, 2021 (Extended Figure 7) because this validation step is crucial for ensuring the value of the l2g pipeline and should be repeated after updates to ensure improvements.

Background

It is important to validate whether drug targets are significantly enriched with genetic evidence compared to non-drug targets. This analysis follows the methodology used in the Nature Genetics 2021 paper, specifically Extended Figure 7. However, the exact input for this was not documented. To ensure reproducibility, it is essential to record how the input datasets are filtered and organised for this analysis.

The enrichment test is an important validation step to demonstrate the value of the l2g pipeline. Repeating this test after updates will help ensure that improvements in the pipeline are accurately reflected.

Tasks

[x] Outline the input datasets and their organisation, particularly focusing on the ChEMBL drug information and genetic evidence.
[x] Detail the analysis process, including data merging, filtering criteria, and the construction of contingency tables.
[x] Compare results using different criteria for "genetic evidence" to evaluate the robustness of the findings.
[x] Summarise the interpretation of the results, including the statistical significance and biological relevance.
Acceptance tests

How do we know the task is complete?

When I review the analysis steps, I see a clear, step-by-step outline of the data processing, filtering, and statistical testing methods.
When I compare the results under different "genetic evidence" criteria, I see a comprehensive comparison with conclusions drawn from the findings.

xyg123 commented 5 months ago

Loading and parsing the chembl evidence dataset:

From the Target - Disease evidence platform file, we can load and obtain 624,491 EFO-target-drugId-clinicalPhase evidence strings.

Since we are only interested in the maximum clinical phase reached for each target, regardless of the drugId used, we can collapse this into 120,426 EFO-target-maximumClinicalPhase entries.

Keep only the maximum clinical phase reached

``` windowSpec = Window.partitionBy("diseaseFromSourceMappedId", "targetId") chembl_evidence=session.spark.read.parquet( "platform_evidence/evidence/sourceId_chembl").select( "targetId", "drugId", "clinicalPhase", "diseaseFromSourceMappedId", "diseaseFromSource", "clinicalStatus").withColumn( "maxClinicalPhase", f.max("clinicalPhase").over(windowSpec)).filter( f.col("clinicalPhase") == f.col("maxClinicalPhase")).drop( "clinicalPhase", "drugId").distinct() ```

Next, to ensure the Chembl evidence matches the EFOs found from L2G predictions, we propagate the EFOs through the ontology tree so we can ensure there is a match. We will keep the original EFO that we propagated from, to map back to the original 120,426 so that we don't overpopulate the background.

Propagation.

``` def diseasePropagation(diseases): return diseases.select("id", "parents").withColumn( "diseaseIdPropagated", f.explode_outer(f.concat(f.array(f.col("id")), f.col("parents"))) ) Propagated_diseases=diseasePropagation(session.spark.read.parquet("diseases.parquet")) chembl_evidence_propagated=chembl_evidence.join( Propagated_diseases.withColumnRenamed("id", "diseaseFromSourceMappedId"), on="diseaseFromSourceMappedId", how="inner").withColumnRenamed( "diseaseFromSourceMappedId", "originalDiseaseFromSourceMappedId") ```

Now we can load the L2G predictions, join the studylocus to the study index and obtain their associated EFO terms (exploding when more than one EFO is assigned), in total there are 9,449,465 EFO-target-l2gScore evidence strings from the 2406 release. Joining the L2G predictions to the Chembl dataset, we get 59,271 EFO-target-l2gScore-maximumClinicalPhase entries.

Join L2G to Propagated Chembl.

``` propagated_efo_gene_chembl = prioritised_genes.withColumnRenamed( "diseaseFromSourceMappedId", "diseaseIdPropagated").join( chembl_evidence_propagated, on=["targetId", "diseaseIdPropagated"], how="inner").persist() gene_efo_genetics_chembl=propagated_efo_gene_chembl.select( f.col("targetId"), f.col("diseaseIdPropagated"), f.col("originalDiseaseFromSourceMappedId"), f.col("studyId"), f.col("studyLocusId"), f.col("score"), #l2g score f.col("traitFromSource"), f.col("maxClinicalPhase"), f.col("clinicalStatus") ).distinct() ```

Then we do a right join back to the original Chembl dataset, leaving us with 107,431 EFO-target-l2gScore-maximumClinicalPhase entries, of which 59,271 contains a l2g score.

Join back to original Chembl to obtain background.

``` gene_efo_genetics_chembl_bg=gene_efo_genetics_chembl.join( chembl_evidence.select( f.col("diseaseFromSourceMappedId").alias("originalDiseaseFromSourceMappedId"), f.col("targetId"), f.col("maxClinicalPhase")), on=["targetId", "originalDiseaseFromSourceMappedId", "maxClinicalPhase"], how="right").drop("drugId").distinct().persist() ```

From here, it remains to define a cutoff threshold for what l2g score qualifies as sufficient genetic evidence, and proceed to construct a 2x2 contigency table and perform the fisher's exact test.

xyg123 commented 5 months ago

The contingency table is constructed as shown below, to test if there is enrichment in reaching higher clinical trial phases when there is genetic evidence:

	Successful trial	Unsuccessful
Genetic support	Successful trials with genetics support	Unsuccessful trials with genetics
No genetics	Successful trials WITHOUT genetics	Unsuccessful trails without genetics

Construct contingency table + Fisher's exact test

``` import pandas as pd from scipy.stats import fisher_exact from scipy.stats import norm import numpy as np # Define cutoff cutoff = 0.5 # Add a column to indicate genetic support input_df = gene_efo_genetics_chembl_bg.withColumn( "geneticSupport", f.when(f.col("score") > cutoff, True).otherwise(False) ).persist() phases = [2, 3, 4] results = [] for phase in phases: print(f"Analyzing clinical phase: {phase}") # Calculate N_G and N_negG N_G = input_df.filter(f.col("geneticSupport") == True).count() N_negG = input_df.filter(f.col("geneticSupport") == False).count() # Calculate X_G and X_negG X_G = input_df.filter( (f.col("geneticSupport") == True) & (f.col("maxClinicalPhase") >= phase) ).count() X_negG = input_df.filter( (f.col("geneticSupport") == False) & (f.col("maxClinicalPhase") >= phase) ).count() # Create the contingency table contingency_table = [[X_G, N_G - X_G], [X_negG, N_negG - X_negG]] print(contingency_table) # Perform Fisher's Exact Test odds_ratio, p_value = fisher_exact(contingency_table) # Calculate confidence interval for odds ratio ln_or = np.log(odds_ratio) se_ln_or = np.sqrt(1/contingency_table[0][0] + 1/contingency_table[0][1] + 1/contingency_table[1][0] + 1/contingency_table[1][1]) z = norm.ppf(0.975) # For 95% CI ci_ln_low = ln_or - z * se_ln_or ci_ln_high = ln_or + z * se_ln_or ci_low = np.exp(ci_ln_low) ci_high = np.exp(ci_ln_high) # Store results results.append({ "clinicalPhase": str(phase)+"+", "odds_ratio": odds_ratio, "p_value": p_value, "ci_low": ci_low, "ci_high": ci_high }) # Convert results to DataFrame for display results_table = pd.DataFrame(results) print(results_table) ```

For the data processed above, with a l2g score threshold of 0.5, we obtain the following results:

[[6327, 397], [84450, 16257]]
[[5018, 1706], [50224, 50483]] [[4039, 2685], [33488, 67219]]

clinicalPhase	odds_ratio	p_value	ci_low	ci_high
2+	3.067949	6.561228e-138	2.768269	3.400070
3+	2.956552	0.000000e+00	2.794693	3.127785
4+	3.019482	0.000000e+00	2.870687	3.175989

opentargets / issues

Enrichment of L2G genetics evidence in approved drug targets #3346

Background

Tasks

Acceptance tests