opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0
12 stars 2 forks source link

Enrichment of L2G genetics evidence in approved drug targets #3346

Closed xyg123 closed 3 months ago

xyg123 commented 5 months ago

As a user, I want to perform a drug target enrichment analysis using genetic evidence to replicate the analysis done in the paper Nature Genetics, 2021 (Extended Figure 7) because this validation step is crucial for ensuring the value of the l2g pipeline and should be repeated after updates to ensure improvements.

Background

It is important to validate whether drug targets are significantly enriched with genetic evidence compared to non-drug targets. This analysis follows the methodology used in the Nature Genetics 2021 paper, specifically Extended Figure 7. However, the exact input for this was not documented. To ensure reproducibility, it is essential to record how the input datasets are filtered and organised for this analysis.

The enrichment test is an important validation step to demonstrate the value of the l2g pipeline. Repeating this test after updates will help ensure that improvements in the pipeline are accurately reflected.

Tasks

How do we know the task is complete?

  1. When I review the analysis steps, I see a clear, step-by-step outline of the data processing, filtering, and statistical testing methods.
  2. When I compare the results under different "genetic evidence" criteria, I see a comprehensive comparison with conclusions drawn from the findings.
xyg123 commented 5 months ago

Loading and parsing the chembl evidence dataset:

From the Target - Disease evidence platform file, we can load and obtain 624,491 EFO-target-drugId-clinicalPhase evidence strings.

Since we are only interested in the maximum clinical phase reached for each target, regardless of the drugId used, we can collapse this into 120,426 EFO-target-maximumClinicalPhase entries.

Keep only the maximum clinical phase reached
``` windowSpec = Window.partitionBy("diseaseFromSourceMappedId", "targetId") chembl_evidence=session.spark.read.parquet( "platform_evidence/evidence/sourceId_chembl").select( "targetId", "drugId", "clinicalPhase", "diseaseFromSourceMappedId", "diseaseFromSource", "clinicalStatus").withColumn( "maxClinicalPhase", f.max("clinicalPhase").over(windowSpec)).filter( f.col("clinicalPhase") == f.col("maxClinicalPhase")).drop( "clinicalPhase", "drugId").distinct() ```

Next, to ensure the Chembl evidence matches the EFOs found from L2G predictions, we propagate the EFOs through the ontology tree so we can ensure there is a match. We will keep the original EFO that we propagated from, to map back to the original 120,426 so that we don't overpopulate the background.

Propagation.
``` def diseasePropagation(diseases): return diseases.select("id", "parents").withColumn( "diseaseIdPropagated", f.explode_outer(f.concat(f.array(f.col("id")), f.col("parents"))) ) Propagated_diseases=diseasePropagation(session.spark.read.parquet("diseases.parquet")) chembl_evidence_propagated=chembl_evidence.join( Propagated_diseases.withColumnRenamed("id", "diseaseFromSourceMappedId"), on="diseaseFromSourceMappedId", how="inner").withColumnRenamed( "diseaseFromSourceMappedId", "originalDiseaseFromSourceMappedId") ```

Now we can load the L2G predictions, join the studylocus to the study index and obtain their associated EFO terms (exploding when more than one EFO is assigned), in total there are 9,449,465 EFO-target-l2gScore evidence strings from the 2406 release. Joining the L2G predictions to the Chembl dataset, we get 59,271 EFO-target-l2gScore-maximumClinicalPhase entries.

Join L2G to Propagated Chembl.
``` propagated_efo_gene_chembl = prioritised_genes.withColumnRenamed( "diseaseFromSourceMappedId", "diseaseIdPropagated").join( chembl_evidence_propagated, on=["targetId", "diseaseIdPropagated"], how="inner").persist() gene_efo_genetics_chembl=propagated_efo_gene_chembl.select( f.col("targetId"), f.col("diseaseIdPropagated"), f.col("originalDiseaseFromSourceMappedId"), f.col("studyId"), f.col("studyLocusId"), f.col("score"), #l2g score f.col("traitFromSource"), f.col("maxClinicalPhase"), f.col("clinicalStatus") ).distinct() ```

Then we do a right join back to the original Chembl dataset, leaving us with 107,431 EFO-target-l2gScore-maximumClinicalPhase entries, of which 59,271 contains a l2g score.

Join back to original Chembl to obtain background.
``` gene_efo_genetics_chembl_bg=gene_efo_genetics_chembl.join( chembl_evidence.select( f.col("diseaseFromSourceMappedId").alias("originalDiseaseFromSourceMappedId"), f.col("targetId"), f.col("maxClinicalPhase")), on=["targetId", "originalDiseaseFromSourceMappedId", "maxClinicalPhase"], how="right").drop("drugId").distinct().persist() ```

From here, it remains to define a cutoff threshold for what l2g score qualifies as sufficient genetic evidence, and proceed to construct a 2x2 contigency table and perform the fisher's exact test.

xyg123 commented 5 months ago

The contingency table is constructed as shown below, to test if there is enrichment in reaching higher clinical trial phases when there is genetic evidence:

Successful trial Unsuccessful
Genetic support Successful trials with genetics support Unsuccessful trials with genetics
No genetics Successful trials WITHOUT genetics Unsuccessful trails without genetics
Construct contingency table + Fisher's exact test
``` import pandas as pd from scipy.stats import fisher_exact from scipy.stats import norm import numpy as np # Define cutoff cutoff = 0.5 # Add a column to indicate genetic support input_df = gene_efo_genetics_chembl_bg.withColumn( "geneticSupport", f.when(f.col("score") > cutoff, True).otherwise(False) ).persist() phases = [2, 3, 4] results = [] for phase in phases: print(f"Analyzing clinical phase: {phase}") # Calculate N_G and N_negG N_G = input_df.filter(f.col("geneticSupport") == True).count() N_negG = input_df.filter(f.col("geneticSupport") == False).count() # Calculate X_G and X_negG X_G = input_df.filter( (f.col("geneticSupport") == True) & (f.col("maxClinicalPhase") >= phase) ).count() X_negG = input_df.filter( (f.col("geneticSupport") == False) & (f.col("maxClinicalPhase") >= phase) ).count() # Create the contingency table contingency_table = [[X_G, N_G - X_G], [X_negG, N_negG - X_negG]] print(contingency_table) # Perform Fisher's Exact Test odds_ratio, p_value = fisher_exact(contingency_table) # Calculate confidence interval for odds ratio ln_or = np.log(odds_ratio) se_ln_or = np.sqrt(1/contingency_table[0][0] + 1/contingency_table[0][1] + 1/contingency_table[1][0] + 1/contingency_table[1][1]) z = norm.ppf(0.975) # For 95% CI ci_ln_low = ln_or - z * se_ln_or ci_ln_high = ln_or + z * se_ln_or ci_low = np.exp(ci_ln_low) ci_high = np.exp(ci_ln_high) # Store results results.append({ "clinicalPhase": str(phase)+"+", "odds_ratio": odds_ratio, "p_value": p_value, "ci_low": ci_low, "ci_high": ci_high }) # Convert results to DataFrame for display results_table = pd.DataFrame(results) print(results_table) ```

For the data processed above, with a l2g score threshold of 0.5, we obtain the following results:

[[6327, 397], [84450, 16257]]
[[5018, 1706], [50224, 50483]] [[4039, 2685], [33488, 67219]]

clinicalPhase odds_ratio p_value ci_low ci_high
2+ 3.067949 6.561228e-138 2.768269 3.400070
3+ 2.956552 0.000000e+00 2.794693 3.127785
4+ 3.019482 0.000000e+00 2.870687 3.175989