opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

QC locus to gene gold standards processing #3157

Closed ireneisdoomed closed 6 months ago

ireneisdoomed commented 10 months ago

The locus to gene gold standard is a curated dataset of GWAS loci with known gene-trait associations. It integrates expert curation, experimental data, and drug-target correlations to identify genes responsible for disease traits. Having a trustworthy gold standard is crucial for training algorithms to predict causal genes from genetic association data.

After a first implementation of the new pipeline, I've extracted some metrics for QC purposes. The code is here: https://gist.github.com/ireneisdoomed/0fae75382e4541663838a05efd4b0412

| Dataset | # studyLocusId | # geneId | Positive GS Status | Negative GS Status |
|---|---|---|---|---|
| Raw gold standard | 1,201 | 451 | - | - |
| Parsed gold standard | 1,120 | 403 | 2,990,893 | 0 |
| Annotated gold standard for training | 14 | 13 | 57,898 | 0 |

I'll describe the issues identified in the following comments.

ireneisdoomed commented 10 months ago

Size of annotated gold standard for training is absurd

After joining with the feature matrix, the number of distinct studyLocusId values drops from 1,120 in the parsed gold standard to 14, with only 13 distinct genes represented. As a result, any prediction returned by L2G is not valid at this point.

Context

This is what the gold standard annotation process looks like:

image

The feature matrix is built from the full set of studyLocus, with features aggregated at the studyLocusId level. The reduction therefore points to a mismatch between the studyLocusId values in the feature matrix and those in the gold standard.

Why do we join the gold standards with the feature matrix, instead of annotating all studyLocusId in the gold standard?
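To make the question concrete, here is a minimal sketch in plain Python that compares how many gold standard identifiers an inner join would keep versus drop. The `overlap_report` helper and the identifiers are hypothetical and only for illustration; in the pipeline the same comparison would be done with Spark joins rather than set operations.

```python
def overlap_report(gold_standard_ids, feature_matrix_ids):
    """Summarise how many gold standard studyLocusIds survive an inner join."""
    gs = set(gold_standard_ids)
    fm = set(feature_matrix_ids)
    kept = gs & fm      # rows an inner join would keep
    dropped = gs - fm   # rows only a left join from the gold standard would keep
    return {
        "gold_standard": len(gs),
        "kept_by_inner_join": len(kept),
        "dropped_by_inner_join": len(dropped),
        "fraction_kept": len(kept) / len(gs) if gs else 0.0,
    }


# Toy identifiers, not real data.
report = overlap_report(
    gold_standard_ids=["sl1", "sl2", "sl3", "sl4"],
    feature_matrix_ids=["sl2", "sl4", "sl9"],
)
print(report)
```

Listing the dropped identifiers, rather than only counting them, would show whether the loss comes from missing studies or from differently built studyLocusId values.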

What is the problem?

  1. I don't expect a complete overlap between the gold standards and the study locus. My current study locus consists only of GWASCatalog top hits and FinnGen index variants.
  2. However, the reduction is so big that there must be something else going on.

ireneisdoomed commented 10 months ago

Imbalance Positive/Negative GS Status

All study/locus/gene trios have a positive label assigned. This contradicts what I saw when testing the pipeline with the production data, where the labels were heavily skewed towards negatives.

Context

We need a balanced training set for L2G to learn to distinguish the features of causal vs non-causal genes. In L2G, any gene within a 500kb window of a known positive locus that is not in the positives list is initially classified as negative.
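The negative-labelling rule can be sketched as follows. This is an illustrative plain-Python version, not the pipeline code: the `label_genes` helper, the use of a single gene coordinate (assumed here to be the TSS), and the inclusive window boundary are all assumptions.

```python
WINDOW = 500_000  # 500kb window around a known positive locus


def label_genes(locus_position, positive_gene_ids, gene_positions, window=WINDOW):
    """Label genes around one known positive locus.

    gene_positions maps geneId -> position on the same chromosome
    (assumed to be the TSS; the actual coordinate used is not stated).
    """
    labels = {}
    for gene_id, position in gene_positions.items():
        if gene_id in positive_gene_ids:
            labels[gene_id] = "positive"
        elif abs(position - locus_position) <= window:
            labels[gene_id] = "negative"
        # Genes outside the window are left unlabelled.
    return labels


# Toy coordinates, not real data.
labels = label_genes(
    locus_position=1_000_000,
    positive_gene_ids={"geneA"},
    gene_positions={"geneA": 1_050_000, "geneB": 1_400_000, "geneC": 2_000_000},
)
print(labels)
```

Under a rule like this, an empty or mismatched distance input produces no negatives at all, so a negative count of 0 is a useful signal to check the inputs first.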

What is the problem

  1. Since distance is the main proxy for deriving negative evidence, my hypothesis is that something is wrong in one of the input files (probably V2G) used to calculate it. Since I was getting reasonable results before, I think the problem is in the input data rather than in the business logic.

ireneisdoomed commented 10 months ago

Reduction in high quality associations and genes

There is a slight decrease from 1201 high-quality associations in the raw gold standard to 1120 in the parsed gold standard. The number of distinct genes has also decreased from 451 to 403.

What is the problem

  1. The parsing process that converts raw data into an L2GGoldStandard datatype might have issues in the business logic that make us lose potentially valuable data.

ireneisdoomed commented 10 months ago

Investigation on the overlap between gold standards and StudyLocus after GS fixes

With the latest fixes in https://github.com/opentargets/genetics_etl_python/pull/255, I have new data to see if the overlap has increased. Inputs:

| Raw Gold Standard | # Positives | # Negatives |
|---|---|---|
| Starting gold standard | 1225 | 15282 |
| Non overlapping associations between GS and StudyLocus | 992 | 886 |
| Non overlapping variants between GS and StudyLocus | 65 | 15 |
| Non overlapping studies between GS and StudyLocus | 908 | 818 |
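A breakdown like the one above can be sketched by checking each gold standard row against StudyLocus at the association, study, and variant level. The `non_overlap_counts` helper below is hypothetical and counts rows; the figures above may have been computed differently (for example, over distinct studies).

```python
def non_overlap_counts(gold_standard, study_locus):
    """Count gold standard rows missing from StudyLocus at three levels.

    Both inputs are iterables of (studyId, variantId) pairs.
    """
    gold_standard = list(gold_standard)
    sl_pairs = set(study_locus)
    sl_studies = {study for study, _ in sl_pairs}
    sl_variants = {variant for _, variant in sl_pairs}
    return {
        "associations": sum(pair not in sl_pairs for pair in gold_standard),
        "studies": sum(study not in sl_studies for study, _ in gold_standard),
        "variants": sum(variant not in sl_variants for _, variant in gold_standard),
    }


# Toy identifiers, not real data.
counts = non_overlap_counts(
    gold_standard=[("S1", "v1"), ("S1", "v2"), ("S2", "v3")],
    study_locus=[("S1", "v1"), ("S3", "v2")],
)
print(counts)
```

Running the same function separately on positives and negatives would give the two columns of the table.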

Non overlapping studies

There are 1726 studies in the GS that fall outside our StudyLocus:

Actions

After today's discussion in the Genetics stand-up, we want:

ireneisdoomed commented 10 months ago

Changes since last comment:

Now I see that 60% (4293/7121) of the gold standards (positives and negatives) are in the credible sets, a much more acceptable dataset to work with.

Source for the missing 40%:

Side note regarding the non overlapping 8%

I took the association 1_145711327_A_C/GCST001791 as an example of an association that we are not extracting from GWASCatalog.

Snippet to extract these associations:

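import pyspark.sql.functions as f

# Flag each curated association by whether its studyLocusId / studyId is in the credible sets.
cs = all_credible_set.df
gs_annotated = (
    parsed_curation.df
    .join(cs.select("studyLocusId", f.lit(True).alias("studyLocusInCredSet")).distinct(), on="studyLocusId", how="left")
    .join(cs.select("studyId", f.lit(True).alias("studyInCredSet")).distinct(), on="studyId", how="left")
    .distinct()
    .persist()
)
# Keep associations whose study is in the credible sets but whose studyLocusId is not.
(
    gs_annotated.filter(f.col("studyLocusInCredSet").isNull()).filter(f.col("studyInCredSet"))
    .select("variantId", "studyId").distinct().show()
)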
+-----------------+----------+
|        variantId|   studyId|
+-----------------+----------+
|  1_145711327_A_C|GCST001791|
|   4_71748550_G_C|GCST005367|
|  4_155515355_G_A|GCST005194|
|  6_160690668_C_T|GCST005194|
|  5_159347957_C_A|GCST003045|
|  5_159242905_C_T|GCST005527|
|   1_62679768_G_A|GCST002221|
|  6_116016340_G_A|GCST002221|
|  6_160589086_A_G|GCST002221|
|  20_46246518_C_T|GCST007236|
|  20_59084069_A_G|GCST007236|
|  20_46065258_G_A|GCST007236|
|  1_219906686_A_G|GCST002932|
|   12_4219355_G_A|GCST005413|
| 10_112998590_C_T|GCST005413|
|  5_159371067_A_G|GCST004132|
|   10_6082294_C_T|GCST005531|
|   10_6079323_T_C|GCST005531|
|  20_45923216_T_C|GCST003665|
|11_108474788_G_GA|GCST005077|
+-----------------+----------+

ireneisdoomed commented 6 months ago

Further work to improve GS coverage is described in https://github.com/opentargets/issues/issues/3261