QC locus to gene gold standards processing

ireneisdoomed commented 10 months ago

The locus to gene gold standard is a curated dataset of GWAS loci with known gene-trait associations. It integrates expert curation, experimental data, and drug-target correlations to identify genes responsible for disease traits. Having a trustworthy gold standard is crucial for training algorithms to predict causal genes from genetic association data.

After a first implementation of the new pipeline, I've extracted some metrics for QC purposes. The code is here: https://gist.github.com/ireneisdoomed/0fae75382e4541663838a05efd4b0412

Dataset	# studyLocusId	# geneId	Positive GS Status	Negative GS Status
Raw gold standard	1,201	451	-	-
Parsed gold standard	1,120	403	2,990,893	0
Annotated gold standard for training	14	13	57,898	0

I'll describe the issues identified in the following comments.

ireneisdoomed commented 10 months ago

Size of annotated gold standard for training is absurd

After joining with the feature matrix, the number of studyLocusId has dramatically dropped to 14, with only 13 distinct genes represented. As a result, any prediction returned by L2G is not valid at this point.

Context

This is what the gold standard annotation process looks like:

The feature matrix is built from the full set of studyLocus, with features aggregated at the studyLocusId level. The reduction therefore means that there is a mismatch between the studyLocusIds in the feature matrix and those in the gold standard.

Why do we join the gold standards with the feature matrix, instead of annotating all studyLocusId in the gold standard?

By design, feature factory works on top of a studyLocus dataset that has passed all the steps of our pipeline.
Biologically speaking, the lead variant in the gold standard represents a locus, whereas L2G features are computed across all variants in a locus. For example, for distance features we don't simply calculate the distance between the lead and the gene's TSS; we calculate the minimum distance between any variant in the locus and the gene's TSS. Therefore, in my opinion, if we were to annotate the gold standards directly, we would first need to define their locus.
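
To make the distance example concrete, here is a minimal sketch in plain Python. It is not the feature factory code: the function names and the toy positions are hypothetical, and it only contrasts a lead-only distance with the minimum over every variant in the locus.

```python
from typing import Iterable


def lead_distance(lead_pos: int, tss: int) -> int:
    """Distance between the lead variant and the gene's TSS."""
    return abs(lead_pos - tss)


def min_locus_distance(variant_positions: Iterable[int], tss: int) -> int:
    """Minimum distance between any variant in the locus and the gene's TSS."""
    return min(abs(pos - tss) for pos in variant_positions)


# Toy locus (made-up positions on one chromosome), lead variant first.
locus = [1_000_000, 1_040_000, 1_120_000]
tss = 1_100_000

lead_only = lead_distance(locus[0], tss)       # uses the lead variant alone
locus_level = min_locus_distance(locus, tss)   # uses every variant in the locus
print(lead_only, locus_level)
```

With a bare lead variant and no locus definition, only the first value can be computed, which is why annotating the gold standard directly would not reproduce the same features.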

What is the problem?

I don't expect to have a complete overlap between the gold standards and the study locus. My current study locus is only composed of GWASCatalog top hits and Finngen's index variants.
However, the reduction is so big that there must be something else going on.

ireneisdoomed commented 10 months ago

Imbalance Positive/Negative GS Status

All study/locus/gene trios have a positive label assigned. This contradicts what I was seeing when testing the pipeline with the production data, where we had a very skewed imbalance towards the negatives.

Context

We need a balanced training set for L2G to learn to distinguish the features of causal vs non-causal genes. In L2G, any gene that is outside the positives list and within a 500kb window of a known positive locus is initially classified as a negative.
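
The labelling rule above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's implementation: the function name and toy gene IDs are hypothetical, the window is interpreted as +/- 500kb around the lead variant, and gene position is taken to be the TSS.

```python
WINDOW_BP = 500_000  # window size quoted above; the +/- interpretation is an assumption


def label_genes_around_locus(lead_pos, positive_genes, gene_tss):
    """Return {geneId: "positive" | "negative"} for genes near one positive locus.

    gene_tss maps geneId to a TSS position on the same chromosome as lead_pos.
    Genes outside the window that are not known positives get no label.
    """
    labels = {}
    for gene_id, tss in gene_tss.items():
        if gene_id in positive_genes:
            labels[gene_id] = "positive"
        elif abs(tss - lead_pos) <= WINDOW_BP:
            labels[gene_id] = "negative"
    return labels


# Toy data: identifiers and coordinates are made up.
gene_tss = {"GENE_POS": 1_000_000, "GENE_NEAR": 1_300_000, "GENE_FAR": 1_600_000}
labels = label_genes_around_locus(1_050_000, {"GENE_POS"}, gene_tss)

# Same locus, but the gene coordinates for neighbouring genes are missing.
labels_no_neighbours = label_genes_around_locus(
    1_050_000, {"GENE_POS"}, {"GENE_POS": 1_000_000}
)
print(labels, labels_no_neighbours)
```

The second call shows the failure mode of interest: if the input that supplies neighbouring genes is empty or misaligned, nothing falls in the window and only positives remain.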

What is the problem

Since distance is the main proxy to derive negative evidence, my hypothesis is that something is wrong in one of the input files (probably V2G) used to calculate this number. Since I was getting reasonable results before, I think it is a problem with the input data rather than with the business logic.

ireneisdoomed commented 10 months ago

Reduction in high quality associations and genes

There is a slight decrease from 1201 high-quality associations in the raw gold standard to 1120 in the parsed gold standard. The number of distinct genes has also decreased from 451 to 403.

What is the problem

The parsing process that converts raw data into an L2GGoldStandard datatype might have issues in the business logic that make us lose potentially valuable data.

ireneisdoomed commented 10 months ago

Investigation on the overlap between gold standards and StudyLocus after GS fixes

With the latest fixes in https://github.com/opentargets/genetics_etl_python/pull/255, I have new data to see if the overlap has increased. Inputs:

Gold standard after fixes
StudyLocus after integrating associations derived from 11,000 studies with summary statistics.

Counts for the raw gold standard at each comparison against StudyLocus:

Raw gold standard subset	# Positives	# Negatives
Starting gold standard	1225	15282
Non-overlapping associations between GS and StudyLocus	992	886
Non-overlapping variants between GS and StudyLocus	65	15
Non-overlapping studies between GS and StudyLocus	908	818

Non overlapping studies

There are 1726 studies in the GS that fall outside our StudyLocus:

181 from NEALE
175 from SAIGE
1343 from GWAS Catalog
- 398 split in the GS
- 945 non split in the GS
The GS studies that are split are missing because the splitting strategy changed between the time the gold standard was generated and the time the StudyLocus was generated.
The GS studies that are not split in the GS are missing because, after splitting, the studies with summary statistics carry a _1 suffix in their studyId even though they cover a single trait.
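
The effect of the _1 suffix on an exact studyId match can be illustrated with a small sketch. The helper name and the study IDs are hypothetical, and the rule "strip _1 only when no _2 sibling exists" is an assumption made for this illustration rather than the pipeline's logic.

```python
def normalise_study_ids(study_ids):
    """Strip a trailing "_1" from IDs that have no sibling split ("_2")."""
    ids = set(study_ids)
    normalised = set()
    for sid in ids:
        base, sep, suffix = sid.rpartition("_")
        if sep and suffix == "1" and f"{base}_2" not in ids:
            normalised.add(base)  # single-trait study: drop the spurious suffix
        else:
            normalised.add(sid)   # genuinely split study or unsuffixed ID
    return normalised


# Toy IDs, made up for illustration.
study_locus_ids = {"STUDYA_1", "STUDYB_1", "STUDYB_2"}
gs_ids = {"STUDYA", "STUDYB_2"}

missing_before = gs_ids - study_locus_ids
missing_after = gs_ids - normalise_study_ids(study_locus_ids)
print(missing_before, missing_after)
```

Removing the suffix at generation time, as agreed in the actions that follow, avoids the need for any such post hoc normalisation.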

Actions

After discussion today in the Genetics stand up we want:

[x] To remove generation of any studyId with a _1 suffix
[ ] To generate a new gold standard curation file that follows the same studyId assignment we use for StudyLocus. Hopefully this is just a one-off process

ireneisdoomed commented 10 months ago

Changes since last comment:

We have credible sets from 25k studies with summary statistics
The study ID assignment has been fixed, removing the "_1" cases

Now I see that 60% (4293/7121) of gold standards (positives and negatives) are in the credible sets, a much more acceptable dataset to work with.

Source for the missing 40%:

8% (220/2828) of them are associations where we have the study, but the index variant we pick is different. I think that the fact that this number is so low really highlights the confidence in our credible sets.
- To capture these, I can check if the variant in the GS matches to any variant in the locus we define.
92% (2608/2828) of them are associations where we don't have the study:
- 17% (488/2828) are associations from SAIGE
- 22% (636/2828) are associations from NEALE
- 50% (1418/2828) are associations from GCST, all of them split. This will likely be fixed once I harmonise the studyId assignment, as mentioned in my previous comment.
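
The categories in this breakdown, including the proposed fallback of matching a GS variant against any variant in the locus, can be sketched in plain Python. The function name, category labels, and all identifiers are hypothetical; the real check would run on the credible set dataset rather than on dictionaries.

```python
def classify_gs_associations(gs_pairs, loci):
    """Classify each (studyId, variantId) gold standard pair.

    loci maps (studyId, leadVariantId) to the set of variantIds in that locus.
    """
    studies = {study_id for study_id, _ in loci}
    result = {}
    for study_id, variant_id in gs_pairs:
        if (study_id, variant_id) in loci:
            result[(study_id, variant_id)] = "lead_match"
        elif any(
            sid == study_id and variant_id in variants
            for (sid, _lead), variants in loci.items()
        ):
            result[(study_id, variant_id)] = "in_locus_match"
        elif study_id in studies:
            result[(study_id, variant_id)] = "study_only"
        else:
            result[(study_id, variant_id)] = "study_missing"
    return result


# Toy data: every identifier below is made up for illustration.
loci = {
    ("STUDY_A", "1_100_A_G"): {"1_100_A_G", "1_150_C_T"},
    ("STUDY_B", "2_500_G_A"): {"2_500_G_A"},
}
gs_pairs = [
    ("STUDY_A", "1_100_A_G"),  # same lead variant
    ("STUDY_A", "1_150_C_T"),  # different lead, but inside the locus
    ("STUDY_B", "2_900_T_C"),  # study present, variant not in any locus
    ("STUDY_C", "3_42_A_T"),   # study absent
]
labels = classify_gs_associations(gs_pairs, loci)
print(labels)
```

Only the "in_locus_match" group would be recovered by the fallback; "study_only" pairs would still need a closer look, and "study_missing" pairs depend on study coverage.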

Side note regarding the non overlapping 8%

I took the association 1_145711327_A_C/GCST001791 as an example of an association we are not extracting from GWASCatalog.

Curated dataset: this variant hasn't been curated; the locus is instead represented by 1_145711327_T_G.
Summary statistics: we get the same peak from processing summary statistics. It is very likely that 1_145711327_A_C is being clumped in the window-based step because of its weaker signal.

Snippet to extract these associations:

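import pyspark.sql.functions as f

# Flag each gold standard row by whether its studyLocusId and its studyId appear in the credible sets.
gs_annotated = (
    parsed_curation.df
    .join(all_credible_set.df.select("studyLocusId", f.lit(True).alias("studyLocusinCredSet")).distinct(), on=["studyLocusId"], how="left")
    .join(all_credible_set.df.select("studyId", f.lit(True).alias("studyinCredSet")).distinct(), on=["studyId"], how="left")
    .distinct()
    .persist()
)
# Keep associations whose study is in the credible sets but whose studyLocusId is not.
gs_annotated.filter(f.col("studyLocusinCredSet").isNull()).filter(f.col("studyinCredSet")).select("variantId", "studyId").distinct().show()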
+-----------------+----------+
|        variantId|   studyId|
+-----------------+----------+
|  1_145711327_A_C|GCST001791|
|   4_71748550_G_C|GCST005367|
|  4_155515355_G_A|GCST005194|
|  6_160690668_C_T|GCST005194|
|  5_159347957_C_A|GCST003045|
|  5_159242905_C_T|GCST005527|
|   1_62679768_G_A|GCST002221|
|  6_116016340_G_A|GCST002221|
|  6_160589086_A_G|GCST002221|
|  20_46246518_C_T|GCST007236|
|  20_59084069_A_G|GCST007236|
|  20_46065258_G_A|GCST007236|
|  1_219906686_A_G|GCST002932|
|   12_4219355_G_A|GCST005413|
| 10_112998590_C_T|GCST005413|
|  5_159371067_A_G|GCST004132|
|   10_6082294_C_T|GCST005531|
|   10_6079323_T_C|GCST005531|
|  20_45923216_T_C|GCST003665|
|11_108474788_G_GA|GCST005077|
+-----------------+----------+

ireneisdoomed commented 6 months ago

More work to improve GS coverage is described here https://github.com/opentargets/issues/issues/3261
