Closed ireneisdoomed closed 6 months ago
After joining with the feature matrix, the number of studyLocusId values drops dramatically to 14, with only 13 distinct genes represented. As a result, any prediction returned by L2G is not valid at this point.
This is what the gold standard annotation process looks like:
The feature matrix is based on the full set of studyLocus, and we compute the aggregation of the different features at the studyLocusId level. The reduction means that there is a mismatch between the data in the feature matrix and the gold standard data.
Why do we join the gold standards with the feature matrix, instead of annotating all studyLocusId in the gold standard?
All study/locus/gene trios have a positive label assigned. This contradicts what I was seeing when testing the pipeline with the production data, where we had a strong imbalance towards the negatives.
We need a balanced training set for L2G to learn to distinguish the features of causal vs non-causal genes. In L2G we initially defined that any gene outside the positives list, but within a 500kb window of a known positive locus, is classified as negative.
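The negative-labelling rule described above can be sketched in plain Python. This is a minimal illustration, not the pipeline's implementation: the `Gene` record, the use of a single coordinate per gene, the reading of the window as 500kb on either side of the locus, and the function `label_genes_near_locus` are all assumptions made for the example. Only the 500kb figure and the positive/negative rule come from the description above.

```python
from dataclasses import dataclass

WINDOW_BP = 500_000  # assumption: 500kb on either side of the locus position


@dataclass(frozen=True)
class Gene:
    gene_id: str
    chromosome: str
    position: int  # assumption: one coordinate per gene, e.g. the TSS


def label_genes_near_locus(locus_chromosome, locus_position, positive_gene_ids, genes):
    """Return {gene_id: label} for every gene within the window of a positive locus."""
    labels = {}
    for gene in genes:
        if gene.chromosome != locus_chromosome:
            continue  # genes on other chromosomes are never in the window
        if abs(gene.position - locus_position) > WINDOW_BP:
            continue  # genes outside the window get no label at all
        is_positive = gene.gene_id in positive_gene_ids
        labels[gene.gene_id] = "positive" if is_positive else "negative"
    return labels


# Toy data with hypothetical gene identifiers
genes = [
    Gene("GENE_A", "1", 1_000_000),
    Gene("GENE_B", "1", 1_300_000),
    Gene("GENE_C", "1", 2_000_000),  # more than 500kb away from the locus
    Gene("GENE_D", "2", 1_000_000),  # different chromosome
]
labels = label_genes_near_locus("1", 1_100_000, {"GENE_A"}, genes)
print(labels)
```

With this rule every positive locus contributes one positive and typically several negatives, which is why a gold standard containing only positive labels is a sign that the negative expansion step did not run or was lost in a later join.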
There is a slight decrease from 1201 high-quality associations in the raw gold standard to 1120 in the parsed gold standard. The number of distinct genes has also decreased from 451 to 403.
The `L2GGoldStandard` datatype might have issues in the business logic that make us lose potentially valuable data.

With the latest fixes in https://github.com/opentargets/genetics_etl_python/pull/255, I have new data to see if the overlap has increased. Inputs:
Starting gold standard:

| | # Positives | # Negatives |
|---|---|---|
| Raw Gold Standard | 1225 | 15282 |

Non-overlapping associations between GS and StudyLocus:

| | # Positives | # Negatives |
|---|---|---|
| Raw Gold Standard | 992 | 886 |

Non-overlapping variants between GS and StudyLocus:

| | # Positives | # Negatives |
|---|---|---|
| Raw Gold Standard | 65 | 15 |

Non-overlapping studies between GS and StudyLocus:

| | # Positives | # Negatives |
|---|---|---|
| Raw Gold Standard | 908 | 818 |
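One way to produce breakdowns like these is to compare the gold standard against StudyLocus at three key levels: the (studyId, variantId) pair, the variant alone, and the study alone. The sketch below uses plain Python and made-up identifiers purely for illustration. The record layout and the helper `count_non_overlapping` are assumptions, not the code used to generate the numbers above.

```python
from collections import Counter


def count_non_overlapping(gold_standard, study_locus, key):
    """Count gold standard rows, per label, whose key is absent from StudyLocus."""
    known = {key(row) for row in study_locus}
    return Counter(row["label"] for row in gold_standard if key(row) not in known)


# Toy records with hypothetical identifiers
gold_standard = [
    {"studyId": "S1", "variantId": "1_100_A_G", "label": "positive"},
    {"studyId": "S1", "variantId": "1_200_C_T", "label": "negative"},
    {"studyId": "S2", "variantId": "1_100_A_G", "label": "positive"},
    {"studyId": "S3", "variantId": "2_300_G_A", "label": "negative"},
]
study_locus = [
    {"studyId": "S1", "variantId": "1_100_A_G"},
    {"studyId": "S4", "variantId": "1_200_C_T"},
]

# Same data, three different join keys
by_association = count_non_overlapping(
    gold_standard, study_locus, lambda r: (r["studyId"], r["variantId"])
)
by_variant = count_non_overlapping(gold_standard, study_locus, lambda r: r["variantId"])
by_study = count_non_overlapping(gold_standard, study_locus, lambda r: r["studyId"])

print(dict(by_association), dict(by_variant), dict(by_study))
```

Comparing the three levels shows where the loss comes from: if most non-overlapping associations are also non-overlapping at the study level, the mismatch is in the study identifiers rather than in the variants.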
There are 1726 studies in the GS that are missing from our StudyLocus:
181 from NEALE
175 from SAIGE
1343 from GWAS Catalog
The GS studies that are split are missing because the splitting strategy changed between the time the gold standard was generated and the time the StudyLocus was generated.
The GS studies that are not split in the GS are missing because, after splitting, the studies with summary statistics carry `_1` as a suffix to the studyId, even though they study a single trait.
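To make the effect of the `_1` suffix concrete, the sketch below compares study IDs before and after stripping the suffix from studies that have only one split. It only illustrates the mismatch: the helper `strip_single_trait_suffix`, the regular expression, and the example IDs are assumptions, and this is not necessarily the fix that was adopted.

```python
import re
from collections import Counter

# Assumption: split studies are named "<base>_<index>"
SPLIT_SUFFIX = re.compile(r"^(?P<base>.+)_(?P<index>\d+)$")


def strip_single_trait_suffix(study_ids):
    """Map each studyId to its base ID when the base has exactly one split ("_1")."""
    matches = {s: SPLIT_SUFFIX.match(s) for s in study_ids}
    splits_per_base = Counter(m.group("base") for m in matches.values() if m)
    normalised = {}
    for study_id, m in matches.items():
        single_split = (
            m is not None
            and m.group("index") == "1"
            and splits_per_base[m.group("base")] == 1
        )
        # Multi-trait studies keep their suffix so the splits stay distinguishable
        normalised[study_id] = m.group("base") if single_split else study_id
    return normalised


# Hypothetical identifiers
study_locus_ids = ["STUDYA_1", "STUDYB_1", "STUDYB_2", "STUDYC"]
gold_standard_ids = {"STUDYA", "STUDYC"}

normalised = strip_single_trait_suffix(study_locus_ids)
matched_before = gold_standard_ids & set(study_locus_ids)
matched_after = gold_standard_ids & set(normalised.values())
print(sorted(matched_before), sorted(matched_after))
```

A pattern this loose would also match study IDs that legitimately end in digits, so a real implementation would need to know which studies were actually split.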
After discussion today in the Genetics stand-up, we want:
`_1` suffix

Changes since last comment:
Now I see that 60% (4293/7121) of the gold standards (positives and negatives) are in the credible sets, a much more acceptable dataset to work with.
Source for the missing 40%:
I took the association 1_145711327_A_C/GCST001791 as an example of an association we are not extracting from GWAS Catalog.
1_145711327_T_G
Snippet to extract these associations:
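import pyspark.sql.functions as f

gs_annotated = (
    parsed_curation.df
    # Flag rows whose studyLocusId is present in the credible sets
    .join(all_credible_set.df.select("studyLocusId", f.lit(True).alias("studyLocusinCredSet")).distinct(), on="studyLocusId", how="left")
    # Flag rows whose studyId is present in the credible sets
    .join(all_credible_set.df.select("studyId", f.lit(True).alias("studyinCredSet")).distinct(), on="studyId", how="left")
    .distinct()
    .persist()
)
(
    # Keep associations whose study is in the credible sets but whose locus is not
    gs_annotated.filter(f.col("studyLocusinCredSet").isNull())
    .filter(f.col("studyinCredSet"))
    .select("variantId", "studyId").distinct().show()
)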
+-----------------+----------+
| variantId| studyId|
+-----------------+----------+
| 1_145711327_A_C|GCST001791|
| 4_71748550_G_C|GCST005367|
| 4_155515355_G_A|GCST005194|
| 6_160690668_C_T|GCST005194|
| 5_159347957_C_A|GCST003045|
| 5_159242905_C_T|GCST005527|
| 1_62679768_G_A|GCST002221|
| 6_116016340_G_A|GCST002221|
| 6_160589086_A_G|GCST002221|
| 20_46246518_C_T|GCST007236|
| 20_59084069_A_G|GCST007236|
| 20_46065258_G_A|GCST007236|
| 1_219906686_A_G|GCST002932|
| 12_4219355_G_A|GCST005413|
| 10_112998590_C_T|GCST005413|
| 5_159371067_A_G|GCST004132|
| 10_6082294_C_T|GCST005531|
| 10_6079323_T_C|GCST005531|
| 20_45923216_T_C|GCST003665|
|11_108474788_G_GA|GCST005077|
+-----------------+----------+
More work to improve GS coverage is described here https://github.com/opentargets/issues/issues/3261
The locus to gene gold standard is a curated dataset of GWAS loci with known gene-trait associations. It integrates expert curation, experimental data, and drug-target correlations to identify genes responsible for disease traits. Having a trustworthy gold standard is crucial for training algorithms to predict causal genes from genetic association data.
After a first implementation of the new pipeline, I've extracted some metrics for QCing purposes. Code is here https://gist.github.com/ireneisdoomed/0fae75382e4541663838a05efd4b0412
I'll describe the issues identified in the following comments.