Dynamically update the studyIds in the L2G training set

xyg123 commented 8 months ago

I want to ensure we use the best studyId for each l2g gold standard. This may change between releases as bigger studies are released. I want to make sure the gold standards in l2g have full coverage of input features, weighted by the latest and strongest fine-mapping results.

@ireneisdoomed does this seem reasonable?

Background

@Daniel-Considine did you implement something similar for the old genetics pipeline? Were there slight improvements in the performance of l2g?

Tasks

[x] Add a join prior to feature generation for l2g gold standards. This join will be between the existing curated gold standards json and the StudyIndex, based on EFO. For each l2g gold standard, change the studyId to a sumstat study, with the largest number of samples and the strongest p-value association.
[ ] Check that the training+prediction steps can run as normal.
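The first task can be sketched as follows. This is a minimal illustration in plain Python rather than Spark, and it assumes things the issue does not specify: that each gold standard carries an EFO id (the hypothetical `efo_id` field), that the study index exposes a sample size and a sumstats flag, and that all IDs below are toy values. It ranks by sample size only, since the study index alone carries no per-locus p-value.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class GoldStandard:
    study_id: str
    variant_id: str
    gene_id: str
    efo_id: str  # hypothetical field: trait of the curated association


@dataclass(frozen=True)
class Study:
    study_id: str
    efo_ids: tuple
    n_samples: int
    has_sumstats: bool


def best_study_per_trait(studies):
    """Map each EFO id to the sumstat study with the most samples."""
    best = {}
    for study in studies:
        if not study.has_sumstats:
            continue
        for efo in study.efo_ids:
            current = best.get(efo)
            if current is None or study.n_samples > current.n_samples:
                best[efo] = study
    return best


def update_study_ids(gold_standards, studies):
    """Swap each gold standard's studyId for the best study of its trait."""
    best = best_study_per_trait(studies)
    updated = []
    for gs in gold_standards:
        match = best.get(gs.efo_id)
        # Keep the curated studyId when no sumstat study shares the trait.
        updated.append(replace(gs, study_id=match.study_id) if match else gs)
    return updated


# Toy data, not real studies.
gold_standards = [
    GoldStandard("CURATED_1", "1_100_A_G", "GENE_1", "EFO_TOY_1"),
    GoldStandard("CURATED_2", "2_200_C_T", "GENE_2", "EFO_TOY_2"),
]
studies = [
    Study("STUDY_SMALL", ("EFO_TOY_1",), 10_000, True),
    Study("STUDY_LARGE", ("EFO_TOY_1",), 400_000, True),
    Study("STUDY_NO_SUMSTATS", ("EFO_TOY_1",), 900_000, False),
]
updated = update_study_ids(gold_standards, studies)
```

Comparing `gold_standards` with `updated` gives the list of changed studyIds that the acceptance test asks for.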

Acceptance tests

How do we know the task is complete?

Will be able to check the differences in the studyIds for the gold standards.
Summarise the differences in performance with these changes.

addramir commented 8 months ago

My only comment: we need to be clear about the rule for selecting the best study ID. The biggest sample size and a genome-wide significant p-value? Just the lowest p-value? The biggest effective sample size? Just the lowest p-value looks interesting to me (Daniel previously used the rule of the biggest effective sample size), but again: which variant?
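To make the choice concrete, the candidate rules can be written as sort keys and compared on the same data. This is only a sketch: the three studies are toy values, and the effective sample size uses the common case-control convention 4 / (1/cases + 1/controls), which is an assumption and may not be the definition used in the old pipeline.

```python
import math

# Toy candidates for one trait/locus; all values are illustrative.
candidates = [
    {"studyId": "STUDY_A", "nCases": 2_000, "nControls": 400_000,
     "pValueMantissa": 3.0, "pValueExponent": -6},
    {"studyId": "STUDY_B", "nCases": 50_000, "nControls": 60_000,
     "pValueMantissa": 4.0, "pValueExponent": -12},
    {"studyId": "STUDY_C", "nCases": 20_000, "nControls": 20_000,
     "pValueMantissa": 1.0, "pValueExponent": -40},
]

GENOME_WIDE = -math.log10(5e-8)  # conventional genome-wide threshold


def neg_log_p(c):
    """-log10(p) computed from mantissa and exponent, avoiding underflow."""
    return -(math.log10(c["pValueMantissa"]) + c["pValueExponent"])


def total_n(c):
    return c["nCases"] + c["nControls"]


def effective_n(c):
    # Assumed definition: 4 / (1/cases + 1/controls).
    return 4 / (1 / c["nCases"] + 1 / c["nControls"])


rules = {
    "lowest_p": neg_log_p,
    "largest_n": total_n,
    "largest_effective_n": effective_n,
    # Significant studies first, then the largest sample size among them.
    "largest_n_if_significant": lambda c: (neg_log_p(c) >= GENOME_WIDE, total_n(c)),
}

winners = {name: max(candidates, key=key)["studyId"] for name, key in rules.items()}
```

On this toy input the rules disagree (`lowest_p` picks STUDY_C, `largest_n` picks STUDY_A, the other two pick STUDY_B), which is why the rule needs to be fixed explicitly.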

ireneisdoomed commented 7 months ago

Changing the gold standard from a dataset of locus/study associations to locus/trait makes sense to me. Based on that, we want to define which is the best study that represents such association so that we maximise the coverage of feature annotations.

About the definition of the best study, I agree we should use sample size + p-value. If effective sample size is the minimum number of samples that achieves 80% statistical power, would it make sense to pick as the best study the most statistically significant one that meets the effective sample size? It would be a cool method to add to the package.

About the methodology, if you want to find the best representative study for the trait/locus associations, don't you have to join between the gold standard and our credible sets (+ study metadata like the sample size and trait) instead of just using the study table?

For context, we are currently losing ~40% of the associations in the gold standard (positive and negative sets) because we don't have a matching credible set (more here). With this work, not only could we recover those associations, but we would also improve the annotations for the existing ones.

A quick example of an association we are not using is the role of this locus in JAK2 and polycythemia vera, represented by SAIGE_200_1, a study we no longer have.

# Gold standard
+--------------------+-----------+-------------+---------------+-----------+
|        studyLocusId|    studyId|    variantId|         geneId|    sources|
+--------------------+-----------+-------------+---------------+-----------+
|-9081827344710656443|SAIGE_200_1|9_5049092_G_T|ENSG00000096968|[ChEMBL_IV]|
+--------------------+-----------+-------------+---------------+-----------+

In the credible set, for that region and that trait we currently have 2 examples, from which we could pick GCST90041883 as the best study:

+-------------+------------+--------+--------------+--------------+-------------+----------+
|    variantId|     studyId|    beta|pValueMantissa|pValueExponent|standardError|sampleSize|
+-------------+------------+--------+--------------+--------------+-------------+----------+
|9_5113577_C_T|GCST90041883|0.814154|         3.695|           -18|    0.0937122|    456348|
|9_5239549_G_A|GCST90043939| 0.85605|         1.528|            -9|     0.141698|    450642|
+-------------+------------+--------+--------------+--------------+-------------+----------+

Here we see an example of the problem of which variant to choose to perform the join. 9_5049092_G_T is not a lead, but it represents the same region as the other 2. Maybe there's a cleverer way to define regions, but the most straightforward solution I can think of is to look for the variant in the locus (here, it is present in both).
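A minimal sketch of that lookup, in plain Python rather than Spark for brevity. The two rows come from the credible-set table above; the `locus` lists are truncated illustrations that only show the lead and the gold-standard variant (a real locus holds many more tag variants), and the helper names are hypothetical.

```python
import math

gold_standard_variant = "9_5049092_G_T"

# Rows from the credible-set table; `locus` is a truncated illustration.
credible_sets = [
    {"studyId": "GCST90041883", "leadVariantId": "9_5113577_C_T",
     "pValueMantissa": 3.695, "pValueExponent": -18,
     "locus": ["9_5113577_C_T", "9_5049092_G_T"]},
    {"studyId": "GCST90043939", "leadVariantId": "9_5239549_G_A",
     "pValueMantissa": 1.528, "pValueExponent": -9,
     "locus": ["9_5239549_G_A", "9_5049092_G_T"]},
]


def credible_sets_containing(variant_id, credible_sets):
    """Return credible sets whose locus includes the variant, lead or not."""
    return [cs for cs in credible_sets if variant_id in cs["locus"]]


def neg_log_p(cs):
    """-log10(p) from mantissa and exponent."""
    return -(math.log10(cs["pValueMantissa"]) + cs["pValueExponent"])


matches = credible_sets_containing(gold_standard_variant, credible_sets)
# Among the matching credible sets, keep the most significant one.
best = max(matches, key=neg_log_p)
```

Both credible sets match on locus membership, and ranking by p-value selects GCST90041883, in line with the choice suggested above.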

On top of that, using the trait in the current L2G has some challenges. I don't expect that all traits are going to be in the study index.

When I looked at Eric Fauman's gold standards, most of them were metabolite measurements with IDs from HMDB that I don't think are covered by EFO.
And for the ones that are in EFO, we will probably have to make sure they are updated as these mappings are quite old.

addramir commented 7 months ago

Agree @ireneisdoomed. We should always ask whether the locus from the GS list is genome-wide significant in the study we choose. Naturally, we need to combine three tables: GS list (EFO-variant-gene), credible sets, study index. I don't know the exact current rule for combining the GS list with the CSs, but I think it is reasonable to ask whether the variant from the GS is present in the credible set (not necessarily as the lead of the CS).
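Combining the three tables could look roughly like this. It is a sketch under assumptions: the field names (`efoId`, `efoIds`, `locus`, `nSamples`), the 5e-8 threshold, and all IDs are illustrative, not the pipeline's actual schema or rule.

```python
def is_genome_wide_significant(mantissa, exponent, threshold=5e-8):
    """Check p = mantissa * 10**exponent against the threshold."""
    return mantissa * 10.0 ** exponent <= threshold


def candidate_studies(gs_row, credible_sets, study_index):
    """Studies that share the GS trait, contain the GS variant in their
    credible set, and are genome-wide significant at that locus."""
    out = []
    for cs in credible_sets:
        study = study_index.get(cs["studyId"])
        if study is None or gs_row["efoId"] not in study["efoIds"]:
            continue  # study missing from the index, or different trait
        if gs_row["variantId"] not in cs["locus"]:
            continue  # GS variant absent from the credible set
        if not is_genome_wide_significant(cs["pValueMantissa"], cs["pValueExponent"]):
            continue
        out.append({"studyId": cs["studyId"], "nSamples": study["nSamples"]})
    return out


# Toy inputs, one row per table kind.
gs_row = {"efoId": "EFO_TOY", "variantId": "1_100_A_G", "geneId": "GENE_1"}
study_index = {
    "STUDY_OK": {"efoIds": ["EFO_TOY"], "nSamples": 400_000},
    "STUDY_WEAK": {"efoIds": ["EFO_TOY"], "nSamples": 500_000},
    "STUDY_OTHER_TRAIT": {"efoIds": ["EFO_OTHER"], "nSamples": 600_000},
}
credible_sets = [
    {"studyId": "STUDY_OK", "pValueMantissa": 2.0, "pValueExponent": -15,
     "locus": ["1_150_C_T", "1_100_A_G"]},
    {"studyId": "STUDY_WEAK", "pValueMantissa": 3.0, "pValueExponent": -6,
     "locus": ["1_100_A_G"]},
    {"studyId": "STUDY_OTHER_TRAIT", "pValueMantissa": 1.0, "pValueExponent": -20,
     "locus": ["1_100_A_G"]},
]
eligible = candidate_studies(gs_row, credible_sets, study_index)
```

Whatever ranking rule is agreed on can then be applied to `eligible` only, so the chosen study always supports the gold-standard locus.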

xyg123 commented 2 months ago

Will become obsolete once the l2g feature matrix revision effort #3432 is completed, as the model will move towards EFO-gene assignments instead of studyLocus-specific ones.
