opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Clumping GWAS Catalog top hits #3467

Closed d0choa closed 2 months ago

d0choa commented 2 months ago

GWAS Catalog top hits don't have any clumping strategy. If a GWAS Catalog study reports many associations within a region/haplotype, we have no way to control for it. This might result in an artificially high number of credible sets produced by PICS.

An example from GCST90321118:

Image

We want to perform clumping on these credible sets, but we still need to scope the technical strategy for implementing it.

addramir commented 2 months ago

I think the best person to implement this is @DSuveges

DSuveges commented 2 months ago

As a related issue, it has been identified that the large number of credible sets in the PICSed GWAS Catalog curated dataset can be partially explained by a bug in the LD clumping method. However, after fixing that issue, there is still a relatively high number of associations on chromosome 7 of this study:

+---------------------------------------------------------------------------------------------------------------------+------------+----------+---------+--------------+--------------+------------------------------------------------------------------+
|variantId                                                                                                            |studyId     |chromosome|position |pValueMantissa|pValueExponent|qualityControls                                                   |
+---------------------------------------------------------------------------------------------------------------------+------------+----------+---------+--------------+--------------+------------------------------------------------------------------+
|7_37918687_G_A                                                                                                       |GCST90321118|7         |37918687 |1.0           |-8            |[]                                                                |
|7_38060307_C_CT                                                                                                      |GCST90321118|7         |38060307 |5.0           |-18           |[]                                                                |
|7_38109854_T_TA                                                                                                      |GCST90321118|7         |38109854 |6.0           |-12           |[Variant not found in LD reference]                               |
|7_38113261_A_G                                                                                                       |GCST90321118|7         |38113261 |2.0           |-16           |[]                                                                |
|7_96514529_A_AC                                                                                                      |GCST90321118|7         |96514529 |2.0           |-9            |[]                                                                |
|7_121084734_C_T                                                                                                      |GCST90321118|7         |121084734|6.0           |-10           |[]                                                                |
|7_121117073_C_T                                                                                                      |GCST90321118|7         |121117073|2.0           |-11           |[Variant not found in LD reference]                               |
|7_121241062_G_GAATTGGATGGAAAAATAAGCACTTTTGAGGAAGATAATCTTTATTTTGCCATTCAAAAACCAGCATCTCTCCTAAATTTTCTGTTGTTTCTTTTAGCAGTAC|GCST90321118|7         |121241062|1.0           |-34           |[Variant not found in LD reference]                               |
|7_121241063_G_GGATGGAAAAATAAGCACTTTTGAGGAAGATAATCTTTATTTTGCCATTCAAAAACCAGCATCTCT                                     |GCST90321118|7         |121241063|1.0           |-34           |[Variant not found in LD reference]                               |
|7_121241065_C_CATTCAAAAACCAGCATCTCTCCTAAATTTTCTGTTGTTTCTTTTAGCA                                                      |GCST90321118|7         |121241065|1.0           |-34           |[Variant not found in LD reference]                               |
|7_121241065_C_T                                                                                                      |GCST90321118|7         |121241065|1.0           |-34           |[Variant not found in LD reference]                               |
|7_121251832_A_G                                                                                                      |GCST90321118|7         |121251832|3.0           |-32           |[Variant not found in LD reference]                               |
|7_121313702_G_A                                                                                                      |GCST90321118|7         |121313702|1.0           |-14           |[]                                                                |
|7_121320217_G_C                                                                                                      |GCST90321118|7         |121320217|1.0           |-126          |[Palindrome alleles - cannot harmonize]                           |
|7_121325298_C_T                                                                                                      |GCST90321118|7         |121325298|1.0           |-19           |[]                                                                |
|7_121325508_A_G                                                                                                      |GCST90321118|7         |121325508|7.0           |-13           |[]                                                                |
|7_121327159_A_T                                                                                                      |GCST90321118|7         |121327159|3.0           |-25           |[Palindrome alleles - cannot harmonize]                           |
|7_121364935_G_A                                                                                                      |GCST90321118|7         |121364935|6.0           |-15           |[]                                                                |
|7_121373353_T_G                                                                                                      |GCST90321118|7         |121373353|2.0           |-12           |[]                                                                |
|7_121386694_T_C                                                                                                      |GCST90321118|7         |121386694|6.0           |-9            |[LD block does not contain variants at the required R^2 threshold]|
+---------------------------------------------------------------------------------------------------------------------+------------+----------+---------+--------------+--------------+------------------------------------------------------------------+
only showing top 20 rows

Some of these credible sets were included with the flag "lead not found in credible set"; however, when this flag is absent, the returned LD sets were not overlapping.

This observed behaviour justifies applying an additional, window-based clumping step to the GWAS Catalog curated associations.

DSuveges commented 2 months ago

Test dataset showing a StudyLocus dataset before and after window-based clumping:

+-------+----------+--------+--------------+--------------------+--------------+---------+
|studyId|chromosome|position|pValueExponent|        studyLocusId|pValueMantissa|variantId|
+-------+----------+--------+--------------+--------------------+--------------+---------+
|     s1|        c1|       1|            -1|  816176356781534521|           1.0|       v1|
|     s1|        c1|       3|            -3| -206100010007302174|           1.0|       v4|
|     s1|        c1|       2|            -2|-4721564960210010127|           1.0|       v2|
|     s1|        c2|       2|            -2|-2919469633967748933|           1.0|       v3|
|     s3|        c2|       2|            -2| 6166427946174414045|           1.0|       v1|
+-------+----------+--------+--------------+--------------------+--------------+---------+

In [2]: sl.window_based_clumping(3).df.show()
+-------+----------+--------+--------------+--------------------+--------------+---------+------------------------------------------------------------+
|studyId|chromosome|position|pValueExponent|studyLocusId        |pValueMantissa|variantId|qualityControls                                             |
+-------+----------+--------+--------------+--------------------+--------------+---------+------------------------------------------------------------+
|s1     |c1        |1       |-1            |1740131172091600674 |1.0           |v1       |[Explained by a more significant variant in the same window]|
|s1     |c1        |2       |-2            |-6342038754064840370|1.0           |v2       |[Explained by a more significant variant in the same window]|
|s1     |c1        |3       |-3            |-3040002280507636093|1.0           |v4       |[]                                                          |
|s1     |c2        |2       |-2            |8923642814302707841 |1.0           |v3       |[]                                                          |
|s3     |c2        |2       |-2            |-218747710423759089 |1.0           |v1       |[]                                                          |
+-------+----------+--------+--------------+--------------------+--------------+---------+------------------------------------------------------------+

(The studyLocusId values have changed, but that's fine.)
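
For anyone who wants to reason about the flagging logic without a Spark session, here is a minimal pure-Python sketch of one way to do window-based clumping. It is not the actual `window_based_clumping` implementation. The function name `window_based_clumping_sketch`, the greedy "most significant first" strategy, the inclusive distance check (`<= window`), and ranking by (exponent, mantissa) under the assumption of a normalised mantissa are all assumptions made for illustration. On the test rows above it flags the same two associations.

```python
from collections import defaultdict

# Flag text copied from the qualityControls column shown above.
FLAG = "Explained by a more significant variant in the same window"


def window_based_clumping_sketch(rows, window):
    """Greedy distance-based clumping (hypothetical helper, not the library method).

    Within each (studyId, chromosome) group, associations are visited from most
    to least significant. An association is kept as a lead unless it lies
    within `window` bases of an already kept lead, in which case it is flagged.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[(row["studyId"], row["chromosome"])].append(row)

    result = []
    for group in groups.values():
        lead_positions = []
        # Assumption: mantissa is normalised, so a smaller exponent (then a
        # smaller mantissa) means a smaller p-value.
        ranked = sorted(group, key=lambda r: (r["pValueExponent"], r["pValueMantissa"]))
        for row in ranked:
            explained = any(abs(row["position"] - p) <= window for p in lead_positions)
            if not explained:
                lead_positions.append(row["position"])
            result.append({**row, "qualityControls": [FLAG] if explained else []})
    return result


# The five test associations from the first table above.
rows = [
    {"studyId": "s1", "chromosome": "c1", "position": 1, "pValueMantissa": 1.0, "pValueExponent": -1, "variantId": "v1"},
    {"studyId": "s1", "chromosome": "c1", "position": 3, "pValueMantissa": 1.0, "pValueExponent": -3, "variantId": "v4"},
    {"studyId": "s1", "chromosome": "c1", "position": 2, "pValueMantissa": 1.0, "pValueExponent": -2, "variantId": "v2"},
    {"studyId": "s1", "chromosome": "c2", "position": 2, "pValueMantissa": 1.0, "pValueExponent": -2, "variantId": "v3"},
    {"studyId": "s3", "chromosome": "c2", "position": 2, "pValueMantissa": 1.0, "pValueExponent": -2, "variantId": "v1"},
]

clumped = window_based_clumping_sketch(rows, window=3)
for row in clumped:
    print(row["studyId"], row["chromosome"], row["variantId"], row["qualityControls"])
```

Grouping by study and chromosome keeps associations from different studies (s1 vs s3) or different chromosomes (c1 vs c2) from explaining each other, which matches the behaviour in the output table above.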