d0choa closed this issue 2 months ago
I think the best person to implement this is @DSuveges
As a related issue, we have identified that the large number of credible sets in the PICSed GWAS Catalog curated dataset can be partially explained by a bug in the LD clumping method. However, even after fixing that bug, there is still a relatively high number of associations on chromosome 7 of study GCST90321118:
+---------------------------------------------------------------------------------------------------------------------+------------+----------+---------+--------------+--------------+------------------------------------------------------------------+
|variantId |studyId |chromosome|position |pValueMantissa|pValueExponent|qualityControls |
+---------------------------------------------------------------------------------------------------------------------+------------+----------+---------+--------------+--------------+------------------------------------------------------------------+
|7_37918687_G_A |GCST90321118|7 |37918687 |1.0 |-8 |[] |
|7_38060307_C_CT |GCST90321118|7 |38060307 |5.0 |-18 |[] |
|7_38109854_T_TA |GCST90321118|7 |38109854 |6.0 |-12 |[Variant not found in LD reference] |
|7_38113261_A_G |GCST90321118|7 |38113261 |2.0 |-16 |[] |
|7_96514529_A_AC |GCST90321118|7 |96514529 |2.0 |-9 |[] |
|7_121084734_C_T |GCST90321118|7 |121084734|6.0 |-10 |[] |
|7_121117073_C_T |GCST90321118|7 |121117073|2.0 |-11 |[Variant not found in LD reference] |
|7_121241062_G_GAATTGGATGGAAAAATAAGCACTTTTGAGGAAGATAATCTTTATTTTGCCATTCAAAAACCAGCATCTCTCCTAAATTTTCTGTTGTTTCTTTTAGCAGTAC|GCST90321118|7 |121241062|1.0 |-34 |[Variant not found in LD reference] |
|7_121241063_G_GGATGGAAAAATAAGCACTTTTGAGGAAGATAATCTTTATTTTGCCATTCAAAAACCAGCATCTCT |GCST90321118|7 |121241063|1.0 |-34 |[Variant not found in LD reference] |
|7_121241065_C_CATTCAAAAACCAGCATCTCTCCTAAATTTTCTGTTGTTTCTTTTAGCA |GCST90321118|7 |121241065|1.0 |-34 |[Variant not found in LD reference] |
|7_121241065_C_T |GCST90321118|7 |121241065|1.0 |-34 |[Variant not found in LD reference] |
|7_121251832_A_G |GCST90321118|7 |121251832|3.0 |-32 |[Variant not found in LD reference] |
|7_121313702_G_A |GCST90321118|7 |121313702|1.0 |-14 |[] |
|7_121320217_G_C |GCST90321118|7 |121320217|1.0 |-126 |[Palindrome alleles - cannot harmonize] |
|7_121325298_C_T |GCST90321118|7 |121325298|1.0 |-19 |[] |
|7_121325508_A_G |GCST90321118|7 |121325508|7.0 |-13 |[] |
|7_121327159_A_T |GCST90321118|7 |121327159|3.0 |-25 |[Palindrome alleles - cannot harmonize] |
|7_121364935_G_A |GCST90321118|7 |121364935|6.0 |-15 |[] |
|7_121373353_T_G |GCST90321118|7 |121373353|2.0 |-12 |[] |
|7_121386694_T_C |GCST90321118|7 |121386694|6.0 |-9 |[LD block does not contain variants at the required R^2 threshold]|
+---------------------------------------------------------------------------------------------------------------------+------------+----------+---------+--------------+--------------+------------------------------------------------------------------+
only showing top 20 rows
Some of these credible sets were included with the flag "lead not found in credible set". However, when this flag is absent, the returned LD sets did not overlap.
This observed behaviour justifies applying an extra, window-based clumping step to the GWAS Catalog curated associations.
Test dataset showing a StudyLocus dataset before and after window-based clumping:
+-------+----------+--------+--------------+--------------------+--------------+---------+
|studyId|chromosome|position|pValueExponent| studyLocusId|pValueMantissa|variantId|
+-------+----------+--------+--------------+--------------------+--------------+---------+
| s1| c1| 1| -1| 816176356781534521| 1.0| v1|
| s1| c1| 3| -3| -206100010007302174| 1.0| v4|
| s1| c1| 2| -2|-4721564960210010127| 1.0| v2|
| s1| c2| 2| -2|-2919469633967748933| 1.0| v3|
| s3| c2| 2| -2| 6166427946174414045| 1.0| v1|
+-------+----------+--------+--------------+--------------------+--------------+---------+
In [2]: sl.window_based_clumping(3).df.show()
+-------+----------+--------+--------------+--------------------+--------------+---------+------------------------------------------------------------+
|studyId|chromosome|position|pValueExponent|studyLocusId |pValueMantissa|variantId|qualityControls |
+-------+----------+--------+--------------+--------------------+--------------+---------+------------------------------------------------------------+
|s1 |c1 |1 |-1 |1740131172091600674 |1.0 |v1 |[Explained by a more significant variant in the same window]|
|s1 |c1 |2 |-2 |-6342038754064840370|1.0 |v2 |[Explained by a more significant variant in the same window]|
|s1 |c1 |3 |-3 |-3040002280507636093|1.0 |v4 |[] |
|s1 |c2 |2 |-2 |8923642814302707841 |1.0 |v3 |[] |
|s3 |c2 |2 |-2 |-218747710423759089 |1.0 |v1 |[] |
+-------+----------+--------+--------------+--------------------+--------------+---------+------------------------------------------------------------+
(The studyLocusId values change, but that's fine.)
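For anyone reading along, the flagging behaviour in the test output can be sketched in plain Python. This is only an illustration, not the actual `window_based_clumping` code: the function below, its greedy strategy, the "distance <= window" rule, and ranking by (exponent, mantissa) are all assumptions made for the sketch. It also assumes mantissas are normalised to [1, 10), so a smaller exponent always means a more significant p-value.

```python
from collections import defaultdict

# Flag text copied from the qualityControls column in the test output.
WINDOW_FLAG = "Explained by a more significant variant in the same window"


def window_based_clumping(rows, window):
    """Hypothetical sketch: flag associations that sit within `window` bases
    of a more significant association in the same study and chromosome."""
    # Group row indices by (study, chromosome); clumping never crosses groups.
    groups = defaultdict(list)
    for idx, row in enumerate(rows):
        groups[(row["studyId"], row["chromosome"])].append(idx)

    flagged = set()
    for indices in groups.values():
        # Most significant first: lowest exponent, then lowest mantissa.
        ordered = sorted(
            indices,
            key=lambda i: (rows[i]["pValueExponent"], rows[i]["pValueMantissa"]),
        )
        lead_positions = []
        for i in ordered:
            position = rows[i]["position"]
            if any(abs(position - lead) <= window for lead in lead_positions):
                flagged.add(i)  # explained by an already accepted lead
            else:
                lead_positions.append(position)  # becomes a new lead

    # Return copies so the input rows are left untouched.
    return [
        {
            **row,
            "qualityControls": list(row.get("qualityControls", []))
            + ([WINDOW_FLAG] if i in flagged else []),
        }
        for i, row in enumerate(rows)
    ]


# Same rows as the test dataset above (studyLocusId omitted).
rows = [
    {"studyId": "s1", "chromosome": "c1", "position": 1, "pValueMantissa": 1.0, "pValueExponent": -1, "variantId": "v1"},
    {"studyId": "s1", "chromosome": "c1", "position": 3, "pValueMantissa": 1.0, "pValueExponent": -3, "variantId": "v4"},
    {"studyId": "s1", "chromosome": "c1", "position": 2, "pValueMantissa": 1.0, "pValueExponent": -2, "variantId": "v2"},
    {"studyId": "s1", "chromosome": "c2", "position": 2, "pValueMantissa": 1.0, "pValueExponent": -2, "variantId": "v3"},
    {"studyId": "s3", "chromosome": "c2", "position": 2, "pValueMantissa": 1.0, "pValueExponent": -2, "variantId": "v1"},
]

clumped = window_based_clumping(rows, 3)
for row in clumped:
    print(row["studyId"], row["chromosome"], row["variantId"], row["qualityControls"])
```

With a window of 3, only v1 and v2 on s1/c1 get flagged, because v4 at position 3 is the most significant association in that group. Flagging instead of dropping rows keeps the dataset the same size, which matches what the test output shows.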
GWAS Catalog top hits are not clumped in any way. If a GWAS Catalog study reports many associations within a single region or haplotype, we have no way to control for it. This might result in an artificially high number of credible sets from PICS.
Next, an example on GCST90321118
We want to perform clumping on these credible sets, but we still need to scope the technical strategy to implement this.