ireneisdoomed / issues

287,382 associations with a null credible set #2

Open ireneisdoomed opened 1 year ago

ireneisdoomed commented 1 year ago

The GWAS Catalog study/locus table contains 287,382 associations with a null credible set, corresponding to 170,156 unique offending variants. These nulls are produced in the annotate_ld step of the pipeline.

The only reason I can think of for an empty credible set is that there is no LD information for the variant in gnomAD's LD matrix. This number of nulls far exceeds my expectations.

I want to see if this is the result of another bug, or if the LD annotation is correct and indeed we don't have an LD set for those variants.
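The counts above could be reproduced with a check along these lines. This is only a sketch in plain Python on invented toy rows: the `variantId` and `credibleSet` field names come from the tables in this thread, while the rows, the `tagVariantId` key and the helper name are hypothetical. It keeps null and empty credible sets apart, since they may have different causes.

```python
from typing import Dict, Iterable, Mapping


def summarise_missing_credible_sets(rows: Iterable[Mapping]) -> Dict[str, int]:
    """Count associations with a null or empty credible set and the distinct variants behind them."""
    null_rows = [r for r in rows if r["credibleSet"] is None]
    empty_rows = [r for r in rows if r["credibleSet"] is not None and len(r["credibleSet"]) == 0]
    offending = null_rows + empty_rows
    return {
        "null_associations": len(null_rows),
        "empty_associations": len(empty_rows),
        "unique_variants": len({r["variantId"] for r in offending}),
    }


# Toy rows, invented for illustration only (not real pipeline output).
rows = [
    {"studyId": "S1", "variantId": "1_100_A_G", "credibleSet": None},
    {"studyId": "S2", "variantId": "1_100_A_G", "credibleSet": []},
    {"studyId": "S2", "variantId": "2_200_C_T", "credibleSet": [{"tagVariantId": "2_210_G_A"}]},
    {"studyId": "S3", "variantId": "3_300_T_C", "credibleSet": None},
]

summary = summarise_missing_credible_sets(rows)
print(summary)
```

The same variant can be counted in several associations, which is why the number of unique variants is lower than the number of associations.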

ireneisdoomed commented 1 year ago

After checking 30 offending cases I have found:

+--------------------+
|           variantId|
+--------------------+
|     10_24562781_G_A| --> no LD set
|      2_21239871_T_C| --> no LD set
|     21_41646112_G_A| --> no LD set
|     6_130990769_C_T| --> with LD set
|      9_95507889_A_G| --> with LD set
|      1_13491073_C_T| --> no LD set
|    1_109275537_C_CT|
|10_62794413_CAAAA...|
|      11_1867384_C_T| --> with LD set
|    12_122659642_A_G| --> with LD set
|     18_10695159_A_G| --> no LD set
|     18_55324905_G_A| --> no LD set
|      2_26993105_C_T| --> with LD set
|     5_112653249_T_A| --> with LD set
|   6_56014117_T_TCAG|
|       6_6767190_G_T| --> no LD set
|      6_43360012_A_G| --> no LD set
|     7_100329813_A_G| --> with LD set
|    11_116208580_C_G| --> with LD set
|    11_115690178_C_T| --> no LD set
|     12_88442438_G_T| --> with LD set
|     13_69184292_T_C| --> no LD set
|    14_102813458_C_T| --> with LD set
|     15_83946048_G_C| --> no LD set
|     19_45356900_G_A| --> no LD set
|     20_35547876_A_G|
|      3_18678552_A_C|
|      6_98843766_A_C| --> no LD set
|     7_142770582_A_G|
|      7_43397280_G_A| --> no LD set
+--------------------+

I selected 3 of these variants and queried the LD matrix directly. All of them have a non-null LD set:

+----------+------------+---------------+--------------------+-----------+      
|chromosome|     studyId|      variantId|         credibleSet|credSetSize|
+----------+------------+---------------+--------------------+-----------+
|         6|GCST004326_1|6_130990769_C_T|[{null, null, nul...|        111|
|         1|GCST002539_1| 1_13491073_C_T|[{null, null, nul...|          2|
|         9|GCST000062_1| 9_95507889_A_G|[{null, null, nul...|         51|
+----------+------------+---------------+--------------------+-----------+

ireneisdoomed commented 1 year ago

Again I see a discrepancy between the pipeline output and my test data.

In my test script I was only querying the European LD matrix, so I have now included all of the population matrices, in case the problem was that the variants are missing from some of the others. It is not: the function works even when a variant has no LD information in a matrix. This is what I get:

+---------------+----------------+                                              
|      variantId|gnomadPopulation|
+---------------+----------------+
| 1_13491073_C_T|             nfe|
|6_130990769_C_T|             nfe|
| 9_95507889_A_G|             eas|
| 9_95507889_A_G|             nfe|
+---------------+----------------+

ireneisdoomed commented 1 year ago

After submitting a new job with the study_locus restricted to my three test variants, I see the same non-null credible set information.

+---------------+-----------------+
|      variantId|size(credibleSet)|
+---------------+-----------------+
|6_130990769_C_T|              111|
|6_130990769_C_T|              111|
| 1_13491073_C_T|                2|
| 9_95507889_A_G|               51|
| 9_95507889_A_G|               47|
|6_130990769_C_T|              111|
+---------------+-----------------+

The associations are also correctly annotated and clumped. Something very silly must be going on.

ireneisdoomed commented 1 year ago

Since I am not able to reproduce the problem, I will rerun the pipeline with a bigger set of data. The only thing I can think of is that the data is partitioned differently in my two runs:

Testing with GCST006571_1

GCST006571_1 is a study with 1343 associated variants. I have:

The results of the job are practically perfect:

The debugging datasets (sl_clumped, sl_finemapped, sl_ld_annotated) are stored here: gs://genetics_etl_python_playground/output/python_etl/parquet/XX.XX/catalog_study_locus_debugging

ireneisdoomed commented 1 year ago

I've rerun the pipeline on the whole dataset (job) producing some checkpoints:

I don't understand the results.

The only thing I can think of is that the LD results depend on the number of variants queried.
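One way to test that hypothesis is to annotate a few probe variants on their own and again as part of the full batch, then compare the results. The sketch below is plain Python with invented data: `annotate_ok` and `annotate_capped` are hypothetical stand-ins for the real LD annotation step, and `annotate_capped` is deliberately broken so that the check has something to catch.

```python
from typing import Callable, Dict, List, Optional, Sequence

Annotation = Dict[str, Optional[List[str]]]

# Toy LD sets, invented for illustration only.
TOY_LD = {"v1": ["v1", "t1"], "v2": ["v2"], "v3": ["v3", "t2", "t3"]}


def annotate_ok(batch: Sequence[str]) -> Annotation:
    """Well-behaved stand-in: the result for a variant does not depend on the batch."""
    return {v: TOY_LD.get(v, []) for v in batch}


def annotate_capped(batch: Sequence[str], cap: int = 2) -> Annotation:
    """Deliberately broken stand-in: only the first `cap` variants get an LD set."""
    return {v: (TOY_LD.get(v, []) if i < cap else None) for i, v in enumerate(batch)}


def find_batch_dependent_variants(
    annotate: Callable[[Sequence[str]], Annotation],
    variants: Sequence[str],
    probes: Sequence[str],
) -> List[str]:
    """Return the probes whose annotation differs between a full run and a probes-only run."""
    full = annotate(variants)
    alone = annotate(probes)
    return [v for v in probes if full.get(v) != alone.get(v)]


variants = ["v1", "v2", "v3"]
probes = ["v3"]

print(find_batch_dependent_variants(annotate_ok, variants, probes))
print(find_batch_dependent_variants(annotate_capped, variants, probes))
```

If the real annotation step returned a non-empty list from this kind of comparison, that would support the idea that the result depends on how many variants are queried together.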

ireneisdoomed commented 1 year ago

Is there any study for which:

Yes, 6296 of them. Most of them have a single population, but the range is wide:

+---------------+-----+
|n_of_ancestries|count|
+---------------+-----+
|              1| 5396|
|              2|  445|
|              3|  179|
|              4|  126|
|              5|   93|
|              6|   41|
|              7|   12|
|              8|    3|
|             11|    1|
+---------------+-----+

GCST010921_1 is one example with 4 different ancestries:

+----------+------------+-----------------+------------+---------+----+---------+---------------------------+---------------------------+--------------------------------+--------------------------------+--------------+--------------+-------------------+-----------------+--------------------+
|chromosome|     studyId|        variantId|studyLocusId| position|beta|oddsRatio|betaConfidenceIntervalLower|betaConfidenceIntervalUpper|oddsRatioConfidenceIntervalLower|oddsRatioConfidenceIntervalUpper|pValueMantissa|pValueExponent|subStudyDescription|finemappingMethod|         credibleSet|
+----------+------------+-----------------+------------+---------+----+---------+---------------------------+---------------------------+--------------------------------+--------------------------------+--------------+--------------+-------------------+-----------------+--------------------+
|         8|GCST010921_1|   8_87846458_T_C|515396081129| 87846458|0.64|     0.64|         0.4529442801568594|         0.8270557198431406|              0.5617347755047597|              0.7291697396372596|           2.0|           -11|               null|             null|                null|
|         6|GCST010921_1|6_151691713_C_CTT|515396081127|151691713|0.08|     0.08|       0.053857111553830786|        0.10614288844616922|              0.0350457239426796|             0.18261857025603945|           2.0|            -9|               null|             null|                  []|
|         2|GCST010921_1|  2_208510209_C_A|515396081124|208510209|0.08|     0.08|       0.053183618846835076|        0.10681638115316493|             0.03430840471174288|             0.18654321160579762|           5.0|            -9|               null|             null|                  []|
|         4|GCST010921_1|    4_1013447_C_T|515396081126|  1013447| 0.1|      0.1|        0.06579841823366375|        0.13420158176633626|            0.045497148909697104|             0.21979399236308278|           1.0|            -8|               null|             null|                  []|
|         7|GCST010921_1| 7_121382550_GT_G|515396081128|121382550|0.07|     0.07|        0.04555239392245196|        0.09444760607754805|             0.02765328823277266|             0.17719411734163604|           2.0|            -8|               null|             null|                  []|
|        16|GCST010921_1|  16_86681109_G_C|515396081131| 86681109|null|     null|                       null|                       null|                            null|                            null|           3.0|            -8|               null|             null|[{true, true, nul...|
|         4|GCST010921_1|    4_1000626_G_A|515396081125|  1000626|0.11|     0.11|        0.07335946414470998|        0.14664053585529002|            0.052733261570389836|             0.22945669658321027|           4.0|            -9|               null|             null|[{true, true, nul...|
|         8|GCST010921_1|  8_118914750_A_G|515396081130|118914750| 0.1|      0.1|        0.06969205376215726|        0.13030794623784275|             0.04976460229089875|             0.20094604477184508|           1.0|           -10|               null|             null|[{true, true, nul...|
+----------+------------+-----------------+------------+---------+----+---------+---------------------------+---------------------------+--------------------------------+--------------------------------+--------------+--------------+-------------------+-----------------+--------------------+

So whatever the problem is (assuming there's a single cause), it acts within a study: loci from the same study end up with null, empty and populated credible sets. I'm going to look for a pattern among these loci.
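As a starting point for that comparison, the loci can be grouped by the state of their credible set. The sketch below uses the eight GCST010921_1 rows from the table above. The contents of the populated credible sets are truncated in that table, so a `"..."` placeholder stands in for them, and the helper name is hypothetical.

```python
from collections import Counter, defaultdict


def credible_set_state(credible_set):
    """Classify a credible set as 'null', 'empty' or 'populated'."""
    if credible_set is None:
        return "null"
    if len(credible_set) == 0:
        return "empty"
    return "populated"


# (studyId, variantId, credibleSet) for the GCST010921_1 loci shown above.
# ["..."] is a placeholder: the real contents are truncated in the table.
loci = [
    ("GCST010921_1", "8_87846458_T_C", None),
    ("GCST010921_1", "6_151691713_C_CTT", []),
    ("GCST010921_1", "2_208510209_C_A", []),
    ("GCST010921_1", "4_1013447_C_T", []),
    ("GCST010921_1", "7_121382550_GT_G", []),
    ("GCST010921_1", "16_86681109_G_C", ["..."]),
    ("GCST010921_1", "4_1000626_G_A", ["..."]),
    ("GCST010921_1", "8_118914750_A_G", ["..."]),
]

# Tally the credible-set states per study.
states_by_study = defaultdict(Counter)
for study_id, _variant_id, credible_set in loci:
    states_by_study[study_id][credible_set_state(credible_set)] += 1

print(dict(states_by_study["GCST010921_1"]))
# → {'null': 1, 'empty': 4, 'populated': 3}
```

Running the same tally over every study would show which ones mix states, and whether null and empty credible sets tend to appear together.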