287,382 associations with a null credible set

ireneisdoomed commented 1 year ago

The GWAS Catalog study/locus table contains 287,382 associations with a null credible set, corresponding to 170,156 unique offending variants. These are produced in the annotate_ld step of the pipeline.

The only reason I can think of for an empty credible set is that there is no LD information for the variant in gnomAD's LD matrix. This number of nulls far exceeds my expectations.

I want to see if this is the result of another bug, or if the LD annotation is correct and we indeed don't have an LD set for those variants.

ireneisdoomed commented 1 year ago

After checking 30 offending cases, I have found:

In most cases, the variant doesn't have an LD set in the Genetics Portal (cases without a comment are variants not available in production)
10 variants do have information in production, so there is potentially a bug in the pipeline

+--------------------+
|           variantId|
+--------------------+
|     10_24562781_G_A| --> no LD set
|      2_21239871_T_C| --> no LD set
|     21_41646112_G_A| --> no LD set
|     6_130990769_C_T| --> with LD set
|      9_95507889_A_G| --> with LD set
|      1_13491073_C_T| --> no LD set
|    1_109275537_C_CT|
|10_62794413_CAAAA...|
|      11_1867384_C_T| --> with LD set
|    12_122659642_A_G| --> with LD set
|     18_10695159_A_G| --> no LD set
|     18_55324905_G_A| --> no LD set
|      2_26993105_C_T| --> with LD set
|     5_112653249_T_A| --> with LD set
|   6_56014117_T_TCAG|
|       6_6767190_G_T| --> no LD set
|      6_43360012_A_G| --> no LD set
|     7_100329813_A_G| --> with LD set
|    11_116208580_C_G| --> with LD set
|    11_115690178_C_T| --> no LD set
|     12_88442438_G_T| --> with LD set
|     13_69184292_T_C| --> no LD set
|    14_102813458_C_T| --> with LD set
|     15_83946048_G_C| --> no LD set
|     19_45356900_G_A| --> no LD set
|     20_35547876_A_G|
|      3_18678552_A_C|
|      6_98843766_A_C| --> no LD set
|     7_142770582_A_G|
|      7_43397280_G_A| --> no LD set
+--------------------+

I selected 3 variants and queried the LD matrix directly. All of them have a non-null LD set:

6_130990769_C_T, 9_95507889_A_G: variants with LD info in production
1_13491073_C_T: variant without LD info in production; here we'd gain resolution by using the gnomAD matrices instead of the 1000 Genomes ones.

+----------+------------+---------------+--------------------+-----------+
|chromosome|     studyId|      variantId|         credibleSet|credSetSize|
+----------+------------+---------------+--------------------+-----------+
|         6|GCST004326_1|6_130990769_C_T|[{null, null, nul...|        111|
|         1|GCST002539_1| 1_13491073_C_T|[{null, null, nul...|          2|
|         9|GCST000062_1| 9_95507889_A_G|[{null, null, nul...|         51|
+----------+------------+---------------+--------------------+-----------+

ireneisdoomed commented 1 year ago

Again, I see a discrepancy between the results output by the pipeline and my test data.

In my test script I was only querying the European LD matrix, so I've now included all of them, in case the variants being absent from the other matrices was the problem. It is not: the function works even when a matrix has no LD info for a variant. This is what I get:

+---------------+----------------+
|      variantId|gnomadPopulation|
+---------------+----------------+
| 1_13491073_C_T|             nfe|
|6_130990769_C_T|             nfe|
| 9_95507889_A_G|             eas|
| 9_95507889_A_G|             nfe|
+---------------+----------------+

ireneisdoomed commented 1 year ago

After submitting a new job with the study_locus restricted to my three test variants, I see the same non-null credible set information.

+---------------+-----------------+
|      variantId|size(credibleSet)|
+---------------+-----------------+
|6_130990769_C_T|              111|
|6_130990769_C_T|              111|
| 1_13491073_C_T|                2|
| 9_95507889_A_G|               51|
| 9_95507889_A_G|               47|
|6_130990769_C_T|              111|
+---------------+-----------------+

The associations are also correctly annotated and clumped. Something very silly must be going on.

ireneisdoomed commented 1 year ago

Since I am not able to reproduce the problem, I will rerun the pipeline with a bigger set of data. The only thing I can think of is that the data is partitioned differently in my 2 runs:

when I only use test data, I assume that everything is in the same partition
with the whole dataset, associations are distributed across partitions
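
One way to probe the partitioning hypothesis is to check that the annotation result does not change with the number of partitions. The following is a minimal, Spark-free sketch of that check: `split_into_partitions`, `annotate`, `run` and the toy `ld_index` are hypothetical stand-ins, not the pipeline's actual functions or data.

```python
from typing import Dict, List, Optional


def split_into_partitions(rows: List[str], n: int) -> List[List[str]]:
    """Round-robin split, mimicking rows landing in n different partitions."""
    return [rows[i::n] for i in range(n)]


def annotate(partition: List[str], ld_index: Dict[str, int]) -> Dict[str, Optional[int]]:
    """Hypothetical stand-in for LD annotation: look up each lead variant."""
    return {variant: ld_index.get(variant) for variant in partition}


def run(rows: List[str], ld_index: Dict[str, int], n_partitions: int) -> Dict[str, Optional[int]]:
    """Annotate every partition independently and merge the results."""
    merged: Dict[str, Optional[int]] = {}
    for partition in split_into_partitions(rows, n_partitions):
        merged.update(annotate(partition, ld_index))
    return merged


# Toy LD index (assumption): variant -> LD set size.
ld_index = {"6_130990769_C_T": 111, "1_13491073_C_T": 2, "9_95507889_A_G": 51}
# "not_in_index" is a made-up ID that should stay null in every run.
leads = ["6_130990769_C_T", "1_13491073_C_T", "9_95507889_A_G", "not_in_index"]

single = run(leads, ld_index, n_partitions=1)
distributed = run(leads, ld_index, n_partitions=3)

# A correct annotation step gives the same answer however the rows are split.
partition_invariant = single == distributed
```

If the real annotation step fails a check like this, the bug depends on how rows are distributed rather than on the LD data itself.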

Testing with GCST006571_1

GCST006571_1 is a study with 1343 associated variants. I have:

1194 rows (studyLocusId) with a null credible set
149 rows without a null credible set

The results of the job are practically perfect:

We have 1342 non-null credible sets for all steps: LD annotation, fine mapping and clumping

The null one makes sense because the variant is not part of the LD index:

-RECORD 0------------------------------------------------
chromosome                       | 3                    
studyId                          | GCST006571_1         
variantId                        | 3_16824346_C_T       
studyLocusId                     | 412316862676         
position                         | 16824346             
beta                             | 0.0128               
oddsRatio                        | 0.0128               
betaConfidenceIntervalLower      | 0.01012822160993422  
betaConfidenceIntervalUpper      | 0.015471778390065782 
oddsRatioConfidenceIntervalLower | 0.005153743750855497 
oddsRatioConfidenceIntervalUpper | 0.031790482398897535 
pValueMantissa                   | 6.0                  
pValueExponent                   | -21                  
qualityControls                  | []                   
subStudyDescription              | null                 
finemappingMethod                | null                 
credibleSet                      | null

The debugging datasets are stored here gs://genetics_etl_python_playground/output/python_etl/parquet/XX.XX/catalog_study_locus_debugging (sl_clumped, sl_finemapped, sl_ld_annotated)

ireneisdoomed commented 1 year ago

I've rerun the pipeline on the whole dataset (job) producing some checkpoints:

Study Locus (LD annotation + finemapping + clumping): gs://genetics_etl_python_playground/output/python_etl/parquet/XX.XX/catalog_study_locus_debugging/catalog_study_locus
LD set (df with unique study/locus/ancestry + the LD information aggregated per study/locus): gs://genetics_etl_python_playground/output/python_etl/parquet/XX.XX/catalog_study_locus_debugging/ld_set
LD R info (very similar to above, but less processed, it reflects the results of the LD matrices): gs://genetics_etl_python_playground/output/python_etl/parquet/XX.XX/catalog_study_locus_debugging/ld_r

I don't understand the results.

Only 19,437 associations with null credible set
292,790 associations with empty credible set
None of my variants are in LD R or LD Set
Extra metrics:
- 16,831,394 lead/tag pairs in LD R
- 132,650 studyLocus pairs in LD Set

The only thing I can think of is that the LD results depend on the number of queried variants.
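
Null and empty credible sets are different states and need to be counted separately. Here is a minimal plain-Python sketch of that three-way split; the function name and the toy rows are assumptions for illustration. In PySpark the analogous checks would presumably combine `Column.isNull()` with `pyspark.sql.functions.size`.

```python
from collections import Counter
from typing import List, Optional


def credible_set_state(credible_set: Optional[List[dict]]) -> str:
    """Classify a credibleSet value as 'null', 'empty' or 'populated'."""
    if credible_set is None:
        return "null"
    if len(credible_set) == 0:
        return "empty"
    return "populated"


# Toy rows (assumption): (variantId, credibleSet); IDs and contents are made up.
rows = [
    ("variant_a", None),                   # no credible set at all
    ("variant_b", []),                     # credible set present but empty
    ("variant_c", []),
    ("variant_d", [{"placeholder": True}]),  # at least one tag variant
]

counts = Counter(credible_set_state(cs) for _, cs in rows)
```

Counting the two states separately makes it possible to see whether the null and the empty cases arise at different steps.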

ireneisdoomed commented 1 year ago

Is there any study for which:

there are loci with a null credible set
there are loci with an empty credible set
there are loci with a healthy credible set?

Yes, 6296 of them. Most of them have a single population, but the spectrum is large:

+---------------+-----+
|n_of_ancestries|count|
+---------------+-----+
|              1| 5396|
|              2|  445|
|              3|  179|
|              4|  126|
|              5|   93|
|              6|   41|
|              7|   12|
|              8|    3|
|             11|    1|
+---------------+-----+
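
The question above amounts to grouping loci by study and keeping the studies whose loci cover all three states. This is a minimal plain-Python sketch of that filter; the helper names, study IDs and rows are hypothetical, and the real query would run on the study/locus table.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple

ALL_STATES = {"null", "empty", "populated"}


def state(credible_set: Optional[List[dict]]) -> str:
    """Return 'null', 'empty' or 'populated' for a credibleSet value."""
    if credible_set is None:
        return "null"
    return "populated" if credible_set else "empty"


def studies_with_all_states(rows: List[Tuple[str, Optional[List[dict]]]]) -> List[str]:
    """Return the studies whose loci show every credible set state."""
    states_by_study: Dict[str, Set[str]] = defaultdict(set)
    for study_id, credible_set in rows:
        states_by_study[study_id].add(state(credible_set))
    return sorted(s for s, seen in states_by_study.items() if seen == ALL_STATES)


# Toy rows (assumption): (studyId, credibleSet) pairs with made-up IDs.
rows = [
    ("study_a", None),
    ("study_a", []),
    ("study_a", [{"placeholder": True}]),
    ("study_b", None),
    ("study_b", [{"placeholder": True}]),
]

mixed_studies = studies_with_all_states(rows)
```

Only `study_a` qualifies here, because `study_b` has no locus with an empty credible set.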

GCST010921_1 is one example with 4 different ancestries:

+----------+------------+-----------------+------------+---------+----+---------+---------------------------+---------------------------+--------------------------------+--------------------------------+--------------+--------------+-------------------+-----------------+--------------------+
|chromosome|     studyId|        variantId|studyLocusId| position|beta|oddsRatio|betaConfidenceIntervalLower|betaConfidenceIntervalUpper|oddsRatioConfidenceIntervalLower|oddsRatioConfidenceIntervalUpper|pValueMantissa|pValueExponent|subStudyDescription|finemappingMethod|         credibleSet|
+----------+------------+-----------------+------------+---------+----+---------+---------------------------+---------------------------+--------------------------------+--------------------------------+--------------+--------------+-------------------+-----------------+--------------------+
|         8|GCST010921_1|   8_87846458_T_C|515396081129| 87846458|0.64|     0.64|         0.4529442801568594|         0.8270557198431406|              0.5617347755047597|              0.7291697396372596|           2.0|           -11|               null|             null|                null|
|         6|GCST010921_1|6_151691713_C_CTT|515396081127|151691713|0.08|     0.08|       0.053857111553830786|        0.10614288844616922|              0.0350457239426796|             0.18261857025603945|           2.0|            -9|               null|             null|                  []|
|         2|GCST010921_1|  2_208510209_C_A|515396081124|208510209|0.08|     0.08|       0.053183618846835076|        0.10681638115316493|             0.03430840471174288|             0.18654321160579762|           5.0|            -9|               null|             null|                  []|
|         4|GCST010921_1|    4_1013447_C_T|515396081126|  1013447| 0.1|      0.1|        0.06579841823366375|        0.13420158176633626|            0.045497148909697104|             0.21979399236308278|           1.0|            -8|               null|             null|                  []|
|         7|GCST010921_1| 7_121382550_GT_G|515396081128|121382550|0.07|     0.07|        0.04555239392245196|        0.09444760607754805|             0.02765328823277266|             0.17719411734163604|           2.0|            -8|               null|             null|                  []|
|        16|GCST010921_1|  16_86681109_G_C|515396081131| 86681109|null|     null|                       null|                       null|                            null|                            null|           3.0|            -8|               null|             null|[{true, true, nul...|
|         4|GCST010921_1|    4_1000626_G_A|515396081125|  1000626|0.11|     0.11|        0.07335946414470998|        0.14664053585529002|            0.052733261570389836|             0.22945669658321027|           4.0|            -9|               null|             null|[{true, true, nul...|
|         8|GCST010921_1|  8_118914750_A_G|515396081130|118914750| 0.1|      0.1|        0.06969205376215726|        0.13030794623784275|             0.04976460229089875|             0.20094604477184508|           1.0|           -10|               null|             null|[{true, true, nul...|
+----------+------------+-----------------+------------+---------+----+---------+---------------------------+---------------------------+--------------------------------+--------------------------------+--------------+--------------+-------------------+-----------------+--------------------+

So whatever the problem is (assuming there's a single cause), it affects the logic within a study. I'm going to see if I can find any pattern among these loci.
