Expected vs Observed z-scores are not the same

tamil-acog commented 2 months ago

Hi team,

I am trying to build a fine-mapping pipeline with SuSiE, and my results are sometimes off. Below I describe my input data and the issues I am facing; please help me resolve them if possible.

I use UKBB data for fine-mapping. Both my summary statistics and my LD matrix come from UKBB data.

After going through some of the discussions in the issues, I found that I should do the following:

1) ESTIMATE_RESIDUAL_VARIANCE = False
2) Calculate the LD matrix with the built-in R "cor()" function rather than plink.

After adjusting my pipeline with the above changes, I face the following issues:
1) The expected vs observed z-scores plot still doesn't match exactly, even though I have an in-sample LD matrix.
2) It takes a very long time to calculate the correlation matrix using built-in R. Is there a better way to do it?
3) Sometimes I don't get any credible sets. What would be an ideal value for the "coverage" parameter?

Expected vs Observed plot:

Z-scores distribution:

pcarbo commented 2 months ago

@tamil-acog The first thing that jumps out at me is that your association results don't seem very strong. I presume you first ran a basic association analysis (in PLINK, for example)? What were the smallest p-values from this association analysis? If the association results are not strong enough, it may not make sense to perform fine-mapping in this region. (Typically we look for p-values smaller than approximately 1e-8, although this may be different in UK Biobank depending on how the association analysis is conducted.)
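
As a rough way to run this check from z-scores alone, the sketch below converts z-scores to two-sided p-values under a standard normal null and tests whether any falls below the approximate 1e-8 threshold mentioned above. The function names and the example z-scores are made up for illustration, and the threshold is only the approximate value from this comment, not a fixed rule.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a z-score under a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def has_strong_signal(z_scores, threshold=1e-8):
    """Return True if any variant's p-value is below the threshold."""
    return any(two_sided_p(z) < threshold for z in z_scores)

# Hypothetical z-scores for a small region (not from the thread).
z_scores = [1.2, -0.4, 3.1, -6.0]
smallest_p = min(two_sided_p(z) for z in z_scores)
print(f"smallest p-value: {smallest_p:.3g}")
print("any p-value below threshold:", has_strong_signal(z_scores))
```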

tamil-acog commented 2 months ago

Hi, thank you very much for the timely response. I got your point; I checked the p-values and you were right. Thanks.

But my concern is mainly with the "expected vs observed" z-scores. I checked other traits and got some hits in the credible sets there. Still, the "expected vs observed" plot looks the same as above, even though my LD matrix is in-sample.

Some info:

My reference panel for the LD matrix is the UKBB 450k data. My GWAS also comes from this data only, so it is an in-sample LD matrix.
I use plink to calculate the LD matrix ("plink --bfile mydata --extract variants.txt --keep-allele-order --r --matrix --out ld_matrix").
For the latest run, where I got some credible sets, my Lambda was 0.0442
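
Before comparing z-scores against the LD matrix, it can help to confirm that the exported matrix looks like a valid correlation matrix. This is a minimal sketch, assuming the matrix file has already been parsed into a list of rows of floats. The helper name `check_ld_matrix` and the tolerance are illustrative and are not part of plink or susieR.

```python
import math

def check_ld_matrix(R, tol=1e-6):
    """Return a list of problems found in a candidate correlation matrix."""
    n = len(R)
    if any(len(row) != n for row in R):
        return ["matrix is not square"]
    problems = []
    for i in range(n):
        for j in range(i, n):
            r = R[i][j]
            if math.isnan(r) or math.isnan(R[j][i]):
                problems.append(f"entry ({i}, {j}) is NaN")
            elif i == j:
                if abs(r - 1.0) > tol:
                    problems.append(f"diagonal entry {i} is {r}, expected 1")
            else:
                if abs(r - R[j][i]) > tol:
                    problems.append(f"entries ({i}, {j}) and ({j}, {i}) are not symmetric")
                if abs(r) > 1.0 + tol:
                    problems.append(f"entry ({i}, {j}) is outside [-1, 1]")
    return problems

# Toy 2x2 example with made-up values.
print(check_ld_matrix([[1.0, 0.5], [0.5, 1.0]]))
```

An empty list means none of these basic problems were found. It does not by itself show that the matrix is consistent with the z-scores.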

My questions:

Why is my "expected vs observed" plot still off compared to the straight line shown in the susie examples? I went through some GitHub issues and read that, if the LD matrix is in-sample, we are supposed to get a straight line. What am I missing here? And is it OK that my plot is off even though the LD is in-sample?
What method other than plink is recommended for calculating the LD matrix? I tried the built-in R "cor" function, but I am unable to fully parallelize the operation and it takes a very long time. (I am asking because I read in a GitHub post that the built-in R cor function has smaller round-off errors than plink, and I suspect this is the reason my expected vs observed plot is off.)
What is an acceptable range of Lambda?
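
For intuition about what an "expected vs observed" comparison measures, consider the simplest case of two variants with LD correlation r. Assuming the z-scores are jointly normal with mean zero and correlation matrix R (a null-model simplification, and not necessarily the exact model behind the susieR diagnostic), the expected z-score of one variant given the other is r times the other z-score, with variance 1 - r^2. The toy sketch below uses made-up numbers and hypothetical function names.

```python
import math

def expected_z(z_other, r):
    """Conditional mean and standard deviation of z given a neighbor's z-score."""
    mean = r * z_other
    sd = math.sqrt(max(0.0, 1.0 - r * r))
    return mean, sd

def standardized_difference(z_obs, z_other, r):
    """Observed minus expected z-score, in units of the conditional sd."""
    mean, sd = expected_z(z_other, r)
    if sd == 0.0:
        return 0.0 if z_obs == mean else math.inf
    return (z_obs - mean) / sd

# Made-up values: an observed z-score of 4.6 next to a neighbor with z = 5.0.
print(standardized_difference(4.6, 5.0, 0.9))   # LD agrees with the z-scores
print(standardized_difference(4.6, 5.0, -0.9))  # LD disagrees with the z-scores
```

Under this simplification, points that sit far from the diagonal in such a plot correspond to large standardized differences, meaning the z-scores and the LD matrix disagree for those variants.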

pcarbo commented 2 months ago

Hi @tamil-acog, I'm not super familiar with PLINK, but this does look like the right approach. Did you also run your association analysis in PLINK?

I will note that others have encountered challenges in making the z-scores and LD consistent, so you are far from the only one. See for example Issue 207; I recommend searching the Issues on GitHub for other discussion.

It might also be helpful to review the steps we took to generate the association statistics and LD matrices for our PLoS Genetics paper. The scripts can be found here.

Hope this helps.

stephenslab / susieR

Expected vs Observed z-scores are not the same #236