different sets in high LD

chr1swallace commented 3 years ago

Occasionally, I get a run of susie_rss() that puts variants in different credible sets although they have high LD (perhaps r2=1). When this happens, I can resolve this issue - ie force those high LD sets to merge, by rerunning with L=length(s$sets$cs)-1 where s is the output of the first susie_rss(). But is this sensible, is there a better way?

NB I do not have access to the original genotype data, so I am using an LD matrix from a reference population rather than the sample population. Whilst I think it is a good match, I realise this could cause issues.

gaow commented 3 years ago

If r2 = 1, the two corresponding z-scores should be identical if they match the original data. I assume in your case they will be a bit different; but how different are they? Just to double-check, 1) are the two variants in different credible sets both have high PIP close to 1 and the corresponding CS only contain one variant? 2) have you used z_ld_weight in susie_rss function call to try accounting for small discrepancy between LD reference and z-scores?

I think what you did make sense but I'm trying to see if we can solve it with a less heuristic approach. Could you also double-check the estimated residual variance? We have seen in the past that residual estimates ended up quite small as many CS having only one variant were generated for L = 10 (a large L). It would also help if it is possible to share an example data for us to take a look at (you can remove the variant ID so we just get a matrix of numbers without knowing what the data is about)

pcarbo commented 3 years ago

@gaow Do you think this could be resolved with a better intialization?

gaow commented 3 years ago

@pcarbo I think a better (or at least alternative) initialization is what @chr1swallace has done; and I'd like to get some additional information and understand the problem better before deciding if initialization is all we need to work this out.

chr1swallace commented 3 years ago

Thanks for the quick reply (and for the great work)!

The data are public, so happy to share, but it takes a while to run. I can produce the same result thinning to SNPs with smaller p values (<0.1) - is that useful to you?

Re your questions, the sets are

s$sets$cs $L2 [1] 1156

$L1 [1] 1159 1160 1161 1164 1165 1166 1167

lapply(s$sets$cs, function(i) z[i]) $L2 21:39092998 11.67021

$L1 21:39093586 21:39093608 21:39093975 21:39094373 21:39094542 21:39094644 -11.83871 -11.83957 -11.86096 -11.91935 -11.86631 - 11.87701 21:39094818

-11.83957

LD[s$sets$cs[[2]],s$sets$cs[[1]]] 21:39093586 21:39093608 21:39093975 21:39094373 21:39094542 21:39094644 1 1 1 1 1 1

21:39094818 1

So z scores are similar but not identical, while the (reference population) r2=1.

I set z_ld_weight=1/503 because the ref pop has 503 individuals. Perhaps this is just stupidly small and I should use a larger reference population?

On Mon, 2020-12-14 at 06:57 -0800, gaow wrote:

If r2 = 1, the two corresponding z-scores should be identical if they match the original data. I assume in your case they will be a bit different; but how different are they? Just to double-check, 1) are the two variants in different credible sets both have high PIP close to 1 and the corresponding CS only contain one variant? 2) have you used z_ld_weight in susie_rss function call to try accounting for small discrepancy between LD reference and z-scores?

I think what you did make sense but I'm trying to see if we can solve it with a less heuristic approach. Could you also double-check the estimated residual variance? We have seen in the past that residual estimates ended up quite small as many CS having only one variant were generated for L = 10 (a large L). It would also help if it is possible to share an example data for us to take a look at (you can remove the variant ID so we just get a matrix of numbers without knowing what the data is about)

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe.

zouyuxin commented 3 years ago

It's better to have the dataset without thinning. I'll take a look and thining SNPs if it takes a long time to run. Could you briefly describe the background about your data? How did you get z scores and what's the reference panel?

stephens999 commented 3 years ago

I notice the z scores have opposite signs. Is it possible it is an allele switch issue or something like that? Is the correlation between them in the reference correlation matrix (R) 1 or -1? [just to be clear, note that the entries in the matrix R should be the correlations, not r2]

chr1swallace commented 3 years ago

That's a good spot, I'll investigate

⁣http://chr1swallace.github.io

On 14 Dec 2020, 17:22, at 17:22, stephens999 notifications@github.com wrote:

I notice the z scores have opposite signs. Is it possible it is an allele switch issue or something like that? Is the correlation between them in the reference correlation matrix (R) 1 or -1? [just to be clear, note that the entries in the matrix R should be the correlations, not r2]

-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/stephenslab/susieR/issues/114#issuecomment-744588003

chr1swallace commented 3 years ago

That fixes it - an ambiguous G/C SNP that I failed to properly align. Thank you so much, and apologies for not spotting that myself!

On Mon, 2020-12-14 at 17:29 +0000, Chris Wallace wrote:

That's a good spot, I'll investigate

http://chr1swallace.github.io On 14 Dec 2020, at 17:22, stephens999 notifications@github.com wrote:

I notice the z scores have opposite signs. Is it possible it is an allele switch issue or something like that? Is the correlation between them in the reference correlation matrix (R) 1 or -1? [just to be clear, note that the entries in the matrix R should be the correlations, not r2]

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

stephenslab / susieR

different sets in high LD #114