stephenslab / susieR

R package for "sum of single effects" regression.
https://stephenslab.github.io/susieR

Is it possible to test the hypothesis that the significant variants are in a set of SNPs? #196

Open garyzhubc opened 12 months ago

garyzhubc commented 12 months ago

Is it possible to test the hypothesis that the significant variants are in a set of SNPs?

garyzhubc commented 11 months ago

Not just sets from the susie_get_cs()

stephens999 commented 11 months ago

Sorry, I don't think we understand the question.

garyzhubc commented 11 months ago

Can I test a hypothesis like "beta_2, beta_3 ≠ 0 and beta_1, beta_4 = 0"?

stephens999 commented 11 months ago

perhaps the simplest way would be to sample from the (approximate) posterior distribution on beta, and then use that sample to compute the probability of any particular combination of 0/non-zero beta values.
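
A minimal sketch of this suggestion (the simulated data here is hypothetical; `susie_get_posterior_samples` is the susieR function for drawing from the approximate posterior):

```r
library(susieR)

# Simulate a toy data set where variables 2 and 3 have nonzero effects.
set.seed(1)
n <- 500; p <- 10
X <- matrix(rnorm(n * p), n, p)
beta <- rep(0, p); beta[c(2, 3)] <- 1
y <- drop(X %*% beta + rnorm(n))

fit <- susie(X, y)
num_samples <- 1000
samp <- susie_get_posterior_samples(fit, num_samples)

# Estimate P(beta_2 != 0 or beta_3 != 0, and all other betas = 0)
# as the fraction of posterior samples satisfying that event.
event <- sapply(seq_len(num_samples), function(i) {
  b <- samp$b[, i]
  any(b[2:3] != 0) && all(b[-(2:3)] == 0)
})
mean(event)
```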

garyzhubc commented 11 months ago

Something like this (test the hypothesis that significant variants are in the middle)?

posterior<-unlist(lapply(1:1000, function(i) (max(samp$b[40:60,i])>0 | min(samp$b[40:60,i])<0) & (samp$b[1,1:39] == 0 & samp$b[1,61:101] == 0)))

which results in:

> mean(posterior)
[1] 0.9756098
stephens999 commented 11 months ago

I think the code doesn't look quite right, but something like that, yes

garyzhubc commented 11 months ago

I made an edit on the code. Could you tell me why it doesn't look right?

stephens999 commented 11 months ago

you have samp$b[1,1:39] but samp$b[40:60,i]

so the indices of b don't look consistent through your code

garyzhubc commented 11 months ago

Thanks for spotting this mistake. I'm now using:

> posterior<-unlist(lapply(1:num_samples, function(i) (max(samp$b[40:60,i])>0 | min(samp$b[40:60,i])<0) & all(c(samp$b[1:39,i] == 0, samp$b[61:101,i] == 0))))
> mean(posterior)
[1] 0
> posterior<-unlist(lapply(1:num_samples, function(i) (max(samp$b[1:39,i])>0 | min(samp$b[1:39,i])<0) & all(samp$b[40:101,i] == 0)))
> mean(posterior)
[1] 0
> posterior<-unlist(lapply(1:num_samples, function(i) (max(samp$b[61:101,i])>0 | min(samp$b[61:101,i])<0) & all(samp$b[1:60,i] == 0)))
> mean(posterior)
[1] 0

which I believe is consistent. However, these all have very small probabilities. Do you recommend, instead of testing all(samp$b[1:60,i] == 0), trying something like all(abs(samp$b[1:60,i]) < epsilon), or do you think what I'm doing is okay?

garyzhubc commented 9 months ago

Can anyone respond to this issue please? @pcarbo

pcarbo commented 9 months ago

Working with very small probabilities is a common issue, and there are some ways to deal with it. If you can share a reproducible example illustrating exactly what you are trying to do, I might be able to help.

As a general piece of advice, I recommend starting with an example that is as simple as possible, e.g., an example with exactly 4 variables X.

garyzhubc commented 9 months ago

An example with four variables b1, b2, b3, b4: suppose I want to calculate P(b1 = 0 and b4 = 0 and (b2 != 0 or b3 != 0)) as the probability that the causal variant is in {b2, b3}. Should I run the program below to get that probability?

samp<-susie_get_posterior_samples(res_, num_samples)
posterior23<-unlist(lapply(1:num_samples, function(i) (max(samp$b[2:3,i])>0 | min(samp$b[2:3,i])<0) & all(c(samp$b[1,i] == 0, samp$b[4,i] == 0))))
mean(posterior23)

If I rank posterior1, posterior2, posterior3, posterior12, posterior23, posterior13, posterior123 and select the ones with probability greater than 0.95, will I get the same credible sets (default coverage = 0.95) as:

susie_get_cs(res_)

pcarbo commented 9 months ago

@garyzhubc This is not a reproducible example because some variables in your code (e.g., res_) are not defined.

stephens999 commented 8 months ago

However, it looks like these all have very small probabilities. Do you recommend instead of testing all(samp$b[1:60,i] == 0) try something like all(abs(samp$b[1:60,i]) < epsilon), or do you think what I'm doing is okay?

I think what you are doing looks OK, and you are just getting the answer that there is a very small probability of the event you are looking at. You could also try the epsilon approach you suggested.
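
The epsilon variant mentioned above could look like this (a sketch continuing from the `samp` and `num_samples` produced by `susie_get_posterior_samples` earlier in the thread; the threshold value is arbitrary):

```r
# Treat |beta| < eps as "effectively zero" instead of requiring exact zeros,
# in case the posterior samples contain tiny nonzero values.
eps <- 1e-6
posterior <- sapply(seq_len(num_samples), function(i) {
  b <- samp$b[, i]
  any(abs(b[40:60]) >= eps) && all(abs(b[c(1:39, 61:101)]) < eps)
})
mean(posterior)
```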

garyzhubc commented 8 months ago

I also tried using PIP directly instead of sampling.

prod(1-res$pip[1:34])*(1-prod(1-res$pip[35:68]))*prod(1-res$pip[69:101])

still gives probability zero. See https://github.com/stephenslab/susieR/issues/203#issuecomment-1773921285

garyzhubc commented 8 months ago

@garyzhubc This is not a reproducible example because some variables in your code (e.g., res_) are not defined.

I could do the same on this example https://stephenslab.github.io/susieR/articles/sparse_susie_eval.html:

library(susieR)
library(Matrix)  # for the CsparseMatrix class

create_sparsity_mat = function(sparsity, n, p) {
  nonzero          <- round(n*p*(1-sparsity))
  nonzero.idx      <- sample(n*p, nonzero)
  mat              <- numeric(n*p)
  mat[nonzero.idx] <- 1
  mat              <- matrix(mat, nrow=n, ncol=p)
  return(mat)
}
n <- 1000
p <- 1000
beta <- rep(0,p)
beta[c(1,300,400,1000)] <- 10 
X.dense  <- create_sparsity_mat(0.99,n,p)
X.sparse <- as(X.dense,"CsparseMatrix")
y <- c(X.dense %*% beta + rnorm(n))
susie.sparse <- susie(X.sparse,y)

Using Monte Carlo sample from posterior:

num_samples<-10000
samp<-susie_get_posterior_samples(susie.sparse, num_samples)
posterior<-unlist(lapply(1:num_samples, function(i) (max(samp$b[341:680,i])>0 | min(samp$b[341:680,i])<0) & all(c(samp$b[1:340,i] == 0, samp$b[681:1000,i] == 0))))
mean(posterior)

Using PIP:

prod(1-susie.sparse$pip[1:340])*(1-prod(1-susie.sparse$pip[341:680]))*prod(1-susie.sparse$pip[681:1000])

Both give probability zero.
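
One way to check whether that zero is a genuine tiny probability or floating-point underflow is to do the PIP computation in log space (a sketch continuing from the `susie.sparse` fit above; like the original product, it assumes the inclusion events are independent across variables):

```r
# Log-space version of prod(1-pip[1:340]) * (1-prod(1-pip[341:680])) * prod(1-pip[681:1000]).
# sum(log1p(-pip)) computes log(prod(1 - pip)) without underflow.
pip <- susie.sparse$pip
log_p <- sum(log1p(-pip[1:340])) +
         log1p(-exp(sum(log1p(-pip[341:680])))) +  # log(1 - prod(1 - pip))
         sum(log1p(-pip[681:1000]))
log_p  # a finite large-negative value indicates underflow, -Inf a genuine zero
```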

pcarbo commented 8 months ago

@garyzhubc I think the issue is that in your example all the inclusion probabilities are either 1 or very, very small, so it may be tricky to use a naive Monte Carlo sampling approach:

hist(log10(susie.sparse$alpha),n = 64)

One idea that comes to mind is importance sampling, but you might want to start with an example where the probabilities are less extreme.