stephenslab / susieR

R package for "sum of single effects" regression.
https://stephenslab.github.io/susieR
Other
176 stars 45 forks source link

Under the global null model, choice of L affects PIP #76

Closed aksarkar closed 5 years ago

aksarkar commented 5 years ago

If none of the predictors have a detectable effect, then the PIP reported by susieR does not match what we might expect for L > 1.

Under the spike-and-slab prior, I would expect the PIP to be approximately 1 / p. But under the susie variational approximation, the PIP depends on the number of assumed effects L (i.e. number of variational parameters for the causal indicator variable).

We found this applying susieR to a variable selection problem where we didn't pre-screen using marginal association statistics (as in genetic fine mapping). We just wanted to solve multiple regression with possibly correlated predictors.

Minimal example:

> set.seed(1)
> susieR::susie(matrix(rnorm(500 * 5), nrow=500), rnorm(500), L=1)$pip
[1] 0.2430892 0.2804019 0.2182692 0.1248026 0.1334370
> susieR::susie(matrix(rnorm(500 * 5), nrow=500), rnorm(500), L=5)$pip
[1] 0.5864162 0.5968658 0.7780974 0.6788985 0.6898738
> sessionInfo()
Info()
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Scientific Linux 7.4 (Nitrogen)

Matrix products: default
BLAS: /scratch/midway2/aksarkar/miniconda3/envs/scqtl/lib/R/lib/libRblas.so
LAPACK: /scratch/midway2/aksarkar/miniconda3/envs/scqtl/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] susieR_0.6.2.0394

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.16       roxygen2_6.1.0     matrixStats_0.54.0 lattice_0.20-34   
 [5] digest_0.6.12      withr_2.1.2        commonmark_1.5     grid_3.4.1        
 [9] R6_2.2.0           magrittr_1.5       rlang_0.1.2        stringi_1.2.4     
[13] testthat_2.0.0     xml2_1.2.0         Matrix_1.2-14      devtools_1.13.5   
[17] tools_3.4.1        stringr_1.3.1      compiler_3.4.1     memoise_1.1.0     
[21] expm_0.999-3      
> 

Revision 8a4f717

gaow commented 5 years ago

@aksarkar yes, though PIP for each effect makes sense.

> res$alpha
          [,1]      [,2]      [,3]      [,4]      [,5]
[1,] 0.1875131 0.2290305 0.2017719 0.1762895 0.2053949
[2,] 0.1870604 0.2310171 0.2012803 0.1760361 0.2046060
[3,] 0.1865364 0.2326830 0.2009511 0.1756878 0.2041418
[4,] 0.1863538 0.2328922 0.2009752 0.1755439 0.2042350
[5,] 0.1866494 0.2315677 0.2012928 0.1757278 0.2047623

It is a case when the purity filter should have worked, that is:

> res = susieR::susie(matrix(rnorm(500 * 5), nrow=500), rnorm(500), L=5)
> res$sets
$cs
NULL

$coverage
[1] 0.95

In other words the reported PIP should all be zero ... Previously it is a separate function that takes only index from CS to compute PIP. It is a new interface artifact when I combined susie_get_pip inside the susie call. I'll fix it.

BTW other approaches that "filters" L will also help, eg, the null_weight will help, even for a very minor null weight:

> res = susieR::susie(matrix(rnorm(500 * 5), nrow=500), rnorm(500), L=5, null_weight=0.5)
> res$pip
[1] 0.2828823 0.3588216 0.3345083 0.3973354 0.2260887 0.9926927
> res$null_index
[1] 6

(the last PIP is from the NULL component.)

Or simply let it esitmate prior variance,

> res = susieR::susie(matrix(rnorm(500 * 5), nrow=500), rnorm(500), L=5, estimate_prior_variance=T)
> res$V
[1] 0 0 0 0 0

But we have not formally implemented something to prune from these V estimates. Our paper recommends "purity" (absolute minimum correlation) filter, which is by default 0.5 in the software implementation.

I'll fix that interface bug now.

gaow commented 5 years ago

@aksarkar So with the patch above,

> set.seed(1)
> susieR::susie(matrix(rnorm(500 * 5), nrow=500), rnorm(500), L=5)$pip
[1] 0 0 0 0 0

Does this make sense? This is what I did for the numerical studies in the paper anyways.

@stephens999 in our PIP calculation section in the manuscript we did not factor in the CS consideration -- our summation was from l = 1 to L. Maybe we want to change that.

pcarbo commented 5 years ago

Under the spike-and-slab prior, I would expect the PIP to be approximately 1 / p. But under the susie variational approximation, the PIP depends on the number of assumed effects L (i.e. number of variational parameters for the causal indicator variable).

@aksarkar This is what I would expect. What is the issue you are reporting?

@gaow Would it be possible to compute a likelihood or Bayes factor for different values of L in order to select L?

gaow commented 5 years ago

@pcarbo yes we do report that. For the example above,

> exp(res$lbf)
[1] 0.2180514 0.2180175 0.2181433 0.2182756 0.2182854
> 

but we do not use it for selecting L. At least in my view the purity filter is more intuitive and the threshold there is easier to interpret than BF cutoffs.

aksarkar commented 5 years ago

@pcarbo In the following case, susieR does not report 1 / p:

> susieR::susie(matrix(rnorm(500 * 5), nrow=500), rnorm(500), L=5)$pip
[1] 0.5864162 0.5968658 0.7780974 0.6788985 0.6898738

AIUI, this comes from the fact that we estimate PIPj = 1 - prod{l=1}^L (1 - alpha_lj)

pcarbo commented 5 years ago

I just installed susieR version 0.6.2.0395 from GitHub and I got all zeros:

susieR::susie(matrix(rnorm(500 * 5), nrow=500), rnorm(500), L=5)$pip
# [1] 0 0 0 0 0

Did I do something wrong?

aksarkar commented 5 years ago

@pcarbo I was on 0.6.2.0394. I think @gaow already pushed the fix he described above.

stephens999 commented 5 years ago

I don't want to prune before pip calculation by default

stephens999 commented 5 years ago

The pips reported by Susie are reasonable under the Susie model a I think. They should not be 0 under the model

stephens999 commented 5 years ago

1/p would only be approximately correct under assumption L=1. For spike and slab the pips will depend on prior hyperparameters. We could discuss further in person. But the key point is that we have a model, and the pips we report are computed under that model. If we prune the CS before report it is less clear what the pips represent.

stephens999 commented 5 years ago

I don't see any problem with pips depending on L. They similarly depend on pi0 in spike and slab...

gaow commented 5 years ago

I don't see any problem with pips depending on L. They similarly depend on pi0 in spike and slab...

I think the distinction maybe L is not a prior, but some instruments we put for the computation, and how we interpret it after. Previously susie_get_pip was not built to the SuSiE model fitting function (susieR::susie), and the model fitting part correctly reports those alpha as 1/p (shown in my first post). The susie_get_pip used to be a separate function that we report PIP in light of our belief on how many L we should report. To me, PIP = 0 (which implies our "estimated" L = 0) is informative if I understand how it happened ... We can discuss this behavior for the software, and in other words, how to best communicate this to the users.

@stephens999 just to make it clear that in our numerical studies I did prune before computing PIP.

stephens999 commented 5 years ago

Can you redo without pruning?

On Wed, Dec 5, 2018, 13:09 gaow <notifications@github.com wrote:

I don't see any problem with pips depending on L. They similarly depend on pi0 in spike and slab...

I think the distinction maybe L is not a prior, but some instruments we put for the computation, and how we interpret it after. Previously susie_get_pip was not built to the SuSiE model fit, and the model fitting part correctly reports those alpha as 1/p (shown in my first post). The susie_get_pip used to be a separate function that we report PIP in light of our belief on how many L we should report.To me, PIP = 0 is informative if I understand how it happened ... We can discuss this behavior for the software, and in other words, how to best communicate this to the users.

@stephens999 https://github.com/stephens999 just to make it clear that in our numerical studies I did prune before computing PIP.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/stephenslab/susieR/issues/76#issuecomment-444585074, or mute the thread https://github.com/notifications/unsubscribe-auth/ABt4xdZDQwsLVW4ecgua0aJX3fxP7H6jks5u2AvIgaJpZM4ZC_dl .

gaow commented 5 years ago

Sure, although in our settings I doubt it will make any difference.

stephens999 commented 5 years ago

I agree. That's why to keep things simple.

On Wed, Dec 5, 2018, 16:05 gaow <notifications@github.com wrote:

Sure, although in our settings I doubt it will make any difference.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/stephenslab/susieR/issues/76#issuecomment-444647583, or mute the thread https://github.com/notifications/unsubscribe-auth/ABt4xXjvfM5roJTCoaC8gIl8-bGAshTOks5u2DUQgaJpZM4ZC_dl .

gaow commented 5 years ago

Okay I've reverted the change for now -- we can discuss more later:

> set.seed(1); dat = susieR::susie(matrix(rnorm(500 * 5), nrow=500), rnorm(500), L=5)
> dat$pip
[1] 0.6955438 0.7137817 0.6906197 0.6219080 0.6316382
> dat$sets
$cs
NULL

$coverage
[1] 0.95

Notice that:

  1. Without purity cutoff we end up having one CS anyways because the 5 CS here are identical and we removed duplicate CS.
> set.seed(1); dat = susieR::susie(matrix(rnorm(500 * 5), nrow=500), rnorm(500), L=5, min_abs_cor = 0)
> dat$sets
$cs
$cs$L1
[1] 1 2 3 4 5

$purity
   min.abs.corr mean.abs.corr median.abs.corr
L1  0.006593154      0.226734      0.03793143

$cs_index
[1] 1

$coverage
[1] 0.95

(as discussed offline we can imaging overlapping CS will be a bit tricker).

  1. Here is how null_weight will in fact change the resulting PIP without pruning:
>  set.seed(1); susieR::susie(matrix(rnorm(500 * 5), nrow=500), rnorm(500), L=5, null_weight=0.1)$pip
[1] 0.4650484 0.4883830 0.4563233 0.3698489 0.3813200
>  set.seed(1); susieR::susie(matrix(rnorm(500 * 5), nrow=500), rnorm(500), L=5, null_weight=0.5)$pip
[1] 0.14815783 0.16344121 0.13957722 0.09120890 0.09645865
>  set.seed(1); susieR::susie(matrix(rnorm(500 * 5), nrow=500), rnorm(500), L=5, null_weight=0.8)$pip
[1] 0.04789028 0.05426298 0.04385576 0.02624462 0.02796934
>  set.seed(1); susieR::susie(matrix(rnorm(500 * 5), nrow=500), rnorm(500), L=5, null_weight=0.99)$pip
[1] 0.002172408 0.002503407 0.001952651 0.001119151 0.001196372