In principle, this is the correct place for your query. However, I cannot really advise on the use of stability selection with graphical models, as I am no expert in the latter. That said, there is some literature on this combination:
- The original paper (http://onlinelibrary.wiley.com/doi/10.1111/j.1467-9868.2010.00740.x/abstract) already considers graphical models.
- Another paper on graphical models with stability selection can be found here: http://www.sciencedirect.com/science/article/pii/S0167947313000789
Searching the web will surely turn up more examples of various flavors of graphical modelling with stability selection.
Regarding your general questions about stability selection:
As shown in our article (http://www.biomedcentral.com/content/pdf/s12859-015-0575-3.pdf), q should be chosen large enough to capture all anticipated variables but (usually much) smaller than the number of available predictors. Meinshausen and Bühlmann (http://onlinelibrary.wiley.com/doi/10.1111/j.1467-9868.2010.00740.x/abstract) propose in one place to choose q = sqrt(0.8 * p) or q = sqrt(0.8 * alpha * p), where alpha is, for example, 0.05 (i.e., the significance level). Yet, these choices are not applicable in all cases. I'd suggest experimenting a bit and having a look at the selection frequencies, while keeping an eye on the PFER.
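As a rough numerical illustration of that rule of thumb (the value p = 500 below is made up):
p <- 500                ## number of candidate predictors
sqrt(0.8 * p)           ## = 20
sqrt(0.8 * 0.05 * p)    ## approx. 4.5 with alpha = 0.05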
Regarding your final question: how is your data structured, i.e., what do you mean by grouped data? Do you have multiple measurements on the same subject? In that case, you should perhaps consider resampling individuals instead of resampling observations. That said, I haven't seen any stability selection application with grouped data where the grouping was taken into account.
Please note that subsamples of size n/2 are important for the derivation of the bound on the PFER, so I am not sure what impact a different resampling scheme would have on the theoretical properties!
Thanks for your comments - I realise this is a potentially tricky question. Are you aware of any R packages for stability selection with graphical models? I didn't see any during a quick search.
Sorry, I don't know of such a package. (I also don't know of any other package that implements the Shah/Samworth bounds, which are usually preferable.)
However, I would love to add the relevant functions to stabs. What I would need is a function that takes arguments x, y and q (further arguments can be passed along via ...) and returns the selected variables and potentially the selection path. See the README (https://github.com/hofnerb/stabs/blob/master/README.md) for details. In that case, you could use the complete infrastructure of stabs (i.e., resampling, error control, parameter computation, ...). This would be my preferred way.
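Just to sketch the expected interface (toy.selector is a made-up name, and selecting the q predictors most correlated with y is only a stand-in for a real fitting routine):
toy.selector <- function(x, y, q, ...) {
    ## score each column of x by its absolute correlation with y
    score <- abs(cor(x, y))[, 1]
    ## keep the q highest-scoring predictors
    selected <- rank(-score, ties.method = "first") <= q
    names(selected) <- colnames(x)
    ## a logical matrix 'path' (selections along the fit path) could also be returned
    list(selected = selected)
}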
If we needed to resample individuals rather than cases, we could consider implementing such a resampling functionality as well. However, you can always do this by hand if you use stabsel and provide user-specified folds.
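Roughly along these lines (an untested sketch; make_subject_folds and id are placeholder names, and the folds argument is assumed to take an n-by-B 0/1 weight matrix with one column per subsample - please check ?stabsel and ?subsample for the exact format):
## id: length-n vector of subject identifiers, B: number of subsamples
make_subject_folds <- function(id, B = 50) {
    subjects <- unique(id)
    sapply(seq_len(B), function(b) {
        ## keep roughly half of the subjects and all of their observations
        keep <- sample(subjects, floor(length(subjects) / 2))
        as.integer(id %in% keep)
    })
}
## the resulting matrix would then be passed to stabsel() via its folds argument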
Another way would be to do the resampling (with samples of size floor(n/2)) on your own and to compute the PFER for a given cutoff (aka threshold) and q (or, analogously, any one of the three parameters given the other two) via the function
stabsel_parameters(p, cutoff, q, PFER, B, assumption, ...)
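For instance (numbers made up; a 40-node graph has 40 * 39 / 2 = 780 candidate edges), one could ask which q goes with a given cutoff and PFER bound:
library("stabs")
stabsel_parameters(p = 780, cutoff = 0.75, PFER = 1)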
If the first way is doable, you could either provide a patch (i.e., the relevant code) or pointers to the relevant packages and functions. I would then assist you in writing the required function(s) and manual(s).
Thanks, I can see an easy starting point for graphical models without groups - maybe the group approach will become clear later. However, how should q be interpreted in the graphical case? Is it the number of non-zero entries in the inverse covariance matrix, or some function of that? I can certainly make a start on a prototype.
Correct. See Meinshausen and Bühlmann (http://onlinelibrary.wiley.com/doi/10.1111/j.1467-9868.2010.00740.x/abstract):
[image: graphical_model] https://cloud.githubusercontent.com/assets/8823088/18311767/698aa6c0-7506-11e6-98e8-48059ef172b8.PNG
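That is, roughly speaking, q counts the selected edges, i.e., the non-zero off-diagonal entries of the estimated inverse covariance matrix, with each edge counted once; a toy illustration (values made up):
## 3-node precision matrix with two non-zero partial correlations
Theta <- matrix(c(1,   0.3, 0,
                  0.3, 1,   0.2,
                  0,   0.2, 1), nrow = 3)
sum(Theta[upper.tri(Theta)] != 0)   ## = 2 selected edges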
I was just starting to look into coding this and discovered the "pulsar" package:
https://cran.r-project.org/package=pulsar
It looks like it might do a lot of the work. Having a read now to see how it works.
Hi, I've had a look at the other packages I mentioned - pulsar and huge. They seem to focus on the "StARS" methodology, which is strictly about selecting the regularization level. While they use resampling to assess stability at a given regularization level, they don't combine the results of the different resamplings in the stability selection style.
I've written a couple of stubs for testing graphical methods - see what you think. I'm curious about how the number of subsamples and the cutoff should change in this scenario. Note that these stubs require a slightly modified version of stabs: devtools::install_github("richardbeare/stabs", ref="GraphTrialsA")
## helper to build a decreasing path of penalty (lambda) values,
## optionally equally spaced on the log scale
getLamPath <- function(max, min, len, log = FALSE)
{
    if (max < min)
        stop("Did you flip min and max?")
    if (log) {
        min <- log(min)
        max <- log(max)
    }
    lams <- seq(max, min, length.out = len)
    if (log)
        exp(lams)
    else
        lams
}
## simulate data from a sparse "hub" graphical model with the huge package
set.seed(10010)
p <- 40; n <- 1000
dat <- huge::huge.generator(n, p, "hub", verbose = FALSE, v = .1, u = .5)
## fitfun for stabsel: sparse inverse covariance estimation via QUIC;
## the "variables" being selected are the edges (upper-triangle entries)
stabs.quic <- function(x, y, q, ...)
{
    if (!requireNamespace("QUIC")) {
        stop("Package ", sQuote("QUIC"), " is required but not available")
    }
    ## sort out a lambda path
    empirical.cov <- cov(x)
    max.cov <- max(abs(empirical.cov[upper.tri(empirical.cov)]))
    lams <- getLamPath(max.cov, max.cov * 0.05, len = 40)
    est <- QUIC::QUIC(empirical.cov, rho = 1, path = lams, msg = 0)
    ## count the non-zero upper-triangle entries (edges) at each lambda
    ut <- upper.tri(empirical.cov)
    qvals <- sapply(1:length(lams), function(idx) {
        m <- est$X[, , idx]
        sum(m[ut] != 0)
    })
    ## first lambda with at least q edges;
    ## not sure if it is better to have more or less than q
    lamidx <- which.max(qvals >= q)
    ## need to return the entire upper triangle - think about how to save ram later
    M <- est$X[, , lamidx][ut]
    selected <- (M != 0)
    ## selection path: which edges were active at each lambda up to lamidx
    s <- sapply(1:lamidx, function(idx) {
        est$X[, , idx][ut] != 0
    })
    colnames(s) <- as.character(1:ncol(s))
    return(list(selected = selected, path = s))
}
sq <- stabsel(x = dat$data, y = dat$data, fitfun = stabs.quic,
              cutoff = 0.75, PFER = 1)
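To inspect the outcome, I guess the usual stabs accessors should still work (assuming the print and plot methods are unchanged in the modified package):
sq          ## selected edges, cutoff, q and the PFER bound
plot(sq)    ## selection frequencies of the edges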
Hi, not sure if this is the forum you'd like to use for queries - let me know if it isn't.
I'm exploring approaches using the JGL package, specifically the fused group lasso. I'm likely to be working with two groups, and I have the mechanisms in place to compute the two lambda values. The difference in partial correlation coefficients for corresponding graph edges is of interest. I have explored bootstrapping approaches to characterising this, but a stability selection approach looks interesting.
I'm unsure of how to use the q parameter in this setting. Do you have examples for glasso-like cases? I also need to be careful about how the resampling occurs within groups.
Thanks