Closed: competulix closed this issue 7 years ago
The (average) number of covariates selected per data subset, q, does not necessarily correspond to the number of stably selected variables. However, the latter should usually be <= q. Hence, this looks like a bug in the glmnet.lasso fitfun.
Of note: if you use lars.lasso, everything works as expected, as can be seen from the sum of the selection probabilities:
## example from ?stabsel
library("stabs")
data("bodyfat", package = "TH.data")
(stab.lasso <- stabsel(x = bodyfat[, -2], y = bodyfat[, 2],
                       fitfun = lars.lasso, cutoff = 0.75,
                       PFER = 1))
# Stability Selection with unimodality assumption
#
# Selected variables:
# waistcirc hipcirc
# 2 3
#
# Selection probabilities:
# age elbowbreadth kneebreadth anthro3c anthro4 anthro3b anthro3a waistcirc hipcirc
# 0.00 0.00 0.00 0.01 0.01 0.02 0.11 0.90 0.95
#
# ---
# Cutoff: 0.75; q: 2; PFER (*): 0.454
# (*) or expected number of low selection probability variables
# PFER (specified upper bound): 1
# PFER corresponds to signif. level 0.0504 (without multiplicity adjustment)
sum(stab.lasso$max)
# [1] 2
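The reason this sum should equal q can be checked with a quick language-neutral sketch (Python here, purely illustrative and not part of stabs): if each subsample selects exactly q of p variables, the selection frequencies necessarily add up to q.

```python
import random

# Toy check (not the stabs code): if each of B subsamples selects
# exactly q of p variables, the selection frequencies sum to q.
random.seed(1)
p, q, B = 9, 2, 100
counts = [0] * p
for _ in range(B):
    for j in random.sample(range(p), q):  # exactly q selections per subsample
        counts[j] += 1
freqs = [c / B for c in counts]
print(round(sum(freqs), 10))  # 2.0
```

This is exactly the invariant that lars.lasso satisfies above and that glmnet.lasso violates below.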
However, if you use glmnet.lasso, the selection frequencies sum to more than the pre-specified q:
(stab.glmnet <- stabsel(x = bodyfat[, -2], y = bodyfat[, 2],
                        fitfun = glmnet.lasso, cutoff = 0.75,
                        PFER = 1))
# Stability Selection with unimodality assumption
#
# Selected variables:
# waistcirc hipcirc
# 2 3
#
# Selection probabilities:
# age elbowbreadth kneebreadth anthro3b anthro4 anthro3c anthro3a waistcirc hipcirc
# 0.00 0.00 0.02 0.07 0.08 0.11 0.40 0.92 0.94
#
# ---
# Cutoff: 0.75; q: 2; PFER (*): 0.454
# (*) or expected number of low selection probability variables
# PFER (specified upper bound): 1
# PFER corresponds to signif. level 0.0504 (without multiplicity adjustment)
sum(stab.glmnet$max)
# [1] 2.54
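For reference, adding up the probabilities printed above confirms the overshoot (a trivial check, with Python used only as a calculator):

```python
# Selection probabilities as printed by stabsel for glmnet.lasso above
probs = [0.00, 0.00, 0.02, 0.07, 0.08, 0.11, 0.40, 0.92, 0.94]
q = 2
total = round(sum(probs), 2)
print(total, total > q)  # 2.54 True, i.e. the sum exceeds the pre-specified q
```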
Currently, we use
glmnet::glmnet(model.matrix(~. - 1, bodyfat[, -2]), bodyfat[, 2], dfmax = 2 - 1)
to obtain models with at most q variables. However, on average, more than q variables are selected.
Alternatively, one could use
glmnet::glmnet(model.matrix(~. - 1, bodyfat[, -2]), bodyfat[, 2], pmax = q)
However, this selects at most q variables, i.e., on average fewer than q. How can we best achieve exactly q selected variables?
@hofnerb
1) I know @competulix, that's "Clemens"; I recommended stabs to him for one of his current papers.
2) Maybe an idea: I guess I would be interested in simply running stability selection without really arriving at a "final set" of selected variables, i.e., just calculating the selection frequencies and then "looking at them". That also means I don't have to worry about the hard problem of fixing q and PFER before I know my results.
The only thing that prevents this is that you fix dfmax in glmnet for me (which is also apparently not correct here, but that's not my point).
Does this make sense as a use case that I would like stabs to support?
@berndbischl - thx, now you blew my cover! But yeah, that sounds like a reasonable idea.
@hofnerb - do you think that might work? I need to use the glmnet.lasso function in order to assume a Poisson or gamma distribution for the dependent variable. However, one could remove the dfmax limitation and stop at the selection probabilities. What do you think about this approach, and how reliable are those probabilities anyway (as you mentioned, they currently don't sum up correctly)?
I'd really appreciate your help, as I want to include this analysis in a paper of mine (currently under revision).
Many thanks
Clemens
@competulix You can simply use the selection frequencies from stabsel, but you should not REMOVE the dfmax limitation and run the model until convergence. Well, you can do this, but that would be a different idea than stability selection. I would rather propose to use the current implementation (or I try to fix the issue).
Theoretically, this should not really be a big problem, just some coding work: one would simply have to check that the fit really contains only the first q variables and nothing more. I just tried to do this but ran into another error: in my example above, the number of selected variables sometimes simply jumps from one to three (!):
selected
# $s0
# NULL
#
# $s1
# [1] 3
#
# $s2
# [1] 3
#
# $s3
# [1] 2 3 6
#
# $s4
# [1] 2 3 6
This also seems to be the reason why pmax = q selects <= q variables. It appears to be an algorithmic issue of package glmnet: the steps along the solution path are sometimes too big.
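The consequence of such a jump for pmax-style stopping can be sketched abstractly (the path sizes below are taken from the s0..s4 printout above; Python is used only for illustration, this is not glmnet code): keeping the last path step with at most q active variables can leave you strictly below q.

```python
# Active-set sizes along the lambda path, as in the s0..s4 printout above
path_sizes = [0, 1, 1, 3, 3]
q = 2
# pmax-style stopping: take the last step whose active set has at most q variables
realised_q = max(s for s in path_sizes if s <= q)
print(realised_q)  # 1, i.e. fewer than the requested q = 2
```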
@competulix could you try the following fitfun and compare the results with the results from glmnet.lasso?
glmnet.lasso2 <- function(x, y, q, ...) {
    if (!requireNamespace("glmnet", quietly = TRUE))
        stop("Package ", sQuote("glmnet"), " needed but not available")
    if (is.data.frame(x)) {
        message("Note: ", sQuote("x"),
                " is coerced to a model matrix without intercept")
        x <- model.matrix(~ . - 1, x)
    }
    if ("lambda" %in% names(list(...)))
        stop("It is not permitted to specify the penalty parameter ", sQuote("lambda"),
             " for lasso when used with stability selection.")
    ## fit model
    fit <- glmnet::glmnet(x, y, pmax = q, ...)
    ## which coefficients are non-zero?
    selected <- predict(fit, type = "nonzero")
    selected <- selected[[length(selected)]]
    ret <- logical(ncol(x))
    ret[selected] <- TRUE
    names(ret) <- colnames(x)
    ## compute selection paths
    cf <- fit$beta
    sequence <- as.matrix(cf != 0)
    ## return both
    return(list(selected = ret, path = sequence))
}
Actually, with glmnet.lasso2 we should be on the safe side, as the upper bound for the PFER is defined as
PFER <= 1 / (2 * cutoff - 1) * q^2 / p
Hence, if our realised q is smaller than the pre-specified one, this leads to a conservative approximation. On the other hand, this upper bound is conservative anyway, so it should not be an issue if we use the anti-conservative glmnet.lasso implementation.
At least to get some impression of stable effects, both should be fine.
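Plugging the numbers from the example above into that bound (cutoff = 0.75, q = 2, p = 9) gives roughly 0.89; note that the stabsel printout reports a smaller value (0.454), presumably because it uses the tighter bound available under the unimodality assumption. A quick check, with Python used only as a calculator:

```python
# Upper bound quoted above: PFER <= 1 / (2 * cutoff - 1) * q^2 / p
cutoff, q, p = 0.75, 2, 9
pfer_bound = 1 / (2 * cutoff - 1) * q**2 / p
print(round(pfer_bound, 3))  # 0.889
```

With a smaller realised q, the q^2 / p factor shrinks, which is why the conservative variant can only tighten this bound.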
Please contact me via email (simply ask @berndbischl) for more help.
@hofnerb - Thank you for the quick reply - I'll try this and get back to you asap.
I just had a look at another implementation of stability selection, in package hdi, and found that they also use the conservative approximation of choosing at most q variables.
Hence, I changed the code in the package. Per default, glmnet.lasso now computes the conservative version (called glmnet.lasso2 above). The old, anti-conservative version can be used by setting args.fitfun = list(type = "anticonservative") (or an abbreviation thereof), e.g.:
stab.glmnet <- stabsel(x = bodyfat[, -2], y = bodyfat[, 2],
                       fitfun = glmnet.lasso, args.fitfun = list(type = "anti"),
                       cutoff = 0.75, PFER = 1)
Wow, this was fast - are you going to send that to CRAN as well?
I am currently preparing a new release version (as the last version is already very old) but am not sure if this will work out today. If not, you can always install the package from GitHub as explained in the README.md.
Ok, sounds great! GitHub installation is the current option; publication-wise, an official CRAN release is preferable though ;-).
Thank you very much - I really like your work!
The new version is on CRAN now. Thanks for your input.
I have a quick question: I used stabsel in combination with glmnet.lasso and set q to 1. However, the results show that more than one variable was selected; how is that possible?
Thank you in advance.