hofnerb / stabs

Stability Selection with Error Control
https://cran.r-project.org/package=stabs

Can I use stabsel on non-zero matrix after lasso? #21

Closed raechin closed 7 years ago

raechin commented 7 years ago

Dear Hofner,

I have many large matrices (1000 obs × 15000 vars) for lasso and variable selection. To speed things up, I think it would be much faster to run stabsel() only on the variables whose lasso coefficients are nonzero (columns of the x matrix with zero coefficients removed).

Is this reasonable? Running stabsel() on the full x matrix returns a p-value for every variable in x, while running it on the reduced x matrix is much faster. The ordering of the resulting p-values seems consistent between x and the reduced x, but the values differ.

Here is my code:

library(glmnet)
library(stabs)

#data
set.seed(1234)
data("bodyfat", package = "TH.data")
x = as.matrix(bodyfat[, -2])
y = bodyfat[,2]

#use cv.glmnet to choose the best lambda
cv.fit = cv.glmnet(x, y, alpha = 1)
cv.fit$lambda.min
res = glmnet(x = x, y = y, family = "gaussian", alpha = 1, lambda = cv.fit$lambda.min)

#xuse: columns of x with nonzero lasso coefficients
nonZero = which(as.vector(res$beta) != 0)
xuse = x[, nonZero, drop = FALSE]

#stabsel on xuse (zero coefficient columns removed)
set.seed(1234)
stab.lasso1 <- stabsel(x = xuse, y = y, fitfun = glmnet.lasso,
                       args.fitfun = list(family = "gaussian", lambda = cv.fit$lambda.min),
                       cutoff = 0.75, PFER = 1)

#stabsel on x
set.seed(1234)
stab.lasso2 <- stabsel(x = x, y = y, fitfun = glmnet.lasso,
                       args.fitfun = list(family = "gaussian", lambda = cv.fit$lambda.min),
                       cutoff = 0.75, PFER = 1)

stab.lasso1$phat
stab.lasso2$phat
stab.lasso2$phat[nonZero]

output:

stab.lasso1$phat
              s0
waistcirc   1.00
hipcirc     1.00
kneebreadth 0.97
anthro3a    0.68
anthro3b    0.92
anthro3c    0.57

stab.lasso2$phat
               s0
age          0.52
waistcirc    0.99
hipcirc      1.00
elbowbreadth 0.45
kneebreadth  0.94
anthro3a     0.57
anthro3b     0.85
anthro3c     0.56
anthro4      0.17

stab.lasso1$selected
  waistcirc     hipcirc kneebreadth    anthro3b
          1           2           3           5

stab.lasso2$selected
  waistcirc     hipcirc kneebreadth    anthro3b
          2           3           5           7

Running stabsel() on x or on the reduced x (xuse) selects the same variables. But is there any potential problem with running stabsel() on the reduced x?

Thank you!

hofnerb commented 7 years ago

Dear @raechin,

some notes regarding stability selection:

1) You do not get p-values. What you obtain are selection frequencies, which show you the stability of your results.

2) You should not use only variables that were pre-selected by the same algorithm; that runs against the idea of stability selection. You can select a subset of variables for other reasons, but I would not do this unless necessary, and I would especially not reduce the data set to a set of somehow-predictive variables only. Use a (much) larger set of variables so that stability selection can really work and distinguish stable from unstable variables.

3) You cannot fix lambda. The idea of stability selection is to NOT specify a penalty parameter, but to restrict the model by specifying the average number of non-zero coefficients and then seeing how often each variable is selected. As you pre-specify lambda, you always get the same variables.

Using stability selection correctly gives you, with the full data set:

set.seed(1234)
stabsel(x = x, y = y, fitfun = glmnet.lasso, cutoff = 0.75, PFER = 1) 
# Stability Selection with unimodality assumption
# 
# Selected variables:
#     waistcirc   hipcirc 
#             2         3 
# 
# Selection probabilities:
#     age elbowbreadth  kneebreadth      anthro4     anthro3b     anthro3c     anthro3a    waistcirc      hipcirc 
#    0.00         0.00         0.00         0.04         0.05         0.10         0.36         0.96         0.97 
# 
# ---
# Cutoff: 0.75; q: 2; PFER (*):  0.454 
#    (*) or expected number of low selection probability variables
# PFER (specified upper bound):  1 
# PFER corresponds to signif. level 0.0504 (without multiplicity adjustment)

and with the subset:

set.seed(1234)
stabsel(x = xuse, y = y, fitfun = glmnet.lasso, cutoff = 0.75, PFER = 1) 
# Stability Selection with unimodality assumption
# 
# Selected variables:
#     waistcirc   hipcirc 
#             1         2 
# 
# Selection probabilities:
#     kneebreadth    anthro3b    anthro3c    anthro3a   waistcirc     hipcirc 
#            0.00        0.08        0.10        0.37        0.96        0.97 
# 
# ---
# Cutoff: 0.75; q: 2; PFER (*):  0.68 
#     (*) or expected number of low selection probability variables
# PFER (specified upper bound):  1 
# PFER corresponds to signif. level 0.113 (without multiplicity adjustment)

As you can see, in this case the same variables are very stably selected. Overall, however, the selection frequencies differ, and in other cases different variables might end up in your final subset. Furthermore, the three stability selection parameters q (the average number of selected variables), cutoff (the selection frequency above which variables are termed stable) and PFER (the per-family error rate) depend on each other, but also on the number of candidate variables p. In the above examples you can see that the realized PFER is 0.454 with all variables and 0.68 with the subset. If the numbers of variables differ more strongly, the parameters may also differ more strongly.
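The interplay of cutoff, q, PFER, and p can be inspected directly with the package's stabsel_parameters() helper, without running any model fits. The sketch below assumes the bodyfat setup from above, where the full x has 9 candidate variables and the reduced xuse has 6:

```r
library(stabs)

## given any two of cutoff, q, and PFER, stabsel_parameters() derives the
## third; the result depends on the number of candidate variables p
stabsel_parameters(p = 9, cutoff = 0.75, PFER = 1)  # full x: 9 candidates
stabsel_parameters(p = 6, cutoff = 0.75, PFER = 1)  # reduced xuse: 6 candidates
```

With identical cutoff and PFER bounds, the two calls report different values of q and of the realized PFER, which is exactly why results on x and xuse are not directly comparable.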

All in all, please have a look at the README and the relevant literature referenced there; it also gives you some ideas about how to choose your stability selection parameters.