raechin closed this issue 7 years ago
Dear @raechin,
some notes regarding stability selection:
1) You do not get p-values. What you obtain are selection frequencies, which show you the stability of your results.
2) You should not use only variables that were pre-selected by the same algorithm; this runs against the idea of stability selection. You can select a subset of variables for other reasons, but I would not do so unless necessary, and I would especially not reduce the data set to only those variables that already looked predictive. Use a (much) larger set of candidate variables so that stability selection can really distinguish stable from unstable variables.
3) You cannot fix lambda. The idea of stability selection is NOT to specify a penalty parameter, but to restrict the model by specifying the average number of non-zero coefficients (q) and then to see how often each variable is selected. If you pre-specify lambda, you get the same variables every time.
Used correctly, stability selection gives you, with the full data set:
set.seed(1234)
stabsel(x = x, y = y, fitfun = glmnet.lasso, cutoff = 0.75, PFER = 1)
# Stability Selection with unimodality assumption
#
# Selected variables:
# waistcirc hipcirc
# 2 3
#
# Selection probabilities:
# age elbowbreadth kneebreadth anthro4 anthro3b anthro3c anthro3a waistcirc hipcirc
# 0.00 0.00 0.00 0.04 0.05 0.10 0.36 0.96 0.97
#
# ---
# Cutoff: 0.75; q: 2; PFER (*): 0.454
# (*) or expected number of low selection probability variables
# PFER (specified upper bound): 1
# PFER corresponds to signif. level 0.0504 (without multiplicity adjustment)
and with the subset:
set.seed(1234)
stabsel(x = xuse, y = y, fitfun = glmnet.lasso, cutoff = 0.75, PFER = 1)
# Stability Selection with unimodality assumption
#
# Selected variables:
# waistcirc hipcirc
# 1 2
#
# Selection probabilities:
# kneebreadth anthro3b anthro3c anthro3a waistcirc hipcirc
# 0.00 0.08 0.10 0.37 0.96 0.97
#
# ---
# Cutoff: 0.75; q: 2; PFER (*): 0.68
# (*) or expected number of low selection probability variables
# PFER (specified upper bound): 1
# PFER corresponds to signif. level 0.113 (without multiplicity adjustment)
As you can see, in this case the same variables are very stably selected. However, the selection frequencies differ overall, and in other cases different variables might end up in your final subset. Furthermore, the three stability selection parameters q (the average number of selected variables), cutoff (the selection frequency above which variables are called stable), and PFER (the per-family error rate) depend on each other, but also on the number of candidate variables p. In the examples above you can see that the realized PFER is 0.454 with all variables and 0.68 with the subset. If the numbers of candidate variables differ more strongly, the parameters may also differ more strongly.
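To see this interdependence directly, you can compute the implied parameters without running the resampling at all. A small sketch (the values of p match the two examples above; I am assuming the stabs package is installed) using stabsel_parameters() from stabs:

```r
library("stabs")

## full data set: p = 9 candidate variables
## cutoff and PFER are fixed, so q and the realized error bound are implied
stabsel_parameters(p = 9, cutoff = 0.75, PFER = 1)

## reduced data set: p = 6 candidate variables
## same cutoff and PFER, but a different p changes the realized PFER
stabsel_parameters(p = 6, cutoff = 0.75, PFER = 1)
```

Comparing the two outputs shows how the realized PFER shifts with p even though cutoff and the specified PFER bound stay the same.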
All in all, please have a look at the README and the relevant literature:
The latter publication also gives you some ideas about how to choose your stability selection parameters.
Dear Hofner,
I have many large matrices (1000 obs * 15000 vars) for lasso and variable selection. To speed things up, I think it would be much faster to run stabsel() on data containing only the variables whose lasso coefficients are > 0 (columns of the x matrix with zero coefficients removed).
Is this reasonable? Running stabsel() on the full x matrix returns p-values for all variables in x, while running it on the reduced x matrix is much faster. The order of the resulting p-values for the variables seems to be consistent between x and the reduced x, but the values differ.
Here is my code:
output:
Running stabsel() on x or on the reduced x (xuse) selects the same variables. But is there any potential problem with running stabsel() on the reduced x?
Thank you!