hofnerb / stabs

Stability Selection with Error Control
https://cran.r-project.org/package=stabs

q = 1 -> more than one variable selected #23

Closed competulix closed 7 years ago

competulix commented 7 years ago

I have a quick question: I used stabsel in combination with glmnet.lasso and set q to be 1. However, the results show that more than one variable has been selected - how is that possible?

stabsel(x = data[predict], y = data$freq_calls, fitfun=glmnet.lasso, args.fitfun=list(family="poisson"), PFER = 0.2, q=1, sampling.type="SS")

Stability Selection with unimodality assumption

Selected variables:
   A    E 
   2    6 

Selection probabilities:
   O   N.  Ag.   Ed    I    G    C    E    A 
0.00 0.01 0.01 0.02 0.09 0.10 0.19 0.66 0.67 

---
Cutoff: 0.65; q: 1; PFER (*):  0.192 
   (*) or expected number of low selection probability variables
PFER (specified upper bound):  0.2 
PFER corresponds to signif. level 0.0213 (without multiplicity adjustment)

Thank you already.

hofnerb commented 7 years ago

The (average) number of covariates, q, selected per data subset does not necessarily correspond to the number of stably selected variables. However, the latter should usually be <= q.

Hence, it looks like a bug in the glmnet.lasso fitfun.


Of note:

If you use lars.lasso, everything works as expected, as can be seen from the sum of the selection frequencies:

## example from ?stabsel
library("stabs")
data("bodyfat", package = "TH.data")
(stab.lasso <- stabsel(x = bodyfat[, -2], y = bodyfat[,2],
                        fitfun = lars.lasso, cutoff = 0.75,
                        PFER = 1))
# Stability Selection with unimodality assumption
# 
# Selected variables:
#    waistcirc   hipcirc 
#            2         3 
# 
# Selection probabilities:
#     age elbowbreadth  kneebreadth     anthro3c      anthro4     anthro3b     anthro3a    waistcirc      hipcirc 
#    0.00         0.00         0.00         0.01         0.01         0.02         0.11         0.90         0.95 
# 
# ---
#     Cutoff: 0.75; q: 2; PFER (*):  0.454 
# (*) or expected number of low selection probability variables
# PFER (specified upper bound):  1 
# PFER corresponds to signif. level 0.0504 (without multiplicity adjustment)
sum(stab.lasso$max)
# [1] 2

However, if you use glmnet.lasso the selection frequencies are above the pre-specified q:

(stab.glmnet <- stabsel(x = bodyfat[, -2], y = bodyfat[,2],
                                   fitfun = glmnet.lasso, cutoff = 0.75,
                                   PFER = 1))
# Stability Selection with unimodality assumption
# 
# Selected variables:
#     waistcirc   hipcirc 
#             2         3 
# 
# Selection probabilities:
#     age elbowbreadth  kneebreadth     anthro3b      anthro4     anthro3c     anthro3a    waistcirc      hipcirc 
#    0.00         0.00         0.02         0.07         0.08         0.11         0.40         0.92         0.94 
# 
# ---
#     Cutoff: 0.75; q: 2; PFER (*):  0.454 
# (*) or expected number of low selection probability variables
# PFER (specified upper bound):  1 
# PFER corresponds to signif. level 0.0504 (without multiplicity adjustment)
sum(stab.glmnet$max)
# [1] 2.54
hofnerb commented 7 years ago

Currently, we use

glmnet::glmnet(model.matrix(~. - 1, bodyfat[, -2]), bodyfat[, 2], dfmax = 2 - 1)

to obtain models with at most q variables (here q = 2). However, on average, more than q variables are selected.

Alternatively, one could use

glmnet::glmnet(model.matrix(~. - 1, bodyfat[, -2]), bodyfat[, 2],  pmax = q)

However, this selects at most q variables, i.e., on average fewer than q. How can we best achieve exactly q variables?
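One way to get exactly q variables (when the path allows it) would be to walk the glmnet path and pick the first lambda whose active set has size exactly q, falling back to the largest active set with at most q variables. A rough sketch, not part of stabs, reusing the bodyfat example (assumes glmnet and TH.data are installed):

```r
library("glmnet")
data("bodyfat", package = "TH.data")
x <- model.matrix(~ . - 1, bodyfat[, -2])
q <- 2

fit <- glmnet::glmnet(x, bodyfat[, 2])
## active set at each lambda along the path
active <- predict(fit, type = "nonzero")
sizes <- sapply(active, length)          # length(NULL) is 0
hit <- which(sizes == q)[1]              # first step with exactly q variables
if (is.na(hit))                          # the path may jump over q entirely
    hit <- max(which(sizes <= q))
active[[hit]]                            # column indices of the selected variables
```

The fallback is needed precisely because of the jumps discussed below: the active set can grow by more than one variable per step, so a set of size exactly q may never occur on the path.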

berndbischl commented 7 years ago

@hofnerb

1) I know @competulix, that's "Clemens"; I recommended stabs to him for one of his current papers.

2) Maybe an idea: I guess I would be interested in simply running stability selection without really arriving at a "final set" of selected variables - just calculate the selection frequencies and then "look at them". That also means I don't have to worry about the hard problem of fixing q and PFER before I know my results.

The only thing that prohibits this is that you fix dfmax in glmnet for me (which is apparently also not correct here, but that's not my point).

Does this make sense as a usecase that I would like stabs to support?
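The "just look at the frequencies" workflow sketched above is already possible with the current API, since stabsel stores the maximum selection frequencies in the $max slot of its result. A minimal sketch, reusing the bodyfat example (the plot type = "maxsel" call assumes the plot method of stabs):

```r
library("stabs")
data("bodyfat", package = "TH.data")
## fit with some q and PFER, but then ignore the "selected" set
stab <- stabsel(x = bodyfat[, -2], y = bodyfat[, 2],
                fitfun = lars.lasso, q = 2, PFER = 1)
## inspect the per-variable selection frequencies directly
sort(stab$max, decreasing = TRUE)
## or visualise them
plot(stab, type = "maxsel")
```

Note that q still has to be fixed up front, since it controls how sparse each per-subsample fit is; only the cutoff/PFER-based "final set" is being ignored here.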

competulix commented 7 years ago

@berndbischl - thx, now you blew my cover! But yeah, that sounds like a reasonable idea.

@hofnerb - do you think that might work? I need to use the glmnet.lasso function in order to assume a Poisson or gamma distribution for the dependent variable. However, one could remove the dfmax limitation and stop at the selection probabilities. What do you think about this approach, and how reliable are those probabilities anyway (as you mentioned, they currently don't sum to q)?

I'd really appreciate your help as I want to include this analysis in a paper of mine (currently under revision).

Many thanks

Clemens

hofnerb commented 7 years ago

@competulix You can simply use the selection frequencies from stabsel, but you should not REMOVE the dfmax limitation and run the model until convergence.

Well, you can do this, but it would be a different idea than stability selection. I would rather propose to use the current implementation (or we try to fix the issue).

Theoretically, this should not really be a big problem, just some coding work: one would simply have to check that the fit really contains only the first q variables and nothing more. I just tried to do this but ran into another error: (in my example above) the path sometimes simply jumps from one selected variable to three (!):

selected
# $s0
# NULL
# 
# $s1
# [1] 3
# 
# $s2
# [1] 3
# 
# $s3
# [1] 2 3 6
# 
# $s4
# [1] 2 3 6

This also seems to be the reason why pmax = q selects <= q variables. It appears to be an algorithmic issue in package glmnet: the steps along the solution path are sometimes too big.
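For reference, such jumps can be checked directly on a glmnet path (a small sketch, again using the bodyfat example):

```r
library("glmnet")
data("bodyfat", package = "TH.data")
fit <- glmnet::glmnet(model.matrix(~ . - 1, bodyfat[, -2]), bodyfat[, 2])
## size of the active set at each lambda along the path
sizes <- sapply(predict(fit, type = "nonzero"), length)
## TRUE if the active set ever grows by more than one variable per step
any(diff(sizes) > 1)
```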

hofnerb commented 7 years ago

@competulix could you try the following fit.fun and compare the results with the results from glmnet.lasso?

glmnet.lasso2 <- function(x, y, q, ...) {
    if (!requireNamespace("glmnet", quietly=TRUE))
        stop("Package ", sQuote("glmnet"), " needed but not available")

    if (is.data.frame(x)) {
        message("Note: ", sQuote("x"),
                " is coerced to a model matrix without intercept")
        x <- model.matrix(~ . - 1, x)
    }

    if ("lambda" %in% names(list(...)))
        stop("It is not permitted to specify the penalty parameter ", sQuote("lambda"),
             " for lasso when used with stability selection.")

    ## fit model
    fit <- glmnet::glmnet(x, y, pmax = q, ...)

    ## which coefficients are non-zero?
    selected <- predict(fit, type = "nonzero")
    selected <- selected[[length(selected)]]
    ret <- logical(ncol(x))
    ret[selected] <- TRUE
    names(ret) <- colnames(x)
    ## compute selection paths
    cf <- fit$beta
    sequence <- as.matrix(cf != 0)
    ## return both
    return(list(selected = ret, path = sequence))
}

Actually, with glmnet.lasso2 we should be on the safe side, as the upper bound for the PFER is defined as follows:

PFER <= q^2 / ((2 * cutoff - 1) * p)

Hence, if our realised q is smaller than the pre-specified one, this leads to a conservative approximation. On the other hand, this upper bound is conservative anyway, so it should not be an issue if we use the anti-conservative glmnet.lasso implementation.
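Plugging in the numbers from the bodyfat example above (cutoff = 0.75, q = 2, p = 9 candidate variables) illustrates how loose this bound is. Note this is the original Meinshausen-Buehlmann bound; the value of 0.454 printed by stabsel comes from the tighter bound under the unimodality assumption:

```r
cutoff <- 0.75
q <- 2
p <- 9
pfer_bound <- 1 / (2 * cutoff - 1) * q^2 / p
pfer_bound
# [1] 0.8888889
```

A realised q below the nominal one only shrinks the q^2 term, so the stated PFER guarantee still holds.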

At least to get some impression of stable effects, both should be fine.

Please contact me via email (simply ask @berndbischl) for more help.

competulix commented 7 years ago

@hofnerb - Thank you for the quick reply - I'll try this and get back to you asap.

hofnerb commented 7 years ago

I just had a look at another implementation of stability selection in package hdi and found that they also use the conservative approximation of choosing at most q variables.

Hence, I changed the code in the package. By default, glmnet.lasso now computes the conservative version (called glmnet.lasso2 above). The old, anti-conservative version can be used by setting args.fitfun = list(type = "anticonservative") (or an abbreviation thereof), e.g.:

stab.glmnet <- stabsel(x = bodyfat[, -2], y = bodyfat[,2],
                         fitfun = glmnet.lasso, args.fitfun = list(type = "anti"), 
                         cutoff = 0.75, PFER = 1)
competulix commented 7 years ago

wow this was fast - are you gonna send that to CRAN as well?

hofnerb commented 7 years ago

I am currently preparing a new release (as the last version is already very old) but am not sure if this is gonna work out today. If not, you can always install the package from GitHub as explained in the README.md.

competulix commented 7 years ago

Ok, sounds great! GitHub installation is fine for now - publication-wise, an official CRAN release is preferable though ;-).

Thank you very much - I really like your work!

hofnerb commented 7 years ago

The new version is on CRAN now. Thanks for your input.