rlesca01 closed this issue 9 years ago
What happens if you take the default lambdas (100) and explicitly pass them to glmnet?
One idea would be to have caret use glmnet's default lambdas in the default tuneGrid.
If you run glmnet, pull out the lambdas, and plug them back in to run it again, it works fine and takes exactly the same amount of time.
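A minimal sketch of that round trip (assuming the glmnet package, a numeric matrix x, and a response y):

```r
library(glmnet)

## first fit: let glmnet compute its own lambda path
fit1 <- glmnet(x, y)

## second fit: pass those same lambdas back in explicitly
fit2 <- glmnet(x, y, lambda = fit1$lambda)
```

Both fits traverse the same path, which is why the second run costs about the same as the first.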
The problem is that the full lambda sequence is determined in Hastie's underlying Fortran code.
I've just been using tuneGrid and guessing at a good sequence if I'm going to use glmnet with caret. Setting a higher tuneLength expands alpha as well, which isn't desirable.
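For reference, a hand-rolled grid of that sort might look like this (the lambda range here is a guess for illustration, not a recommendation), fixing alpha to a short sequence so that tuneLength does not expand it:

```r
library(caret)

## hypothetical hand-picked grid: log-spaced lambdas, a few alphas
manualGrid <- expand.grid(alpha  = c(0.1, 0.55, 1),
                          lambda = 10^seq(-4, 0, length = 20))
## then: train(x, y, method = "glmnet", tuneGrid = manualGrid, ...)
```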
Thanks
I wonder if we could write a function that would approximate Hastie's underlying fortran code and take the guesswork out of finding a good lambda sequence. If we can do that, I think we can put it into caret.
The default code that you mention:

expand.grid(alpha = seq(0.1, 1, length = len),
            lambda = seq(.1, 3, length = 3 * len))

is really for the default grid. Instead of going the Fortran route, we can do something along the lines of the rpart models: fit an initial model and pull off the lambdas from that list.
While this isn't that computationally efficient, note that glmnet calls different worker functions depending on the distribution (e.g. elnet for Gaussian, fishnet for Poisson, etc.) that generate the lambdas and fits. For all we know, each of these has different math to get the lambda path.
One solution might be to have the grid function do something like this:

if(tuneLength > 5) {
  run an initial glmnet and get the lambda path
  pick off tuneLength values of the lambda path
} else use the default grid
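In R, that branch might be sketched roughly like so (a sketch only; the default-grid values are the ones quoted above):

```r
## rough sketch of the proposed grid logic
glmnetGrid <- function(x, y, tuneLength) {
  if (tuneLength > 5) {
    ## run an initial fit just to recover the lambda path
    init <- glmnet::glmnet(as.matrix(x), y)
    idx  <- round(seq(1, length(init$lambda), length.out = tuneLength))
    expand.grid(alpha  = seq(0.1, 1, length = tuneLength),
                lambda = init$lambda[idx])
  } else {
    ## fall back to the fixed default grid
    expand.grid(alpha  = seq(0.1, 1, length = tuneLength),
                lambda = seq(.1, 3, length = 3 * tuneLength))
  }
}
```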
If you are going to fit a granular grid of lambda values, the computational expense of the initial glmnet fit becomes less painful. I will see if there are options that can be used on the initial fit to shorten the time (e.g. setting max iterations = 1, etc.).
I just did some testing on moderately sized data sets.
First, doing an initial glmnet fit didn't cost much time compared to the overall process. This will not always be the case, but it is not too bad.
Second, it fits a lot of bad models, at least for the three disparate cases that I tried. The path includes a lot of models with really large lambda values that regularize the crap out of the coefficients. That doesn't mean it will never work but I would never try values of lambda in the thousands. There is code below for one example from APM*. The others that I tried use Pfizer data.
Third, this does expose a design flaw in the package. I should have passed ... to the grid function. Since I didn't, there is a disconnect between the model that might be fitted to create the grid and the models that are created by the fit function. We could add this to the existing functions easily, but it would break backward compatibility for custom models out there.
Overall, I think this would be a good idea for making the default grid. It might make sense, for small values of tuneLength (say < 5), to avoid excessively large values of lambda.
*
library(caret)
library(glmnet)
library(AppliedPredictiveModeling)
data(solubility)

set.seed(100)
indx <- createFolds(solTrainY, returnTrain = TRUE)
ctrl <- trainControl(method = "cv", index = indx)

init <- glmnet(as.matrix(solTrainXtrans), solTrainY)
enetGrid <- expand.grid(lambda = init$lambda[seq(1, length(init$lambda), by = 5)],
                        ## originally c(0, 0.01, .1)
                        alpha = seq(.05, 1, length = 20))
## now:
## > summary(init$lambda[seq(1, length(init$lambda), by = 5)])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0001954 0.0018140 0.0166700 0.1811000 0.1511000 1.3470000
set.seed(100)
gnetTune <- train(solTrainXtrans, solTrainY,
                  method = "glmnet",
                  tuneGrid = enetGrid,
                  trControl = ctrl,
                  preProc = c("center", "scale"))
## > system.time(glmnet(as.matrix(solTrainXtrans), solTrainY))[3]/gnetTune$times$everything[3]*100
## elapsed
## 1.506544
##
## the initial fit would have taken < 2% of the total fitting time
Ugh. I forgot about alpha.
I'll try this on some other examples too, but here is the APM example again:
> library(plyr)
> alphas <- (1:9)/10
>
> for(i in seq(along = alphas)) {
+ init <- glmnet(as.matrix(solTrainXtrans), solTrainY, alpha = alphas[i])
+ tmp <- data.frame(alpha = alphas[i], lambda = init$lambda)
+ out <- if(i == 1) tmp else rbind(out, tmp)
+ }
>
> ddply(out, .(alpha), function(x) summary(x$lambda))
alpha Min. 1st Qu. Median Mean 3rd Qu. Max.
1 0.1 0.0013470 0.013480 0.13490 1.5160 1.3480 13.470
2 0.2 0.0006735 0.006741 0.06743 0.7581 0.6741 6.735
3 0.3 0.0004490 0.004494 0.04495 0.5054 0.4494 4.490
4 0.4 0.0003368 0.003370 0.03371 0.3790 0.3370 3.368
5 0.5 0.0002694 0.002696 0.02697 0.3032 0.2696 2.694
6 0.6 0.0003575 0.003186 0.02833 0.2660 0.2525 2.245
7 0.7 0.0004879 0.003870 0.03067 0.2406 0.2430 1.924
8 0.8 0.0003890 0.003159 0.02559 0.2082 0.2078 1.684
9 0.9 0.0001497 0.001498 0.01498 0.1685 0.1498 1.497
I'm going to check in code that uses the function below with alpha = 0.5 to split the difference. Also, I trim away the largest and smallest values to make the grid less extreme.
Test as you see fit...
function(x, y, len = NULL) {
  ## work out the family from the outcome type
  numLev <- if(is.character(y) | is.factor(y)) length(levels(y)) else NA
  if(!is.na(numLev)) {
    fam <- ifelse(numLev > 2, "multinomial", "binomial")
  } else fam <- "gaussian"
  ## initial fit just to get the lambda path; alpha = 0.5 splits the difference
  init <- glmnet(as.matrix(x), y,
                 family = fam,
                 nlambda = len + 2,
                 alpha = .5)
  lambda <- unique(init$lambda)
  ## trim the largest and smallest values to make the grid less extreme
  lambda <- lambda[-c(1, length(lambda))]
  lambda <- lambda[1:min(length(lambda), len)]
  expand.grid(alpha = seq(0.1, 1, length = len),
              lambda = lambda)
}
added to the package
I noticed that when running glmnet with caret it seems to use fixed lambdas. A huge portion of the time it seems like I get 0.1 as the lambda selected. There may very well be a good reason for this but I know that using discrete lambdas is often not faster than running the full regularization path. In addition I've seen cases where I get better performance with cv.glmnet.
Natively, glmnet uses this logic to decide the minimum lambda:
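For reference, my reading of that logic from the glmnet documentation (the real computation happens in Fortran): lambda_max is the smallest lambda at which all coefficients are zero, and the path decreases log-linearly from lambda_max down to lambda_max * lambda.min.ratio, where the default ratio depends on whether n < p:

```r
## rough sketch of glmnet's default lambda sequence, per its documentation
lambda_seq <- function(lambda_max, nobs, nvars, nlambda = 100) {
  ## documented defaults: 0.01 when n < p, 1e-04 otherwise
  min_ratio <- ifelse(nobs < nvars, 0.01, 1e-04)
  exp(seq(log(lambda_max), log(lambda_max * min_ratio),
          length.out = nlambda))
}
```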
And currently this is the code in the caret glmnet grid function:
This is just some benchmarking I did with a built-in dataset.
With alpha = 1, the full set of lambdas takes 7.47 sec versus 1.41 sec for the single lambda 0.001.
Anyway, I was just wondering if there is a reason for this, or if in the future we might think about having glmnet in caret run with the native lambda sequence. I know that the obvious workaround to get a more glmnet-like set of lambdas is to specify tuneGrid in train manually.
Thanks for the feedback, Reynald