rlesca01 closed this issue 9 years ago
What happens if you take the default lambdas (100) and explicitly pass them to glmnet?
One idea would be to have caret use glmnet's default lambdas in the default tuneGrid.
If you run glmnet, pull out the lambdas, and plug them back in to run it again, it works fine and takes exactly the same amount of time.
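A minimal sketch of that round trip (assuming the glmnet package, a numeric matrix x, and a response y):

```r
library(glmnet)

## first fit: let glmnet compute its own lambda path
fit1 <- glmnet(x, y)

## second fit: pass those same lambdas back in explicitly
fit2 <- glmnet(x, y, lambda = fit1$lambda)
```

Both fits traverse the same path, which is why the second run costs about the same as the first.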
The problem is that the full lambda sequence is determined in Hastie's underlying Fortran code.
I've just been using tuneGrid and guessing at a good sequence if I'm going to use glmnet with caret. Setting a higher tuneLength expands alpha as well, which isn't desirable.
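For reference, a hand-rolled grid of that sort might look like this (the lambda range here is a guess for illustration, not a recommendation), fixing alpha to a short sequence so that tuneLength does not expand it:

```r
library(caret)

## hypothetical hand-picked grid: log-spaced lambdas, a few alphas
manualGrid <- expand.grid(alpha  = c(0.1, 0.55, 1),
                          lambda = 10^seq(-4, 0, length = 20))
## then: train(x, y, method = "glmnet", tuneGrid = manualGrid, ...)
```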
Thanks
I wonder if we could write a function that would approximate Hastie's underlying fortran code and take the guesswork out of finding a good lambda sequence. If we can do that, I think we can put it into caret.
The default code that you mention:

expand.grid(alpha = seq(0.1, 1, length = len),
            lambda = seq(.1, 3, length = 3 * len))

is really for the default grid. Instead of going the Fortran route, we can do something along the lines of the rpart models: fit an initial model and pull off the lambdas from that list.
While this isn't that computationally efficient, note that glmnet calls different worker functions depending on the distribution (e.g. elnet for Gaussian, fishnet for Poisson, etc.) that generate the lambdas and fits. For all we know, each of these has different math to get the lambda path.
One solution might be to have the grid function do something like this:

if(tuneLength > 5) {
  run an initial glmnet and get the lambda path
  pick off tuneLength values of the lambda path
} else use the default grid
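In R, that branch might be sketched roughly like so (a sketch only; the default-grid values are the ones quoted above):

```r
## rough sketch of the proposed grid logic
glmnetGrid <- function(x, y, tuneLength) {
  if (tuneLength > 5) {
    ## run an initial fit just to recover the lambda path
    init <- glmnet::glmnet(as.matrix(x), y)
    idx  <- round(seq(1, length(init$lambda), length.out = tuneLength))
    expand.grid(alpha  = seq(0.1, 1, length = tuneLength),
                lambda = init$lambda[idx])
  } else {
    ## fall back to the fixed default grid
    expand.grid(alpha  = seq(0.1, 1, length = tuneLength),
                lambda = seq(.1, 3, length = 3 * tuneLength))
  }
}
```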
If you are going to fit a granular grid of lambda values, the computational expense of the initial glmnet fit becomes less painful. I will see if there are options that can be used on the initial fit to shorten the time (e.g. setting max iterations = 1, etc.).
I just did some testing on moderately sized data sets.
First, doing an initial glmnet fit didn't cost much time compared to the overall process. This will not always be the case, but it is not too bad.
Second, it fits a lot of bad models, at least for the three disparate cases that I tried. The path includes a lot of models with really large lambda values that regularize the crap out of the coefficients. That doesn't mean it will never work but I would never try values of lambda in the thousands. There is code below for one example from APM*. The others that I tried use Pfizer data.
Third, this does expose a design flaw in the package. I should have passed ... to the grid function. Since I didn't, there is a disconnect between the model that might be fitted to create the grid and the models that are created by the fit function. We could add this to the existing functions easily, but it would break backward compatibility for custom models out there.
Overall, I think this would be a good idea for making the default grid. It might make sense, for small values of tuneLength (say < 5), to avoid excessively large values of lambda.
*
library(caret)
library(glmnet)
library(AppliedPredictiveModeling)
data(solubility)

set.seed(100)
indx <- createFolds(solTrainY, returnTrain = TRUE)
ctrl <- trainControl(method = "cv", index = indx)

init <- glmnet(as.matrix(solTrainXtrans), solTrainY)
enetGrid <- expand.grid(lambda = init$lambda[seq(1, length(init$lambda), by = 5)],
                        ## originally c(0, 0.01, .1)
                        alpha = seq(.05, 1, length = 20))
## now:
## > summary(init$lambda[seq(1, length(init$lambda), by = 5)])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0001954 0.0018140 0.0166700 0.1811000 0.1511000 1.3470000
set.seed(100)
gnetTune <- train(solTrainXtrans, solTrainY,
                  method = "glmnet",
                  tuneGrid = enetGrid,
                  trControl = ctrl,
                  preProc = c("center", "scale"))
## > system.time(glmnet(as.matrix(solTrainXtrans), solTrainY))[3]/gnetTune$times$everything[3]*100
## elapsed
## 1.506544
##
## the initial fit would have taken < 2% of the total fitting time
Ugh. I forgot about alpha.
I'll try this on some other examples too, but here is the APM example again:
> library(plyr)
> alphas <- (1:9)/10
>
> for(i in seq(along = alphas)) {
+ init <- glmnet(as.matrix(solTrainXtrans), solTrainY, alpha = alphas[i])
+ tmp <- data.frame(alpha = alphas[i], lambda = init$lambda)
+ out <- if(i == 1) tmp else rbind(out, tmp)
+ }
>
> ddply(out, .(alpha), function(x) summary(x$lambda))
alpha Min. 1st Qu. Median Mean 3rd Qu. Max.
1 0.1 0.0013470 0.013480 0.13490 1.5160 1.3480 13.470
2 0.2 0.0006735 0.006741 0.06743 0.7581 0.6741 6.735
3 0.3 0.0004490 0.004494 0.04495 0.5054 0.4494 4.490
4 0.4 0.0003368 0.003370 0.03371 0.3790 0.3370 3.368
5 0.5 0.0002694 0.002696 0.02697 0.3032 0.2696 2.694
6 0.6 0.0003575 0.003186 0.02833 0.2660 0.2525 2.245
7 0.7 0.0004879 0.003870 0.03067 0.2406 0.2430 1.924
8 0.8 0.0003890 0.003159 0.02559 0.2082 0.2078 1.684
9 0.9 0.0001497 0.001498 0.01498 0.1685 0.1498 1.497
I'm going to check in code that uses the function below with alpha = 0.5 to split the difference. Also, I trim away the largest and smallest values to make the grid less extreme.
Test as you see fit...
function(x, y, len = NULL) {
  ## work out the family from the outcome type
  numLev <- if(is.character(y) | is.factor(y)) length(levels(y)) else NA
  if(!is.na(numLev)) {
    fam <- ifelse(numLev > 2, "multinomial", "binomial")
  } else fam <- "gaussian"
  ## initial fit just to get the lambda path; alpha = 0.5 splits the difference
  init <- glmnet(as.matrix(x), y,
                 family = fam,
                 nlambda = len + 2,
                 alpha = .5)
  lambda <- unique(init$lambda)
  ## trim the largest and smallest values to make the grid less extreme
  lambda <- lambda[-c(1, length(lambda))]
  lambda <- lambda[1:min(length(lambda), len)]
  expand.grid(alpha = seq(0.1, 1, length = len),
              lambda = lambda)
}
added to the package
I noticed that when running glmnet with caret it seems to use fixed lambdas. A huge portion of the time it seems like I get 0.1 as the lambda selected. There may very well be a good reason for this but I know that using discrete lambdas is often not faster than running the full regularization path. In addition I've seen cases where I get better performance with cv.glmnet.
Natively, glmnet uses this logic to decide the minimum lambda:
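For reference, my reading of that logic from the glmnet documentation (the real computation happens in Fortran): lambda_max is the smallest lambda at which all coefficients are zero, and the path decreases log-linearly from lambda_max down to lambda_max * lambda.min.ratio, where the default ratio depends on whether n < p:

```r
## rough sketch of glmnet's default lambda sequence, per its documentation
lambda_seq <- function(lambda_max, nobs, nvars, nlambda = 100) {
  ## documented defaults: 0.01 when n < p, 1e-04 otherwise
  min_ratio <- ifelse(nobs < nvars, 0.01, 1e-04)
  exp(seq(log(lambda_max), log(lambda_max * min_ratio),
          length.out = nlambda))
}
```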
And currently this is the code in the caret glmnet grid function:
This is just some benchmarking I did with a built-in dataset.
With alpha = 1, the full set of lambdas takes 7.47 sec versus 1.41 sec for the single lambda 0.001.
Anyway, I was just wondering if there is a reason for this, or if in the future we might think about having glmnet in caret run with the native lambda sequence. I know that the obvious workaround to get a more glmnet-like set of lambdas is to specify tuneGrid in train manually.
Thanks for the feedback, Reynald