topepo / caret

caret (Classification And Regression Training) R package that contains misc functions for training and plotting classification and regression models
http://topepo.github.io/caret/index.html

GBM fails when doing quantile regression #309

Open zachmayer opened 8 years ago

zachmayer commented 8 years ago
library(caret)
library(gbm)
data(iris)
X <- iris[,2:4]
Y <- iris[,1]

gbmFit1 <- train(
  X, Y,
  method = "gbm", verbose=FALSE,
  distribution = list(name="quantile",alpha=0.25),
  trControl = trainControl(method = "cv")
)

I suspect caret isn't properly handling the predictions that come back from the quantile-regression GBM, but I'm not sure.

zachmayer commented 8 years ago
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin14.5.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  splines   stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] plyr_1.8.3      gbm_2.1.1       survival_2.38-3 caret_6.0-58    ggplot2_1.0.1   lattice_0.20-33

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.1        magrittr_1.5       MASS_7.3-44        munsell_0.4.2      colorspace_1.2-6   foreach_1.4.3      minqa_1.2.4       
 [8] stringr_1.0.0      car_2.1-0          tools_3.2.2        nnet_7.3-11        pbkrtest_0.4-2     grid_3.2.2         gtable_0.1.2      
[15] nlme_3.1-122       mgcv_1.8-8         quantreg_5.19      MatrixModels_0.4-1 iterators_1.0.8    lme4_1.1-10        digest_0.6.8      
[22] Matrix_1.2-2       nloptr_1.0.4       reshape2_1.4.1     codetools_0.2-14   stringi_1.0-1      compiler_3.2.2     scales_0.3.0      
[29] stats4_3.2.2       SparseM_1.7        proto_0.3-10

topepo commented 8 years ago

I think that it is fixed now. Please test. Also:

zachmayer commented 8 years ago

Thanks! I'll check it out.

topepo commented 8 years ago

Did this work?

zachmayer commented 8 years ago

I installed master with: devtools::install_github('topepo/caret/pkg/caret@master') and re-ran:

library(caret)
library(gbm)
data(iris)
X <- iris[,2:4]
Y <- iris[,1]

gbmFit1 <- train(
  X, Y,
  method = "gbm", verbose=FALSE,
  distribution = list(name="quantile",alpha=0.25),
  trControl = trainControl(method = "cv")
)

But I still got an error:

Error in { : 
  task 1 failed - "arguments imply differing number of rows: 3, 0"

I think the problem is that I'm providing distribution as a list, list(name="quantile",alpha=0.25), rather than as a character string, "quantile".

This will also be a problem for the pairwise distribution, which needs extra arguments in the list, e.g. distribution=list(name="pairwise",group=iris$Species,metric='mrr')
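
For context, the message in the error above is the one base R's data.frame() raises when its columns have different lengths. A minimal base-R sketch reproduces it by pairing three observed values with an empty prediction vector. The names obs and pred are made up for illustration and this is not caret's code; it only shows that the message would be consistent with something coming back empty inside the resampling loop:

```r
# Three observed values paired with a zero-length prediction vector.
# `obs` and `pred` are hypothetical names, not caret internals.
obs <- c(5.1, 4.9, 4.7)
pred <- numeric(0)

# Capture the error message instead of stopping the script.
msg <- tryCatch(
  {
    data.frame(obs = obs, pred = pred)
    "no error"
  },
  error = function(e) conditionMessage(e)
)

print(msg)
```

In an English locale the printed message matches the text after "task 1 failed" in the error above.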

zachmayer commented 8 years ago

Interesting. It works if you specify trainControl(method = 'none'), but fails if you specify trainControl(method = 'cv', number=5).

zachmayer commented 8 years ago

I tried all the GBM distributions, with interesting results:

set.seed(1)
library(caret)
library(gbm)
dat <- twoClassSim()
X <- dat[,1:15]
Y <- as.integer(dat[,16]) - 1

ctrl <- trainControl(method = 'cv', number=5)

Working:

train(
  X, Y, method='gbm', distribution='gaussian', verbose=FALSE,
  trControl=ctrl, tuneLength=1
)
train(
  X, Y, method='gbm', distribution='laplace', verbose=FALSE,
  trControl=ctrl, tuneLength=1
)
train(
  X, Y, method='gbm', distribution='tdist', verbose=FALSE,
  trControl=ctrl, tuneLength=1
)
train(
  X, Y, method='gbm', distribution='poisson', verbose=FALSE,
  trControl=ctrl, tuneLength=1
)
train(
  X, factor(Y), method='gbm', distribution='bernoulli', verbose=FALSE,
  trControl=ctrl, tuneLength=1
)
train(
  X, factor(Y), method='gbm', distribution='huberized', verbose=FALSE,
  trControl=ctrl, tuneLength=1
)
train(
  X, factor(Y), method='gbm', distribution='adaboost', verbose=FALSE,
  trControl=ctrl, tuneLength=1
)
train(
  X, Y, method='gbm', distribution=list(name="tdist", df=8), verbose=FALSE,
  trControl=ctrl, tuneLength=1
)

Not working:

train(
  X, Y, method='gbm', distribution=list(name="quantile",alpha=0.25), verbose=FALSE,
  trControl=ctrl, tuneLength=1
)
train(
  X, Y, method='gbm', distribution=list(name="pairwise", group=1, metric='mrr'), verbose=FALSE,
  trControl=ctrl, tuneLength=1
)
train(
  X, Surv(Y), method='gbm', distribution='coxph', verbose=FALSE,
  trControl=ctrl, tuneLength=1
)

So the quantile, pairwise, and coxph (survival) distributions don't work through train() at the moment.
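
One pattern that would split the distributions this way is a prediction wrapper that dispatches on the distribution name with switch() and has no branch for some names. In R, switch() on a string with no matching branch and no default returns NULL invisibly, so the caller ends up with no predictions at all. The sketch below is an assumption for illustration only: format_predictions is a hypothetical function, and nothing in this thread confirms that caret's gbm code looks like this.

```r
# Hypothetical wrapper: NOT caret's actual code, just an illustration of
# how a missing switch() branch silently yields NULL.
format_predictions <- function(raw, dist_name) {
  switch(dist_name,
    gaussian = ,
    laplace = ,
    tdist = ,
    poisson = raw,                       # regression: pass predictions through
    bernoulli = ,
    huberized = ,
    adaboost = ifelse(raw >= 0.5, 1, 0)  # classification: threshold at 0.5
    # no branch for "quantile", "pairwise", or "coxph"
  )
}

raw <- c(0.2, 0.7, 0.9)
ok <- format_predictions(raw, "gaussian")
missing_branch <- format_predictions(raw, "quantile")

length(ok)               # 3
is.null(missing_branch)  # TRUE
```

If something like this turns out to be the cause, adding branches for the missing names, or a default that returns the raw predictions, would be one possible fix.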

zachmayer commented 8 years ago

FYI, here's the gbm.fit code for all of the above models:

gbm.fit(X, Y, distribution='gaussian', verbose=FALSE)
gbm.fit(X, Y, distribution='laplace', verbose=FALSE)
gbm.fit(X, Y, distribution='tdist', verbose=FALSE)
gbm.fit(X, Y, distribution='poisson', verbose=FALSE)
gbm.fit(X, factor(Y), distribution='bernoulli', verbose=FALSE)
gbm.fit(X, factor(Y), distribution='huberized', verbose=FALSE)
gbm.fit(X, factor(Y), distribution='adaboost', verbose=FALSE)
gbm.fit(X, Y, distribution=list(name="tdist", df=8), verbose=FALSE)
gbm.fit(X, Y, distribution=list(name="quantile",alpha=0.25), verbose=FALSE)
gbm.fit(X, Y, distribution=list(name="pairwise", group=1, metric='mrr'), verbose=FALSE)
gbm.fit(X, Surv(Y), distribution='coxph', verbose=FALSE)

You can see that they all produce models when gbm.fit is called directly.
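
To go one step further than checking that a model object comes back, a quick sketch like the following looks at the predictions from a direct quantile fit on the iris data from the original report. It only uses gbm.fit and predict from gbm. The choice of n.trees = 100 is arbitrary, and the coverage check at the end is just a sanity check, not something that was run in this thread.

```r
library(gbm)

data(iris)
X <- iris[, 2:4]
Y <- iris[, 1]

set.seed(1)
fit <- gbm.fit(
  X, Y,
  distribution = list(name = "quantile", alpha = 0.25),
  n.trees = 100,  # arbitrary, chosen for the illustration
  verbose = FALSE
)

# One numeric prediction per row would suggest the gbm side is fine.
pred <- predict(fit, newdata = X, n.trees = 100)
length(pred) == nrow(X)

# Share of observations at or below the predicted 25% quantile.
mean(Y <= pred)
```

If the predictions look sensible here, the failure is more likely in how train() handles them during resampling than in gbm itself.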

scworland commented 8 years ago

Was this ever resolved? I am still getting a similar error when using certain distributions with gbm through caret. This works:

devtools::install_github("gbm-developers/gbm")
fit1 <- gbm.fit(x,y,distribution="gamma")

but this returns an error:

library(caret)
fit2 <- train(x,y, method='gbm', distribution='gamma', trControl=ctrl, tuneLength=1)

task 1 failed - "arguments imply differing number of rows: 3, 0"

The train() call runs fine if the distribution is changed to 'gaussian'.