topepo / caret

caret (Classification And Regression Training) is an R package that contains miscellaneous functions for training and plotting classification and regression models
http://topepo.github.io/caret/index.html

xgbLinear does not set parameter 'booster=gblinear' (and fits tree based model) #1158

Open erikson84 opened 4 years ago

erikson84 commented 4 years ago

When training a model using method='xgbLinear', caret does not set the proper parameter in XGBoost (booster='gblinear') and the resulting model is based on the regression tree base learner.
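The missing parameter can be seen directly in caret's model specification. A quick sketch: printing the fit function that caret dispatches to for this method shows the params list it builds, which (per this report) contains no booster entry, so xgboost falls back to its default, booster = 'gbtree'.

```r
library(caret)

# Print the fit function caret uses for method = "xgbLinear".
# Per this issue, the params list it constructs never sets booster,
# so xgboost's default (booster = "gbtree") is used instead of "gblinear".
getModelInfo("xgbLinear")$xgbLinear$fit
```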

Minimal, reproducible example:

Minimal dataset:

library(caret)
library(xgboost)

set.seed(1)
params <- data.frame(nrounds=200, lambda=0, alpha=0, eta=0.3)

sim_data <- twoClassSim()

X <- as.matrix(sim_data[, 1:(ncol(sim_data) - 1)])
y <- sim_data$Class

Minimal, runnable code:

caretLin <- train(X, y, method='xgbLinear', tuneGrid = params,
                  trControl=trainControl(number=1), metric='error')

xgbLin <- xgboost(data=X, label=as.numeric(factor(y))-1, 
                  nrounds=200, 
                  params=list(booster='gblinear', objective='binary:logistic', lambda=0, alpha=0, eta=0.3))

xgbTree <- xgboost(data=X, label=as.numeric(factor(y))-1, 
                   nrounds=200, 
                   params=list(objective='binary:logistic', lambda=0, alpha=0, eta=0.3))

# Predictions are the same as for the tree model
table(predict(xgbTree, X)>0.5, predict(caretLin))

# Very similar to, but different from, the linear model
table(predict(xgbLin, X)>0.5, predict(caretLin))

# I don't know if this is a good way to assert the models are the same, but here it is
# Same size, almost equal
mean(caretLin$finalModel$raw == xgbTree$raw)
# Different sizes, mean doesn't mean much (ha!)
mean(xgbLin$raw == caretLin$finalModel$raw)
# Raw sizes for comparison - xgbLin is much smaller
c(caretLin=length(caretLin$finalModel$raw), xgbTree=length(xgbTree$raw), xgbLin=length(xgbLin$raw))
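A more direct check than comparing raw byte vectors is to inspect the fitted booster's structure. This is a self-contained sketch (using tiny simulated data, not the models above): xgb.dump() returns tree text containing "leaf" nodes for a gbtree model, but "bias"/"weight" sections for a gblinear model.

```r
library(xgboost)

# Fit one model with each booster on tiny simulated data
set.seed(1)
X <- matrix(rnorm(200), ncol = 2)
y <- rbinom(100, 1, 0.5)

m_tree <- xgboost(data = X, label = y, nrounds = 2, verbose = 0,
                  params = list(objective = "binary:logistic"))
m_lin  <- xgboost(data = X, label = y, nrounds = 2, verbose = 0,
                  params = list(booster = "gblinear",
                                objective = "binary:logistic"))

# gbtree dumps tree structure; gblinear dumps bias and weights
any(grepl("leaf", xgb.dump(m_tree)))
any(grepl("bias", xgb.dump(m_lin)))
```

Applying the same check to caretLin$finalModel shows tree structure, confirming the wrong booster was used.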

Session Info:

R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=Portuguese_Brazil.1252  LC_CTYPE=Portuguese_Brazil.1252   
[3] LC_MONETARY=Portuguese_Brazil.1252 LC_NUMERIC=C                      
[5] LC_TIME=Portuguese_Brazil.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] xgboost_1.0.0.2 caret_6.0-86    ggplot2_3.3.2   lattice_0.20-38

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6         pillar_1.4.3         compiler_3.6.3       gower_0.2.1         
 [5] plyr_1.8.6           class_7.3-15         iterators_1.0.12     tools_3.6.3         
 [9] rpart_4.1-15         ipred_0.9-9          lubridate_1.7.9      lifecycle_0.2.0     
[13] tibble_3.0.1         gtable_0.3.0         nlme_3.1-144         pkgconfig_2.0.3     
[17] rlang_0.4.6          Matrix_1.2-18        foreach_1.5.0        rstudioapi_0.11     
[21] yaml_2.2.1           prodlim_2019.11.13   xfun_0.15            e1071_1.7-3         
[25] stringr_1.4.0        withr_2.1.2          dplyr_1.0.0          knitr_1.29          
[29] pROC_1.16.1          generics_0.0.2       vctrs_0.3.1          recipes_0.1.13      
[33] stats4_3.6.3         nnet_7.3-12          grid_3.6.3           tidyselect_1.1.0    
[37] data.table_1.12.8    glue_1.4.0           R6_2.4.1             DALEX_1.3.0         
[41] survival_3.1-8       lava_1.6.6           reshape2_1.4.3       purrr_0.3.4         
[45] magrittr_1.5         ModelMetrics_1.2.2.2 splines_3.6.3        MASS_7.3-51.5       
[49] scales_1.0.0         codetools_0.2-16     ellipsis_0.3.0       timeDate_3043.102   
[53] colorspace_1.4-1     stringi_1.4.6        munsell_0.5.0        crayon_1.3.4   
david-hurley commented 2 years ago

This has been open for quite some time with no response from the dev team. I have noticed the same issue: as of now, booster = 'gblinear' is not being set in the xgbLinear model script that is referenced when calling method = 'xgbLinear'. As a result, method = 'xgbLinear' defaults to the gbtree booster. See the example below: both methods produce the exact same RMSE.

# create some fake test data
set.seed(1)
y <- rnorm(100,20,10)
x1 <- rnorm(100,50,9)
x2 <- rnorm(100,200,64)
train_data <- data.frame(y, x1, x2)  # data frame needed for the formula interface

# 10 fold cv
train_stratified_control <- caret::trainControl(
  method = "cv", 
  number = 10
) 

############################ xgblinear ########################################

# defaults from the xgboost manual for xgbLinear
xgboost_linear_grid <- expand.grid(
  nrounds = 100, 
  eta = 0.3,
  alpha = 0,
  lambda = 1
)

set.seed(1)

# specify the linear booster via method = "xgbLinear"
xgboost_linear_model <- caret::train(
  y ~., 
  data = train_data, 
  method = "xgbLinear",
  trControl = train_stratified_control,
  metric = "RMSE",
  tuneGrid = xgboost_linear_grid,
  verbose = FALSE
)

print('RMSE for caret::train XGBLinear')
xgboost_linear_model$results$RMSE

############################ xgbtree ########################################

# defaults from the xgboost manual - the tree-based booster needs these
xgboost_tree_grid <- expand.grid(
  eta = 0.3,
  max_depth = 6,
  nrounds = 100,
  gamma = 0,
  colsample_bytree = 1,
  min_child_weight = 1,
  subsample = 1
)

set.seed(1)

# specify the tree booster via method = "xgbTree"
xgboost_tree_model <- caret::train(
  y ~., 
  data = train_data, 
  method = "xgbTree",
  trControl = train_stratified_control,
  metric = "RMSE",
  tuneGrid = xgboost_tree_grid
)

print('RMSE for caret::train XGBTree')
mean(xgboost_tree_model$results$RMSE)
gmonaie commented 2 years ago

Just adding to this thread -- I agree that this is indeed an outstanding issue in the fit method of getModelInfo("xgbLinear")$xgbLinear.

In the meantime, this can be worked around by simply passing the additional argument booster = 'gblinear' to caret::train(); xgboost will pick it up in the parameters.
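A sketch of that workaround, assuming (as suggested above) that extra arguments to caret::train() are forwarded to xgb.train(). The grid values and fold count here are arbitrary, chosen only to keep the example fast.

```r
library(caret)
library(xgboost)

set.seed(1)
dat <- twoClassSim()

# Pass booster = "gblinear" explicitly; caret forwards it to xgb.train(),
# overriding the gbtree default that method = "xgbLinear" currently falls into
fit <- train(Class ~ ., data = dat, method = "xgbLinear",
             tuneGrid = data.frame(nrounds = 50, lambda = 0, alpha = 0, eta = 0.3),
             trControl = trainControl(method = "cv", number = 3),
             booster = "gblinear")

# The final model now dumps linear bias/weights rather than trees
head(xgb.dump(fit$finalModel))
```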