topepo / caret

caret (Classification And Regression Training) is an R package that contains miscellaneous functions for training and plotting classification and regression models
http://topepo.github.io/caret/index.html

"Objective" and another "eval_metric" passed parameters in train() ?... #389

Closed coforfe closed 8 years ago

coforfe commented 8 years ago

Hello,

With the following example, I would like to understand whether it is allowed to pass extra parameters to train() that set a new objective function and a new eval_metric.

#----- XGB OWN AND DEFERRED METRIC
library(caret)
data(iris)

irisbig <- rbind(iris, iris)
for(i in 1:400){
  irisbig <- rbind(irisbig, iris)
}

irisbig$Species <- as.factor(as.numeric(irisbig$Species))

inTrain <- createDataPartition(irisbig$Species, p = 0.70 , list = FALSE)
trainDat <- irisbig[ inTrain, ]
testDat <- irisbig[ -inTrain, ]

## MODEL A - NO XGB METRIC
set.seed(6879)

bootControl <- trainControl(number=10)

xgbGrid <- expand.grid(
  eta = 0.3,
  max_depth = 1,
  nrounds = 50,
  gamma = 0,
  colsample_bytree = 0.6,
  min_child_weight = 1
)

modFitxgb_A <-  train(
  Species ~ .,
  data = trainDat,
  trControl = bootControl,
  tuneGrid = xgbGrid,
  metric = "Accuracy",
  method = "xgbTree",
  verbose = 1,
  num_class = 3
)

modFitxgb_A

Which produces this result:

eXtreme Gradient Boosting 

42210 samples
    4 predictor
    3 classes: '1', '2', '3' 

No pre-processing
Resampling: Bootstrapped (10 reps) 
Summary of sample sizes: 42210, 42210, 42210, 42210, 42210, 42210, ... 
Resampling results

  Accuracy   Kappa      Accuracy SD  Kappa SD  
  0.9932423  0.9898633  0.005744415  0.00861682

Tuning parameter 'nrounds' was held constant at a value of 50
Tuning parameter 'colsample_bytree' was held constant at a value of 0.6
Tuning parameter 'min_child_weight' was held constant at a value of 1

Now I define a new objective and an eval_metric (both are parameters that can be set for xgboost), with the same xgbGrid and the same dataset:

## MODEL B - XGB METRIC AND OBJECTIVE
set.seed(6879)

bootControl <- trainControl(number=10)

xgbGrid <- expand.grid(
  eta = 0.3,
  max_depth = 1,
  nrounds = 50,
  gamma = 0,
  colsample_bytree = 0.6,
  min_child_weight = 1
)

modFitxgb_B <-  train(
  Species ~ .,
  data = trainDat,
  trControl = bootControl,
  tuneGrid = xgbGrid,
  metric = "Accuracy",
  method = "xgbTree",
  verbose = 1,
  objective = "multi:softmax",
  eval_metric = "mlogloss",
  num_class = 3
)

modFitxgb_B

Which produces an error:

Error in { : 
  task 1 failed - "arguments imply differing number of rows: 5194, 15581"
In addition: Warning messages:
1: In matrix(out, ncol = length(modelFit$obsLevels), byrow = TRUE) :
  data length [15581] is not a sub-multiple or multiple of the number of rows [5194]
2: In matrix(out, ncol = length(modelFit$obsLevels), byrow = TRUE) :
  data length [15553] is not a sub-multiple or multiple of the number of rows [5185]
3: In matrix(out, ncol = length(modelFit$obsLevels), byrow = TRUE) :
  data length [15467] is not a sub-multiple or multiple of the number of rows [5156]
4: In matrix(out, ncol = length(modelFit$obsLevels), byrow = TRUE) :
  data length [15575] is not a sub-multiple or multiple of the number of rows [5192]
5: In matrix(out, ncol = length(modelFit$obsLevels), byrow = TRUE) :
  data length [15424] is not a sub-multiple or multiple of the number of rows [5142]

This error could be related to the objective (?). Without that parameter, the model runs and yields:

## MODEL B (RERUN) - XGB METRIC ONLY, OBJECTIVE COMMENTED OUT
set.seed(6879)

bootControl <- trainControl(number=10)

xgbGrid <- expand.grid(
  eta = 0.3,
  max_depth = 1,
  nrounds = 50,
  gamma = 0,
  colsample_bytree = 0.6,
  min_child_weight = 1
)

modFitxgb_B <-  train(
  Species ~ .,
  data = trainDat,
  trControl = bootControl,
  tuneGrid = xgbGrid,
  metric = "Accuracy",
  method = "xgbTree",
  verbose = 1,
  #objective = "multi:softmax",
  eval_metric = "mlogloss",
  num_class = 3
)

modFitxgb_B

eXtreme Gradient Boosting 

42210 samples
    4 predictor
    3 classes: '1', '2', '3' 

No pre-processing
Resampling: Bootstrapped (10 reps) 
Summary of sample sizes: 42210, 42210, 42210, 42210, 42210, 42210, ... 
Resampling results

  Accuracy   Kappa      Accuracy SD  Kappa SD  
  0.9932423  0.9898633  0.005744415  0.00861682

Tuning parameter 'nrounds' was held constant at a value of 50
Tuning parameter 'colsample_bytree' was held constant at a value of 0.6
Tuning parameter 'min_child_weight' was held constant at a value of 1

Which basically yields the same Accuracy as the first version with no xgboost metric.

I know that in trainControl() you can define your own metric. But is it not possible to pass, with a parameter, the metric already defined in the underlying function (xgboost in this case)?
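For reference, a minimal sketch of the trainControl() route (the objects dat and Class are hypothetical, not the iris example above; mnLogLoss is a summary function shipped with caret, and it needs class probabilities plus factor levels that are valid R names):

ctrl <- trainControl(number = 10,
                     classProbs = TRUE,           # mnLogLoss needs class probabilities
                     summaryFunction = mnLogLoss) # scores resamples by multiclass log loss

mod <- train(Class ~ ., data = dat, method = "xgbTree",
             trControl = ctrl,
             metric = "logLoss",  # the statistic mnLogLoss returns
             maximize = FALSE)    # log loss is minimized, not maximized

Note this only changes how caret ranks candidate models during resampling; it is not handed to xgboost as a training objective.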

And regarding the error that appears when the objective is defined, what could explain it?

> version
               _                           
platform       x86_64-apple-darwin13.4.0   
arch           x86_64                      
os             darwin13.4.0                
system         x86_64, darwin13.4.0        
status                                     
major          3                           
minor          2.3                         
year           2015                        
month          12                          
day            10                          
svn rev        69752                       
language       R                           
version.string R version 3.2.3 (2015-12-10)
nickname       Wooden Christmas-Tree       
> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.3 (El Capitan)

locale:
[1] es_ES.UTF-8/es_ES.UTF-8/es_ES.UTF-8/C/es_ES.UTF-8/es_ES.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] plyr_1.8.3      xgboost_0.4-3   caret_6.0-64    ggplot2_2.1.0  
[5] lattice_0.20-33 deepboost_0.1.4

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.3        magrittr_1.5       splines_3.2.3      MASS_7.3-45       
 [5] munsell_0.4.3      colorspace_1.2-6   foreach_1.4.3      minqa_1.2.4       
 [9] stringr_1.0.0      car_2.1-1          tools_3.2.3        parallel_3.2.3    
[13] nnet_7.3-12        pbkrtest_0.4-6     grid_3.2.3         data.table_1.9.6  
[17] gtable_0.2.0       nlme_3.1-125       mgcv_1.8-11        quantreg_5.21     
[21] e1071_1.6-7        class_7.3-14       MatrixModels_0.4-1 iterators_1.0.8   
[25] lme4_1.1-11        Matrix_1.2-4       nloptr_1.0.4       reshape2_1.4.1    
[29] codetools_0.2-14   rsconnect_0.4.1.4  stringi_1.0-1      compiler_3.2.3    
[33] scales_0.4.0       stats4_3.2.3       SparseM_1.7        chron_2.3-47      
> 

Thanks, Carlos.

hubdr commented 8 years ago

According to modelLookup("xgbTree"), there seems to be support for passing just these parameters from caret to xgb:

nrounds
max_depth
eta
gamma
colsample_bytree
min_child_weight

The objective and eval_metric parameters you're trying to pass are probably not supported.
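A quick way to check that list yourself (assuming caret is attached):

library(caret)
## one row per tuning parameter that train() recognizes for "xgbTree"
modelLookup("xgbTree")$parameter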

Also, I believe the metric you can custom-define in caret is only used by caret to select the best model; again, it is not passed to xgb as an objective for training purposes. xgb must be using its own defaults, compatible with the parameters it does receive.

It seems there may be a way to do what you're asking as described here: http://topepo.github.io/caret/custom_models.html

coforfe commented 8 years ago

Thanks for your reply, but I think that is not fully accurate.

set.seed(825)
gbmFit1 <- train(Class ~ ., data = training,
                 method = "gbm",
                 trControl = fitControl,
                 ## This last option is actually one
                 ## for gbm() that passes through
                 verbose = FALSE)
gbmFit1

I have already used this feature for finer model tuning with parameters that are not in the tuning grid, for xgboost as well as for other types of models: ranger, gbm, etc. But it's true that for a metric parameter, perhaps I am mistaken.

Thanks, Carlos.

topepo commented 8 years ago

Sorry for the late response...

While the three dots do point to the xgb.train call, we automatically set objective to be either "binary:logistic", "multi:softprob", or "reg:linear" depending on the case. I would expect that passing eval_metric to train will send it to xgb.train without issue.
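A paraphrased sketch of that dispatch (not the verbatim xgbTree fit code; y and lev stand for the outcome and its class levels as caret hands them to the fit module):

if (is.factor(y)) {
  if (length(lev) == 2) {
    objective <- "binary:logistic"  # two-class case
  } else {
    objective <- "multi:softprob"   # multiclass also needs num_class = length(lev)
  }
} else {
  objective <- "reg:linear"         # numeric outcome
}

Because the fit code sets objective itself, a user-supplied objective arriving through the three dots can conflict with caret's prediction code, which reshapes the output assuming the defaults above (hence the matrix() warnings in the error).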

For now, you can make a copy of the model code using getModelInfo and change objective using a custom method.

For some models (including gbm), we make these default choices but let the user override the values. The code is a little more complex, but it is doable, so I'll add that to the "to do" list. This is a better long-term solution.

Max

coforfe commented 8 years ago

Thanks again Max for the clarifications and the references.

Carlos.

slfan2013 commented 8 years ago

Hi Max,

I wonder if it is possible to set objective = "rank:pairwise" in the caret method "xgbTree"? Or can only binary:logistic, multi:softprob, and reg:linear be passed?

Thanks!

DrJerryTAO commented 1 week ago

Note that the link to "a custom method" was broken as of 2024. I believe the new link should go to "Chapter 13 Using Your Own Model in train": https://topepo.github.io/caret/using-your-own-model-in-train.html. Can @topepo confirm?

Would you consider changing the behavior of train(method = "xgbTree") in https://github.com/topepo/caret/blob/master/models/files/xgbTree.R and the other xgb models so that the default objective argument is set only when it is not specified in train()?

To make the objective argument effective in train() for xgboost models, I followed the minimal example at https://topepo.github.io/caret/using-your-own-model-in-train.html#Illustration6 and referenced the existing model info of xgbTree objects https://github.com/topepo/caret/blob/master/models/files/xgbTree.R.

xgb_custom <- getModelInfo("xgbTree", regex = FALSE)[[1]]
xgb_custom$fit <- function(x, y, wts, param, lev, last, classProbs, ...) {
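  ## Same signature as the stock xgbTree fit module; extra arguments given
  ## to train() arrive here through `...` and are forwarded to xgb.train().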

  if(!inherits(x, "xgb.DMatrix"))
    x <- xgboost::xgb.DMatrix(x, label = y, missing = NA) else
      xgboost::setinfo(x, "label", y)

  if (!is.null(wts))
    xgboost::setinfo(x, 'weight', wts)

  out <- xgboost::xgb.train(
    list(eta = param$eta,
         max_depth = param$max_depth,
         gamma = param$gamma,
         colsample_bytree = param$colsample_bytree,
         min_child_weight = param$min_child_weight,
         subsample = param$subsample),
    data = x,
    nrounds = param$nrounds,
    # objective = "reg:squarederror",
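    ## key change: with no hard-coded objective in this call, an `objective`
    ## supplied to train() passes through `...` to xgb.train() instead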
    ...)
  out
}

Then I can use the newly defined method with other objective functions, for example:

train(y ~ ., data = data, method = xgb_custom,
      objective = "reg:tweedie", eval_metric = "mae",
      base_score = median(data$y))