topepo / caret

caret (Classification And Regression Training) is an R package that contains miscellaneous functions for training and plotting classification and regression models
http://topepo.github.io/caret/index.html

"Objective" and another "eval_metric" passed parameters in train() ?... #389

Closed coforfe closed 8 years ago

coforfe commented 8 years ago

Hello,

With the following example, I would like to understand whether it is allowed to pass extra parameters to train() that set a new objective function and a new eval_metric.

#----- XGB OWN AND DEFERRED METRIC
library(caret)
data(iris)

irisbig <- rbind(iris, iris)
for(i in 1:400){
  irisbig <- rbind(irisbig, iris)
}

irisbig$Species <- as.factor(as.numeric(irisbig$Species))

inTrain <- createDataPartition(irisbig$Species, p = 0.70 , list = FALSE)
trainDat <- irisbig[ inTrain, ]
testDat <- irisbig[ -inTrain, ]

## MODEL A - NO XGB METRIC
set.seed(6879)

bootControl <- trainControl(number=10)

xgbGrid <- expand.grid(
  eta = 0.3,
  max_depth = 1,
  nrounds = 50,
  gamma = 0,
  colsample_bytree = 0.6,
  min_child_weight = 1
)

modFitxgb_A <-  train(
  Species ~ .,
  data = trainDat,
  trControl = bootControl,
  tuneGrid = xgbGrid,
  metric = "Accuracy",
  method = "xgbTree",
  verbose = 1,
  num_class = 3
)

modFitxgb_A

Which produces this result:

eXtreme Gradient Boosting 

42210 samples
    4 predictor
    3 classes: '1', '2', '3' 

No pre-processing
Resampling: Bootstrapped (10 reps) 
Summary of sample sizes: 42210, 42210, 42210, 42210, 42210, 42210, ... 
Resampling results

  Accuracy   Kappa      Accuracy SD  Kappa SD  
  0.9932423  0.9898633  0.005744415  0.00861682

Tuning parameter 'nrounds' was held constant at a value of 50
Tuning parameter 'colsample_bytree' was held constant at a value of 0.6
Tuning parameter 'min_child_weight' was held constant at a value of 1

Now I define a new objective and an eval_metric (both are parameters that can be set for xgboost), with the same xgbGrid and the same dataset:

## MODEL B - XGB METRIC AND OBJECTIVE
set.seed(6879)

bootControl <- trainControl(number=10)

xgbGrid <- expand.grid(
  eta = 0.3,
  max_depth = 1,
  nrounds = 50,
  gamma = 0,
  colsample_bytree = 0.6,
  min_child_weight = 1
)

modFitxgb_B <-  train(
  Species ~ .,
  data = trainDat,
  trControl = bootControl,
  tuneGrid = xgbGrid,
  metric = "Accuracy",
  method = "xgbTree",
  verbose = 1,
  objective = "multi:softmax",
  eval_metric = "mlogloss",
  num_class = 3
)

modFitxgb_B

Which produces an error:

Error in { : 
  task 1 failed - "arguments imply differing number of rows: 5194, 15581"
In addition: Warning messages:
1: In matrix(out, ncol = length(modelFit$obsLevels), byrow = TRUE) :
  data length [15581] is not a sub-multiple or multiple of the number of rows [5194]
2: In matrix(out, ncol = length(modelFit$obsLevels), byrow = TRUE) :
  data length [15553] is not a sub-multiple or multiple of the number of rows [5185]
3: In matrix(out, ncol = length(modelFit$obsLevels), byrow = TRUE) :
  data length [15467] is not a sub-multiple or multiple of the number of rows [5156]
4: In matrix(out, ncol = length(modelFit$obsLevels), byrow = TRUE) :
  data length [15575] is not a sub-multiple or multiple of the number of rows [5192]
5: In matrix(out, ncol = length(modelFit$obsLevels), byrow = TRUE) :
  data length [15424] is not a sub-multiple or multiple of the number of rows [5142]

This error could be related to the objective (?). Without that parameter, the model runs and yields:

## MODEL B (RERUN) - XGB METRIC ONLY, OBJECTIVE COMMENTED OUT
set.seed(6879)

bootControl <- trainControl(number=10)

xgbGrid <- expand.grid(
  eta = 0.3,
  max_depth = 1,
  nrounds = 50,
  gamma = 0,
  colsample_bytree = 0.6,
  min_child_weight = 1
)

modFitxgb_B <-  train(
  Species ~ .,
  data = trainDat,
  trControl = bootControl,
  tuneGrid = xgbGrid,
  metric = "Accuracy",
  method = "xgbTree",
  verbose = 1,
  #objective = "multi:softmax",
  eval_metric = "mlogloss",
  num_class = 3
)

modFitxgb_B

eXtreme Gradient Boosting 

42210 samples
    4 predictor
    3 classes: '1', '2', '3' 

No pre-processing
Resampling: Bootstrapped (10 reps) 
Summary of sample sizes: 42210, 42210, 42210, 42210, 42210, 42210, ... 
Resampling results

  Accuracy   Kappa      Accuracy SD  Kappa SD  
  0.9932423  0.9898633  0.005744415  0.00861682

Tuning parameter 'nrounds' was held constant at a value of 50
Tuning parameter 'colsample_bytree' was held constant at a value of 0.6
Tuning parameter 'min_child_weight' was held constant at a value of 1

Which basically yields the same Accuracy as the first version with no xgboost metric.

I know that in trainControl() you can define your own metric. But is it not possible to pass, with a parameter, the metric already defined in the underlying function (xgboost in this case)?
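For reference, a minimal sketch of the trainControl() route (the objects dat and Class are hypothetical, not the iris example above; mnLogLoss is a summary function shipped with caret, and it needs class probabilities plus factor levels that are valid R names):

ctrl <- trainControl(number = 10,
                     classProbs = TRUE,           # mnLogLoss needs class probabilities
                     summaryFunction = mnLogLoss) # scores resamples by multiclass log loss

mod <- train(Class ~ ., data = dat, method = "xgbTree",
             trControl = ctrl,
             metric = "logLoss",  # the statistic mnLogLoss returns
             maximize = FALSE)    # log loss is minimized, not maximized

Note this only changes how caret ranks candidate models during resampling; it is not handed to xgboost as a training objective.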

And regarding the error that appears when the objective is defined, what could explain it?

> version
               _                           
platform       x86_64-apple-darwin13.4.0   
arch           x86_64                      
os             darwin13.4.0                
system         x86_64, darwin13.4.0        
status                                     
major          3                           
minor          2.3                         
year           2015                        
month          12                          
day            10                          
svn rev        69752                       
language       R                           
version.string R version 3.2.3 (2015-12-10)
nickname       Wooden Christmas-Tree       
> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.3 (El Capitan)

locale:
[1] es_ES.UTF-8/es_ES.UTF-8/es_ES.UTF-8/C/es_ES.UTF-8/es_ES.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] plyr_1.8.3      xgboost_0.4-3   caret_6.0-64    ggplot2_2.1.0  
[5] lattice_0.20-33 deepboost_0.1.4

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.3        magrittr_1.5       splines_3.2.3      MASS_7.3-45       
 [5] munsell_0.4.3      colorspace_1.2-6   foreach_1.4.3      minqa_1.2.4       
 [9] stringr_1.0.0      car_2.1-1          tools_3.2.3        parallel_3.2.3    
[13] nnet_7.3-12        pbkrtest_0.4-6     grid_3.2.3         data.table_1.9.6  
[17] gtable_0.2.0       nlme_3.1-125       mgcv_1.8-11        quantreg_5.21     
[21] e1071_1.6-7        class_7.3-14       MatrixModels_0.4-1 iterators_1.0.8   
[25] lme4_1.1-11        Matrix_1.2-4       nloptr_1.0.4       reshape2_1.4.1    
[29] codetools_0.2-14   rsconnect_0.4.1.4  stringi_1.0-1      compiler_3.2.3    
[33] scales_0.4.0       stats4_3.2.3       SparseM_1.7        chron_2.3-47      
> 

Thanks, Carlos.

hubdr commented 8 years ago

According to modelLookup("xgbTree"), there seems to be support for passing just these parameters from caret to xgb:

nrounds
max_depth
eta
gamma
colsample_bytree
min_child_weight

The objective and eval_metric parameters you're trying to pass are probably not supported.
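A quick way to check that list yourself (assuming caret is attached):

library(caret)
## one row per tuning parameter that train() recognizes for "xgbTree"
modelLookup("xgbTree")$parameter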

Also, I believe the metric you can custom-define in caret is only used by caret to select the best model; again, it is not passed to xgb as an objective for training purposes. xgb must be using its own defaults, compatible with the parameters it does receive.

It seems there may be a way to do what you're asking as described here: http://topepo.github.io/caret/custom_models.html

coforfe commented 8 years ago

Thanks for your reply, but I think that is not fully accurate.

set.seed(825)
gbmFit1 <- train(Class ~ ., data = training,
                 method = "gbm",
                 trControl = fitControl,
                 ## This last option is actually one
                 ## for gbm() that passes through
                 verbose = FALSE)
gbmFit1

I have already used this feature for finer model tuning with parameters that are not in the tuning grid, for xgboost as well as for other types of models: ranger, gbm, etc. But it's true that for a metric parameter, perhaps I am mistaken.

Thanks, Carlos.

topepo commented 8 years ago

Sorry for the late response...

While the three dots do point to the xgb.train call, we automatically set objective to be either "binary:logistic", "multi:softprob", or "reg:linear" depending on the case. I would expect that passing eval_metric to train will send it to xgb.train without issue.
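A paraphrased sketch of that dispatch (not the verbatim xgbTree fit code; y and lev stand for the outcome and its class levels as caret hands them to the fit module):

if (is.factor(y)) {
  if (length(lev) == 2) {
    objective <- "binary:logistic"  # two-class case
  } else {
    objective <- "multi:softprob"   # multiclass also needs num_class = length(lev)
  }
} else {
  objective <- "reg:linear"         # numeric outcome
}

Because the fit code sets objective itself, a user-supplied objective arriving through the three dots can conflict with caret's prediction code, which reshapes the output assuming the defaults above (hence the matrix() warnings in the error).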

For now, you can make a copy of the model code using getModelInfo and change objective using a custom method.

For some models (including gbm), we make these default choices but let the user override the values. The code is a little more complex, but it is doable, so I'll add that to the "to do" list. This is a better long-term solution.

Max

coforfe commented 8 years ago

Thanks again Max for the clarifications and the references.

Carlos.

slfan2013 commented 8 years ago

Hi Max,

I wonder if it is possible to set objective = "rank:pairwise" in the caret method "xgbTree"? Or can only binary:logistic, multi:softprob, and reg:linear be passed?

Thanks!

DrJerryTAO commented 1 week ago

Note that the link to "a custom method" was broken as of 2024. I believe the new link should go to "Chapter 13 Using Your Own Model in train": https://topepo.github.io/caret/using-your-own-model-in-train.html. Can @topepo confirm?

Would you consider changing the behavior of train(method = "xgbTree") in https://github.com/topepo/caret/blob/master/models/files/xgbTree.R and the other xgb models so that the default objective argument is set only when it is not specified in train()?

To make the objective argument effective in train() for xgboost models, I followed the minimal example at https://topepo.github.io/caret/using-your-own-model-in-train.html#Illustration6 and referenced the existing model info of xgbTree objects https://github.com/topepo/caret/blob/master/models/files/xgbTree.R.

xgb_custom <- getModelInfo("xgbTree", regex = FALSE)[[1]]
xgb_custom$fit <- function(x, y, wts, param, lev, last, classProbs, ...) {
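  ## Same signature as the stock xgbTree fit module; extra arguments given
  ## to train() arrive here through `...` and are forwarded to xgb.train().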

  if(!inherits(x, "xgb.DMatrix"))
    x <- xgboost::xgb.DMatrix(x, label = y, missing = NA) else
      xgboost::setinfo(x, "label", y)

  if (!is.null(wts))
    xgboost::setinfo(x, 'weight', wts)

  out <- xgboost::xgb.train(
    list(eta = param$eta,
         max_depth = param$max_depth,
         gamma = param$gamma,
         colsample_bytree = param$colsample_bytree,
         min_child_weight = param$min_child_weight,
         subsample = param$subsample),
    data = x,
    nrounds = param$nrounds,
    # objective = "reg:squarederror",
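    ## key change: with no hard-coded objective in this call, an `objective`
    ## supplied to train() passes through `...` to xgb.train() instead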
    ...)
  out
}

Then I can use the newly defined method with other objective functions, for example:

train(y ~ ., data = data, method = xgb_custom,
      objective = "reg:tweedie", eval_metric = "mae",
      base_score = median(data$y))