mlr-org / mlr

Machine Learning in R
https://mlr.mlr-org.com

post a mini use case that shows tuning xgboost with mbo. #2307

Closed. pat-s closed this issue 4 years ago.

pat-s commented 6 years ago

From @berndbischl on March 14, 2017 16:0

if you want to be cool, even show that with custom loss

Copied from original issue: mlr-org/mlr-tutorial#102

pat-s commented 6 years ago

This could/should probably be combined with a short comparison of all available tuning methods. Maybe reduce the coverage of grid and random search and focus on explaining the more complex ones?
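
For reference, a minimal sketch of how such a comparison could be set up in mlr, where the tuners differ only in the control object passed to the same `tuneParams()` call. The rpart learner, `iris.task`, parameter range, and budgets below are illustrative assumptions, not part of this issue:

library(mlr)

# toy setup: tune rpart's complexity parameter with 3-fold CV (illustrative only)
lrn   <- makeLearner("classif.rpart")
ps    <- makeParamSet(makeNumericParam("cp", lower = 0.001, upper = 0.1))
rdesc <- makeResampleDesc("CV", iters = 3)

# the tuners differ only in the control object
ctrl_grid   <- makeTuneControlGrid(resolution = 10)  # exhaustive grid
ctrl_random <- makeTuneControlRandom(maxit = 10)     # random search
ctrl_mbo    <- makeTuneControlMBO(budget = 10)       # model-based optimization via mlrMBO

results <- lapply(
  list(grid = ctrl_grid, random = ctrl_random, mbo = ctrl_mbo),
  function(ctrl) tuneParams(lrn, task = iris.task, resampling = rdesc,
                            par.set = ps, control = ctrl))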

SimonCoulombe commented 5 years ago

Something like this?

We are going to fit a Poisson regression using xgboost to model the number of claims a person makes during their insurance coverage. Insurance coverage has a variable exposure between 0 and 1 year.

We are also going to force a variable to have a monotonically decreasing effect. In this case we force "the older the vehicle, the less frequent the claims". It doesn't really make sense, but it is fun to demonstrate.


library(xgboost)
library(insuranceData) # example dataset https://cran.r-project.org/web/packages/insuranceData/insuranceData.pdf
library(tidyverse)
library(mlrMBO)
library(rBayesianOptimization) # for CV folds
data(dataCar)
mydb <- dataCar %>% select(numclaims, exposure, veh_value, veh_body,
                           veh_age, gender, area, agecat)

label_var <- "numclaims"    
offset_var <- "exposure"
feature_vars <- mydb %>% 
  select(-one_of(c(label_var, offset_var))) %>% 
  colnames()

# prepare the data for xgboost (one-hot encoding of categorical (factor) variables)
myformula <- paste0("~", paste0(feature_vars, collapse = " + ")) %>% as.formula()
dummyFier <- caret::dummyVars(myformula, data = mydb, fullRank = TRUE)
dummyVars.df <- predict(dummyFier, newdata = mydb)
mydb_dummy <- cbind(mydb %>% select(one_of(c(label_var, offset_var))), 
                    dummyVars.df)
rm(myformula, dummyFier, dummyVars.df)

feature_vars_dummy <- mydb_dummy %>%
  select(-one_of(c(label_var, offset_var))) %>%
  colnames()

# create xgb.matrix for xgboost consumption
mydb_xgbmatrix <- xgb.DMatrix(
  data = mydb_dummy %>% select(all_of(feature_vars_dummy)) %>% as.matrix(),
  label = mydb_dummy %>% pull(label_var),
  missing = NA)

# base_margin: the base prediction xgboost will boost from (here log(exposure), i.e. the offset)
setinfo(mydb_xgbmatrix, "base_margin", mydb %>% pull(offset_var) %>% log())

# a dummy constraint, just to illustrate how it is done: the older the car, the fewer claims?
myConstraint <- tibble(Variable = feature_vars_dummy) %>%
  mutate(sens = ifelse(Variable == "veh_age", -1, 0))

#  xgb.cv folds
cv_folds = rBayesianOptimization::KFold(mydb_dummy$numclaims,
                                        nfolds = 3,
                                        stratified = TRUE,
                                        seed = 0)

# objective function: we want to minimise the Poisson negative log-likelihood (i.e. maximise the log-likelihood) by tuning most parameters
obj.fun  <- makeSingleObjectiveFunction(
  name = "xgb_cv_bayes",
  fn =   function(x){
    set.seed(1234)
    cv <- xgb.cv(params = list(
      booster = "gbtree",
      eta = x["eta"],
      max_depth = x["max_depth"],
      min_child_weight = x["min_child_weight"],
      gamma = x["gamma"],
      subsample = x["subsample"],
      colsample_bytree = x["colsample_bytree"],
      objective = 'count:poisson', 
      eval_metric = "poisson-nloglik"),
      data = mydb_xgbmatrix,
      nrounds = 30,
      folds = cv_folds,
      monotone_constraints = myConstraint$sens,
      prediction = FALSE,
      showsd = TRUE,
      early_stopping_rounds = 10,
      verbose = 0)

    # return the best (lowest) cross-validated negative log-likelihood
    cv$evaluation_log[, min(test_poisson_nloglik_mean)]
  },
  par.set = makeParamSet(
    makeNumericParam("eta", lower = 0.001, upper = 0.05),
    makeNumericParam("gamma", lower = 0, upper = 5),
    makeIntegerParam("max_depth", lower= 1, upper = 10),
    makeIntegerParam("min_child_weight", lower= 1, upper = 10),
    makeNumericParam("subsample", lower = 0.2, upper = 1),
    makeNumericParam("colsample_bytree", lower = 0.2, upper = 1)
  ),
  minimize = TRUE
)

# generate an initial design with only 10 points (Latin hypercube sample)
des = generateDesign(n = 10, par.set = getParamSet(obj.fun), fun = lhs::randomLHS)

# I have my own favorite parameters that I really want to have tested
simon_params <- data.frame(max_depth = 6,
                           colsample_bytree = 0.8,
                           subsample = 0.8,
                           min_child_weight = 3,
                           eta = 0.01,
                           gamma = 0) %>% as_tibble()

# the final design is a combination of the Latin hypercube design and my own knowledge
final_design =  simon_params  %>% bind_rows(des)

# the Bayesian optimization will run for 10 iterations
control = makeMBOControl()
control = setMBOControlTermination(control, iters = 10)

# run this!
run = mbo(fun = obj.fun, 
          design = final_design,  
          control = control, 
          show.info = TRUE)

# print best model hyperparameters
run$x
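
A natural follow-up (not shown in the original comment) would be to refit a single model on the full data with the tuned hyperparameters; `nrounds = 30` below simply mirrors the CV setting above and is an illustrative choice:

# refit on the full DMatrix with the tuned hyperparameters from run$x
best_params <- c(list(booster = "gbtree",
                      objective = "count:poisson",
                      eval_metric = "poisson-nloglik",
                      monotone_constraints = myConstraint$sens),
                 run$x)

final_model <- xgb.train(params = best_params,
                         data = mydb_xgbmatrix,
                         nrounds = 30)

# inspect which features the tuned model relies on
xgb.importance(model = final_model)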
pat-s commented 4 years ago

Possible task to add to the {mlr3book} as a use-case once {mlr3mbo} is ready to use.