nredell / forecastML

An R package with Python support for multi-step-ahead forecasting with machine learning and deep learning algorithms

Repeated CV & Hyperparameter Tuning for XGBoost using caret in Forecast Function takes ages #40

Closed edgBR closed 4 years ago

edgBR commented 4 years ago

Dear Nick,

I am trying to wrap my caret hyperparameter tuning in the forecastML model function as follows:

library(caret)   # train(), trainControl()
library(dplyr)   # %>%
library(tidyr)   # drop_na()

model_function_xgboost <- function(data, outcome_col_name) {

  set.seed(224)

  fitControl <- trainControl(
    ## 5-fold CV, repeated 3 times
    method = "repeatedcv",
    number = 5,
    repeats = 3,
    verboseIter = TRUE)

  ## Build the model formula from the outcome column name passed in as a string.
  model <- train(as.formula(paste(outcome_col_name, "~ .")),
                 data = data %>% drop_na(),
                 method = "xgbTree",
                 trControl = fitControl,
                 nthread = 28)

  return(model)
}
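
For context, this model function is then handed to forecastML roughly as in the sketch below (a sketch only: the outcome column index, horizons, lookback, and window length are placeholders, and the exact train_model() arguments may differ between package versions):

data_train <- forecastML::create_lagged_df(data, type = "train",
                                           outcome_col = 1,
                                           horizons = c(1, 3, 6),
                                           lookback = 1:12)

windows <- forecastML::create_windows(data_train, window_length = 12)

## Each validation window/horizon gets its own model fit with the
## user-supplied model function.
model_results <- forecastML::train_model(data_train, windows,
                                         model_name = "xgb_caret",
                                         model_function = model_function_xgboost)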

Usually when you use caret, you can parallelize training across your train/test splits with allowParallel in the trainControl() function, and you can also set the nthread parameter for xgboost.
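
For reference, a minimal sketch of that caret pattern (the worker count, the outcome column y, and the training_data object are placeholders; when the resamples run in parallel it is usually safer to drop xgboost's nthread to 1 so the two layers of parallelism do not oversubscribe the cores):

library(caret)
library(doParallel)

## Register a parallel backend so caret can fit the resamples in parallel.
cl <- makeCluster(8)
registerDoParallel(cl)

fitControl <- trainControl(method = "repeatedcv",
                           number = 5,
                           repeats = 3,
                           allowParallel = TRUE,  # parallelize across resamples
                           verboseIter = TRUE)

model <- train(y ~ .,                     # 'y' is a placeholder outcome column
               data = na.omit(training_data),
               method = "xgbTree",
               trControl = fitControl,
               nthread = 1)               # let the foreach workers own the cores

stopCluster(cl)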

If I would also like to do hyperparameter tuning, I would probably do the following:

model_function_xgboost <- function(data, outcome_col = 1) {

  set.seed(224)

  fitControl <- trainControl(
    ## 5-fold CV, repeated 3 times
    method = "repeatedcv",
    number = 5,
    repeats = 3,
    verboseIter = TRUE)

  tune_grid <- expand.grid(
    nrounds = seq(from = 50, to = 1000, by = 50),
    eta = c(0.025, 0.05, 0.1, 0.3),
    max_depth = c(2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
    gamma = 0,
    colsample_bytree = c(0.4, 0.6, 0.8, 1.0),
    min_child_weight = c(1, 2, 3),
    subsample = c(0.5, 0.75, 1.0)
  )

  ## The outcome column is hard-coded here as global_demand_cleaned.
  model <- train(global_demand_cleaned ~ .,
                 data = data %>% drop_na(),
                 method = "xgbTree",
                 trControl = fitControl,
                 tuneGrid = tune_grid,
                 nthread = 28)

  return(model)
}
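
For scale, that grid works out to a very large number of model fits before forecastML's multiple validation windows are even taken into account:

## 20 (nrounds) * 4 (eta) * 11 (max_depth) * 1 (gamma) * 4 (colsample_bytree) *
## 3 (min_child_weight) * 3 (subsample) = 31,680 parameter combinations.
## With 5-fold CV repeated 3 times, caret fits each combination 15 times:
nrow(tune_grid) * 5 * 3  # = 475,200 xgboost fits per call to the model function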

The only problem I am facing is that the training takes ages. I am not sure whether this is because the future library conflicts with caret or with the nthread parameter, or whether it is something else (I am using 28 cores, by the way).

Have you ever experienced a similar situation?

BR /Edgar

nredell commented 4 years ago

Not sure off the top of my head. My guess is that it's related to the size of the datasets being passed around. The train_model() function is a fairly thin wrapper. I'll take a look at the future.apply package, which is what I'm using internally, and see whether things could be sped up by explicitly identifying and passing global objects internally.
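
For reference, a rough sketch of the future-based setup, reusing the objects from the earlier sketch (the use_future argument and its exact behaviour may differ between forecastML versions, and xgboost's nthread inside the model function should probably be lowered so the two parallel layers do not fight over the same 28 cores):

library(future)

## One R worker per validation window/horizon; keep nthread small inside the
## model function so the workers do not oversubscribe the CPU.
plan(multisession, workers = 4)

model_results <- forecastML::train_model(data_train, windows,
                                         model_name = "xgb_caret",
                                         model_function = model_function_xgboost,
                                         use_future = TRUE)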

edgBR commented 4 years ago

Hi Nick,

Thanks for your great support. If you also know of another way of tuning the hyperparameters that works at a decent speed, I am all ears.

BR /Edgar

edgBR commented 4 years ago

Hi Nick, I found a fairly non-standard way to solve this problem.

I wrapped my training code in a Docker image and pushed it to Amazon ECR.

I used the BYOM (bring your own model) functionality of AWS SageMaker to train my models at scale.

I modified my training code to read the hyperparameters from the JSON file that SageMaker mounts under the /opt/ml directory.

I passed my hyperparameters to the estimator object and then to the tuner object.

I launched 50 HPO jobs with early stopping on single M5 large machines, and it finished in 2 hours.

If my company allows it, I will update the example notebooks.
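
The hyperparameter-parsing step is roughly the sketch below (assuming SageMaker's conventional hyperparameters.json location under /opt/ml/input/config/; SageMaker passes all hyperparameter values as strings, so they have to be converted before being handed to caret/xgboost):

library(jsonlite)

## SageMaker mounts the tuner's chosen hyperparameters as a JSON file of
## strings inside the training container.
hyperparams <- fromJSON("/opt/ml/input/config/hyperparameters.json")

tune_grid <- expand.grid(
  nrounds          = as.integer(hyperparams$nrounds),
  eta              = as.numeric(hyperparams$eta),
  max_depth        = as.integer(hyperparams$max_depth),
  gamma            = as.numeric(hyperparams$gamma),
  colsample_bytree = as.numeric(hyperparams$colsample_bytree),
  min_child_weight = as.numeric(hyperparams$min_child_weight),
  subsample        = as.numeric(hyperparams$subsample)
)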

BR /Edgar

edgBR commented 4 years ago

Closing the issue.

Disclaimer: if you do not have SageMaker or another tool that allows you to perform HPO in parallel (e.g. Katib), you will not be able to implement my solution.

BR /Edgar