topepo / caret

caret (Classification And Regression Training) R package that contains misc functions for training and plotting classification and regression models
http://topepo.github.io/caret/index.html

SOLVED and suggestion for improvement: Sample size problems during random search hyperparameter tuning #1281

Open KURolfKF opened 2 years ago

KURolfKF commented 2 years ago

Hello everyone!

I would like to apply the XGBoost algorithm to my data using the xgbTree method in caret.

My dataset is the Zoo dataset from the mlbench package.

library(caret)    # for trainControl() and train()
library(tibble)   # for as_tibble()
data(Zoo, package = "mlbench")
zooTib <- as_tibble(Zoo)

I would like to cover a relatively large parameter space, but using the random search method. That is, random combinations of the available parameter values should be drawn and evaluated on the data.

cv <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 5,
  search = "random"
  )

random_grid <- expand.grid(
  nrounds = 10,
  max_depth = seq(1,5, by = 1),
  eta = seq(0, 1, by = 0.1),
  gamma = seq(0, 5, by = 1),
  colsample_bytree = seq(0.5, 1, by = 0.1),
  min_child_weight = seq(1, 10, by = 1),
  subsample = seq(0.5, 1, by = 0.1)
  )

xgbTrained <- caret::train(
  type ~., 
  data = zooTib, 
  method = "xgbTree",
  trControl = cv,
  tuneGrid = random_grid,
  tuneLength = 100,
  metric = "Accuracy",
  verbose = 1
  )

However, I always get the following error message:

Error in sample.int(n = 1000000L, size = num_rs * nrow(trainInfo$loop) + : cannot take a sample larger than the population when 'replace = FALSE'

I am now wondering how I can fix this error, or rather how it happens in the first place and what it refers to.

My random_grid data frame has 118,800 rows, so drawing 100 of them should not be a problem at all. Therefore, I suspect the error is not caused by the grid itself.
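
For reference, here is where the 118,800 figure comes from; the grid is the full cross product of the parameter sequences defined above.

nrow(random_grid)            # 118800
1 * 5 * 11 * 6 * 6 * 10 * 6  # 118800 = nrounds x max_depth x eta x gamma x colsample_bytree x min_child_weight x subsample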

Furthermore, there are 101 observations in the dataset. With 10-fold CV, there should be about 90 observations in each training fold and about 10 in each test fold. I admit that this is not many observations, but that should not be the problem either. The methodology I have in mind is as follows (see the sketch after the list):

  1. Randomly choose one of the 118,800 available combinations.
  2. Fit the XGBoost model with this combination and evaluate it with five-times-repeated 10-fold cross-validation.
  3. Repeat this process tuneLength times.
  4. Choose the best of the 100 random combinations according to accuracy.
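
A rough sketch of this intended procedure (not caret internals; cv_accuracy is a hypothetical stand-in for "fit xgbTree with these parameters and return the mean accuracy over the repeated CV", stubbed here with a random number so the loop runs):

cv_accuracy <- function(params) runif(1)   # hypothetical stub for the resampling step

set.seed(1)
best <- list(acc = -Inf)
for (i in seq_len(100)) {                                  # tuneLength candidates
  params <- random_grid[sample(nrow(random_grid), 1), ]    # step 1: draw one combination
  acc    <- cv_accuracy(params)                            # steps 2-3: repeated 10-fold CV
  if (acc > best$acc) best <- list(params = params, acc = acc)   # step 4: keep the best
}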

After finding a StackOverflow post in which reducing this value was recommended, I reduced tuneLength from 100 to 1.

xgbTrained <- caret::train(
  type ~., 
  data = zooTib, 
  method = "xgbTree",
  trControl = cv,
  tuneGrid = random_grid,
  tuneLength = 1,
  metric = "Accuracy",
  verbose = 1
  )

However, this did not help either, as the same error occurred again.

Error in sample.int(n = 1000000L, size = num_rs * nrow(trainInfo$loop) + : cannot take a sample larger than the population when 'replace = FALSE'

Then I drastically reduced random_grid from 118,800 rows to 144, with tuneLength still set to 1:

random_grid <- expand.grid(
  nrounds = 10,
  max_depth = seq(1,5, by = 5),
  eta = seq(0, 1, by = 0.5),
  gamma = seq(0, 5, by = 5),
  colsample_bytree = seq(0.5, 1, by = 0.1),
  min_child_weight = seq(1, 10, by = 5),
  subsample = seq(0.5, 1, by = 0.5)
  )
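
Again for reference, the size of this reduced grid:

nrow(random_grid)          # 144
1 * 1 * 3 * 2 * 6 * 2 * 2  # 144 = nrounds x max_depth x eta x gamma x colsample_bytree x min_child_weight x subsample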

When I then re-ran the train() call from above, it actually worked without problems.

Now the question is why it suddenly worked. The random_grid has significantly fewer rows, but still more than the training set zooTib. So how many parameter combinations are too many? Why does the size of the random_grid data frame matter at all? It should just provide a pool of parameter combinations to choose from.

I would be very happy if someone could explain to me how this error arises, how I can fix it, and what to look for in future tuning strategies.

In any case, thank you very much in advance for your support!

######################################### SOLUTION #########################################

It seems that caret's train() function completely ignores the search = "random" setting in trainControl() when a grid is passed to the tuneGrid argument. In my case this caused a full grid search over all 118,800 parameter combinations.
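
Assuming the call shown in the error message reflects what happens internally (one RNG seed drawn per resample-by-tuning-row combination, without replacement, from a pool of 1,000,000 integers), the numbers explain both the failure with the full grid and the success with the reduced one:

num_rs <- 10 * 5      # 10-fold CV repeated 5 times = 50 resamples
num_rs * 118800 + 1   # 5,940,001 seeds requested for the full grid    -> more than 1,000,000, sample.int() fails
num_rs * 144 + 1      # 7,201 seeds requested for the reduced grid     -> fine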

A note in the documentation pointing out that the tuneGrid argument overrides the search argument whenever a grid is supplied to train() would be helpful. In other packages such as mlr you can pass a predefined tuning grid to the train function and request that a random search be performed over that grid; this does not seem to be possible in caret.
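
For anyone running into the same problem, here are two untested sketches of workarounds that appear to match the intended behaviour. The first drops tuneGrid entirely so that search = "random" together with tuneLength takes effect; the second emulates the mlr-style approach by sampling a subset of rows from the large predefined grid and passing only that subset to tuneGrid, which then amounts to a small grid search over randomly chosen combinations.

# Sketch 1: let caret generate its own random candidates (no tuneGrid supplied).
fit_random <- caret::train(
  type ~ .,
  data = zooTib,
  method = "xgbTree",
  trControl = trainControl(method = "repeatedcv", number = 10, repeats = 5,
                           search = "random"),
  tuneLength = 100,
  metric = "Accuracy"
  )

# Sketch 2: draw 100 random rows from the predefined grid (e.g. the large
# 118,800-row one) and tune over that subset only.
set.seed(123)
random_subset <- random_grid[sample(nrow(random_grid), 100), ]
fit_subset <- caret::train(
  type ~ .,
  data = zooTib,
  method = "xgbTree",
  trControl = trainControl(method = "repeatedcv", number = 10, repeats = 5),
  tuneGrid = random_subset,
  metric = "Accuracy"
  )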