Hello everyone!
I would like to apply the XGBoost algorithm to my data using the xgbTree method in caret. My dataset is the Zoo dataset from the mlbench package.
I would like to span a relatively large parameter space, but search it with the random search method. This means that a random combination is to be drawn from all available parameter values and applied to the data.
library(caret)

cv <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 5,
  search = "random"
)
random_grid <- expand.grid(
  nrounds = 10,
  max_depth = seq(1, 5, by = 1),
  eta = seq(0, 1, by = 0.1),
  gamma = seq(0, 5, by = 1),
  colsample_bytree = seq(0.5, 1, by = 0.1),
  min_child_weight = seq(1, 10, by = 1),
  subsample = seq(0.5, 1, by = 0.1)
)
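For reference, the size of this grid is simply the product of the number of values each hyper-parameter contributes, which can be checked in base R without materialising the grid:

```r
# number of values each constant/seq() in the expand.grid() call contributes
n_values <- c(
  nrounds          = 1,                                # just the value 10
  max_depth        = length(seq(1, 5, by = 1)),        # 5
  eta              = length(seq(0, 1, by = 0.1)),      # 11
  gamma            = length(seq(0, 5, by = 1)),        # 6
  colsample_bytree = length(seq(0.5, 1, by = 0.1)),    # 6
  min_child_weight = length(seq(1, 10, by = 1)),       # 10
  subsample        = length(seq(0.5, 1, by = 0.1))     # 6
)
prod(n_values)  # 118800, matching nrow(random_grid)
```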
xgbTrained <- caret::train(
  type ~ .,
  data = zooTib,
  method = "xgbTree",
  trControl = cv,
  tuneGrid = random_grid,
  tuneLength = 100,
  metric = "Accuracy",
  verbose = 1
)
However, I always get the following error message:
Error in sample.int(n = 1000000L, size = num_rs * nrow(trainInfo$loop) + : cannot take a sample larger than the population when 'replace = FALSE'
I am now wondering how I can fix this error, or rather how it arises in the first place and what it refers to.
My random_grid data frame has 118,800 rows, so drawing 100 of them should not be a problem at all. Therefore, I suspect the error is not caused by that.
Furthermore, there are 101 observations in the dataset. With a 10-fold CV, each fold should leave about 91 observations for training and 10 for testing. I admit that these are quite few observations, but this should not be the problem either. The methodology should be as follows:
1. Randomly choose one of the 118,800 available combinations.
2. Pass it to the XGBoost algorithm and evaluate it with a five-times-repeated 10-fold cross-validation.
3. Repeat this process tuneLength times.
4. Choose the best of the 100 random combinations according to accuracy.
After finding this StackOverflow post, I reduced tuneLength from 100 to 1, as reducing that value was recommended there.
However, this did not help either; the same error occurred again:
Error in sample.int(n = 1000000L, size = num_rs * nrow(trainInfo$loop) + : cannot take a sample larger than the population when 'replace = FALSE'
Then I reduced random_grid drastically, from 118,800 rows to 144, while keeping tuneLength at 1:
random_grid <- expand.grid(
  nrounds = 10,
  max_depth = seq(1, 5, by = 5),
  eta = seq(0, 1, by = 0.5),
  gamma = seq(0, 5, by = 5),
  colsample_bytree = seq(0.5, 1, by = 0.1),
  min_child_weight = seq(1, 10, by = 5),
  subsample = seq(0.5, 1, by = 0.5)
)
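The same product check confirms the 144 rows of the reduced grid. Note, as an aside, that seq(1, 5, by = 5) contains only the value 1, so max_depth is effectively fixed here:

```r
# combinations in the reduced grid: 1 * 1 * 3 * 2 * 6 * 2 * 2
prod(
  length(10),                       # nrounds: 1
  length(seq(1, 5, by = 5)),        # max_depth: 1 (only the value 1!)
  length(seq(0, 1, by = 0.5)),      # eta: 3
  length(seq(0, 5, by = 5)),        # gamma: 2
  length(seq(0.5, 1, by = 0.1)),    # colsample_bytree: 6
  length(seq(1, 10, by = 5)),       # min_child_weight: 2
  length(seq(0.5, 1, by = 0.5))     # subsample: 2
)  # 144
```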
When I then ran the train() call above again, it actually worked without problems.
Now the question is why it suddenly worked. random_grid has significantly fewer rows, but still more than zooTib has observations. So how many parameter combinations are too many? And why does the size of the random_grid data frame matter at all? It is supposed to be just a pool of parameter combinations to choose from.
I would be very happy if someone could explain to me how this error arises, how I can fix it, and what to look for in future tuning strategies.
In any case, thank you very much in advance for your support!
######################################### SOLUTION #########################################
It seems that caret's train() function simply ignores the search = "random" setting in trainControl() whenever a grid is passed to the tuneGrid argument. As a consequence, a full grid search over all 118,800 parameter combinations was attempted. As the error message hints, train() then tries to draw one random seed per resample and per grid row, without replacement, from a population of 1,000,000 integers: with 50 resamples (10 folds × 5 repeats) and 118,800 rows that sample is far larger than the population, whereas 50 × 144 = 7,200 seeds fit easily. This also explains why reducing tuneLength did not help: tuneLength is ignored as soon as tuneGrid is supplied.
There could be a note in the documentation pointing out that the tuneGrid argument overrules the search argument once a grid is handed to train(). In other packages such as mlr you can pass a predefined tuning grid to the train function and request that a random search be performed on it; this does not seem to work in caret.
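A minimal sketch of two ways to get an actual random search, assuming cv, random_grid, and zooTib from above (the train() calls themselves are shown commented out, since fitting requires the caret, xgboost, and mlbench data setup):

```r
set.seed(42)  # make the random draw reproducible

# Option 1: omit tuneGrid entirely. With search = "random" in trainControl(),
# tuneLength is the number of random candidates caret generates itself:
#   caret::train(type ~ ., data = zooTib, method = "xgbTree",
#                trControl = cv, tuneLength = 100, metric = "Accuracy")

# Option 2: keep the predefined grid, but draw the random subset yourself
# and pass only that subset as tuneGrid (tuneLength is then ignored):
random_subset <- random_grid[sample(nrow(random_grid), 100), ]
#   caret::train(type ~ ., data = zooTib, method = "xgbTree",
#                trControl = cv, tuneGrid = random_subset, metric = "Accuracy")
```

With Option 2 the resampling work drops from 118,800 candidates to 100, and the seed sample (50 × 100 seeds) stays far below the 1,000,000 limit.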