topepo / caret

caret (Classification And Regression Training) R package that contains misc functions for training and plotting classification and regression models
http://topepo.github.io/caret/index.html

grid of mtry values while training random forests with ranger #1290

Open abhicc opened 2 years ago

abhicc commented 2 years ago

Hello.

I am working with a subset of the 'Ames Housing' dataset that originally has 17 features. Using the 'recipes' package, I preprocessed the original features and created dummy variables for the nominal predictors with the following code. That resulted in 36 features in the 'baked_train' dataset below.

blueprint <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_nzv(Street, Utilities, Pool_Area, Screen_Porch, Misc_Val) %>%
  step_impute_knn(Gr_Liv_Area) %>%
  step_integer(Overall_Qual) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_other(Neighborhood, threshold = 0.01, other = "other") %>%
  step_dummy(all_nominal_predictors(), one_hot = FALSE)

prepare <- prep(blueprint, data = ames_train)

baked_train <- bake(prepare, new_data = ames_train)

baked_test <- bake(prepare, new_data = ames_test)

Now, I am trying to train random forests with the 'ranger' package using the following code.

cv_specs <- trainControl(method = "repeatedcv", number = 5, repeats = 5)

param_grid_rf <- expand.grid(mtry = seq(1, 36, 1), splitrule = "variance", min.node.size = 2)

rf_cv <- train(blueprint, data = ames_train, method = "ranger", trControl = cv_specs, tuneGrid = param_grid_rf, metric = "RMSE")

Notice that I have set the grid of 'mtry' values based on the number of features in the 'baked_train' data. It is my understanding that 'caret' will apply the blueprint within each resample of 'ames_train' creating a baked version at each CV step.

The text Hands-On Machine Learning with R by Boehmke & Greenwell says in Section 3.8.3,

Consequently, the goal is to develop our blueprint, then within each resample iteration we want to apply prep() and bake() to our resample training and validation data. Luckily, the caret package simplifies this process. We only need to specify the blueprint and caret will automatically prepare and bake within each resample.

However, when I run the code above I get an error,

mtry can not be larger than number of variables in data. Ranger will EXIT now.

I get the same error when I specify 'tuneLength = 20' instead of the 'tuneGrid'. However, the code works fine when the grid of 'mtry' values is specified to be from 1 to 17 (the number of features in the original training data 'ames_train').

Can you please point out what I am missing here? Specifically, why do I have to specify the number of features in 'ames_train' instead of 'baked_train' when essentially 'caret' is supposed to create a baked version before fitting and evaluating the model for each resample?

Thanks.

bappa10085 commented 2 years ago

The error simply says that mtry cannot be larger than the number of variables in the data. Your data contains 17 independent variables, as you mentioned, but you are searching for the optimal mtry over seq(1, 36, 1), that is, the integers 1 through 36, which exceeds the number of variables in your data (17). mtry should be <= the number of variables in the data, i.e. <= 17 in your case.
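One way to avoid hard-coding the upper bound is to derive the grid from the column count of whichever data frame the model actually sees. The sketch below is only an illustration in base R: `make_mtry_grid` is a hypothetical helper, not part of caret or ranger, and the toy data frame stands in for the real data. Based on what is reported in this thread (the grid works up to 17), the data frame to pass would be 'ames_train' rather than 'baked_train', but that is an assumption drawn from the observed behaviour, not from caret's documentation.

```r
# Hypothetical helper (not a caret/ranger function): build a tuning grid whose
# mtry values never exceed the number of predictor columns in `data`.
make_mtry_grid <- function(data, outcome,
                           splitrule = "variance",
                           min.node.size = 2) {
  # Count every column except the outcome as a predictor
  n_pred <- length(setdiff(names(data), outcome))
  expand.grid(
    mtry = seq_len(n_pred),
    splitrule = splitrule,
    min.node.size = min.node.size,
    stringsAsFactors = FALSE
  )
}

# Toy data standing in for the real training set: one outcome, two predictors
toy <- data.frame(y = c(1, 2, 3), a = c(4, 5, 6), b = c("u", "v", "w"))
toy_grid <- make_mtry_grid(toy, outcome = "y")
print(toy_grid)
```

With the data from the question, the call would look like `param_grid_rf <- make_mtry_grid(ames_train, outcome = "Sale_Price")`, and the result would be passed to `tuneGrid` as before.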