tidymodels / dials

Tools for creating tuning parameter values
https://dials.tidymodels.org/

grid_max_entropy needs a better error message for unfinalized parameters #99

Closed SewerynGrodny closed 4 years ago

SewerynGrodny commented 4 years ago

Hi, thanks for the great tidymodels packages. (Great job!) While training random forest models, I encountered an issue with tune grid and parameters. It seems that mtry() is not supported (case 2 in the code below). There is also a minor problem with the show_best function, which throws an error if there are NAs in .metric.

Best, Sewe

Reproducible example

# tidymodels
library(tidymodels)
cars_split = initial_split(mtcars)

car_recipe = recipe(mpg ~., data = training(cars_split)) %>% 
  step_center(all_numeric()) %>% 
  prep()

cars_cv_folds <- training(cars_split) %>% 
  bake(car_recipe, new_data = .) %>%
  vfold_cv(v = 5)

#case 1
# model
rf_model_cars = rand_forest(
  mode = "regression",
  min_n = tune()
) %>% 
  set_engine("ranger")

#params
rf_params_cars = parameters(min_n())
rf_grid_cars = grid_max_entropy(rf_params_cars, size = 20)

# tune
rf_stage_1_cv_results_tbl_oto = tune_grid(
  formula = mpg ~.,
  model = rf_model_cars,
  resamples = cars_cv_folds,
  grid = rf_grid_cars,
  metrics = metric_set(mae, mape, rmse, rsq),
  control = control_grid(verbose = TRUE)
)
# error because of NA
rf_stage_1_cv_results_tbl_oto %>% show_best()

rf_stage_1_cv_results_tbl_oto %>% unnest(.metrics) %>% 
  filter(.metric == "rsq")

# case 2
rf_model_cars = rand_forest(
  mode = "regression",
  min_n = tune(),
  mtry = tune()
) %>% 
  set_engine("ranger")

#params
rf_params_cars = parameters(mtry(), min_n())
#error 
rf_grid_cars = grid_max_entropy(rf_params_cars, size = 20)

topepo commented 4 years ago

mtry depends on the number of columns, so the upper end of its range cannot be set in advance. The finalize() method can set it if you pass in the predictors:

library(tidymodels)
#> ── Attaching packages ────────────────────────────────────────────────────────────────────────────────── tidymodels 0.0.4 ──
#> ✓ broom     0.5.4     ✓ recipes   0.1.9
#> ✓ dials     0.0.4     ✓ rsample   0.0.5
#> ✓ dplyr     0.8.4     ✓ tibble    2.1.3
#> ✓ ggplot2   3.2.1     ✓ tune      0.0.1
#> ✓ infer     0.5.1     ✓ workflows 0.1.0
#> ✓ parsnip   0.0.5     ✓ yardstick 0.0.5
#> ✓ purrr     0.3.3
#> ── Conflicts ───────────────────────────────────────────────────────────────────────────────────── tidymodels_conflicts() ──
#> x purrr::discard()    masks scales::discard()
#> x dplyr::filter()     masks stats::filter()
#> x dplyr::lag()        masks stats::lag()
#> x ggplot2::margin()   masks dials::margin()
#> x recipes::step()     masks stats::step()
#> x recipes::yj_trans() masks scales::yj_trans()

rf_params_cars = parameters(mtry(), min_n())
rf_params_cars
#> Collection of 2 parameters for tuning
#> 
#>     id parameter type object class
#>   mtry           mtry    nparam[?]
#>  min_n          min_n    nparam[+]
#> 
#> Parameters needing finalization:
#>    # Randomly Selected Predictors ('mtry')
#> 
#> See `?dials::finalize` or `?dials::update.parameters` for more information.

rf_params_cars <- 
  rf_params_cars %>% 
  update(mtry = finalize(mtry(), mtcars %>% select(-mpg)))
rf_params_cars
#> Collection of 2 parameters for tuning
#> 
#>     id parameter type object class
#>   mtry           mtry    nparam[+]
#>  min_n          min_n    nparam[+]

set.seed(131)
rf_grid_cars = grid_max_entropy(rf_params_cars, size = 3)
rf_grid_cars
#> # A tibble: 3 x 2
#>    mtry min_n
#>   <int> <int>
#> 1     4    34
#> 2     9    21
#> 3     2    16

Created on 2020-02-24 by the reprex package (v0.3.0)

We need a better error message though.
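Until the message improves, a check like the following can catch the problem before the grid function is called. This is only a sketch: check_finalized() is a hypothetical helper, not part of dials, and it assumes that has_unknowns() accepts each entry of the parameter set's object column.

```r
library(dials)

# Hypothetical helper (not part of dials): stop early and name every
# parameter whose range still contains unknown() values.
check_finalized <- function(params) {
  needs_finalizing <- vapply(params$object, has_unknowns, logical(1))
  if (any(needs_finalizing)) {
    stop(
      "These parameters need finalization before a grid can be made: ",
      paste(params$id[needs_finalizing], collapse = ", "),
      ". See ?dials::finalize.",
      call. = FALSE
    )
  }
  invisible(params)
}

# mtry() has no upper bound yet, so the check fails and names it.
rf_params <- parameters(mtry(), min_n())
msg <- tryCatch(check_finalized(rf_params), error = conditionMessage)
msg

# After finalizing mtry with the predictors (mtcars without mpg),
# the check passes silently.
finalized <- update(rf_params, mtry = finalize(mtry(), mtcars[, -1]))
check_finalized(finalized)
```

The helper returns the parameter set invisibly, so it can sit in a pipe just before grid_max_entropy().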

(edit - hit wrong key)

topepo commented 4 years ago

I'm going to move this to dials and update the title.

topepo commented 4 years ago

There is also minor problem with show_best function which throw an error if there are NA in .metric.

That's because of what the message (and the entries in the .notes column) tell you:

> rf_stage_1_cv_results_tbl_oto$.notes[[5]]$.notes
[1] "internal: A correlation computation is required, but `estimate` is constant 
and has 0 standard deviation, resulting in a divide by 0 error. `NA` will be 
returned."

This happens when a model predicts the same value for all samples.
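The same effect can be reproduced in base R, without any tidymodels code. The sketch below uses made-up numbers and assumes, as the note above suggests, that the metric is built on a correlation between truth and estimate.

```r
# Made-up values for illustration only.
truth    <- c(21.0, 22.8, 18.7, 14.3)
estimate <- rep(19.5, 4)  # a model that predicts one value for every sample

# The estimates have zero spread ...
estimate_sd <- sd(estimate)

# ... so the correlation divides by zero and comes back as NA.
# cor() also emits a warning here, which is suppressed for the demo.
r <- suppressWarnings(cor(truth, estimate))

estimate_sd
is.na(r)
```

Any metric derived from that correlation inherits the NA, which is why rsq can be missing for some resamples while rmse and mae are still computed.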

The main error in the code was the missing metric argument:

> rf_stage_1_cv_results_tbl_oto %>% show_best()
Error in check_metric_choice(metric, maximize) : 
  argument "metric" is missing, with no default
> rf_stage_1_cv_results_tbl_oto %>% show_best(metric = "rmse", maximize = FALSE)
# A tibble: 5 x 6
  min_n .metric .estimator  mean     n std_err
  <int> <chr>   <chr>      <dbl> <int>   <dbl>
1     2 rmse    standard    2.06     5   0.292
2     5 rmse    standard    2.24     5   0.286
3     7 rmse    standard    2.49     5   0.271
4     9 rmse    standard    2.65     5   0.247
5    11 rmse    standard    2.94     5   0.204