Closed Tadge-Analytics closed 1 year ago
I'm attempting to tune a series of models, and this error happens for some of them but not others. I can't see what the differentiating factor(s) between them would be. I'll keep looking, and I'm open to suggestions on where to look. PS: it's an xgboost model.
Hi @topepo and @juliasilge, I have created a smaller reprex. I hope this makes it easier for you to get a sense of what's going on.
library(tidyverse)
library(tidymodels)
library(xgboost)
#>
#> Attaching package: 'xgboost'
#> The following object is masked from 'package:dplyr':
#>
#> slice
library(doParallel)
#> Loading required package: foreach
#>
#> Attaching package: 'foreach'
#> The following objects are masked from 'package:purrr':
#>
#> accumulate, when
#> Loading required package: iterators
#> Loading required package: parallel
options(tidymodels.dark = TRUE)
###################################################################
data_import_prep <-
  read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv") %>%
  select(peak_id, year, season, sex, age, citizenship, hired, success, died) %>%
  mutate_if(is.character, factor) %>%
  mutate_if(is.logical, as.integer) %>%
  mutate(outcome = if_else(died == "TRUE", "Yes", "No") %>% factor(levels = c("Yes", "No"))) %>%
  select(-died)
#> Rows: 76519 Columns: 21
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (10): expedition_id, member_id, peak_id, peak_name, season, sex, citizen...
#> dbl (5): year, age, highpoint_metres, death_height_metres, injury_height_me...
#> lgl (6): hired, success, solo, oxygen_used, died, injured
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
###################################################################
model_spec <-
  boost_tree(
    trees = tune()
    , tree_depth = tune()
    , min_n = tune()
    , loss_reduction = tune()
    , sample_size = tune()
    , mtry = tune()
    , learn_rate = tune()
  ) %>%
  set_engine("xgboost") %>%
  set_mode("classification")
recipe_to_use <-
  recipe(outcome ~ ., data = data_import_prep) %>%
  step_impute_median(age) %>%
  step_other(peak_id, citizenship) %>%
  step_novel(all_nominal_predictors()) %>%
  step_unknown(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors(), one_hot = TRUE)
tuning_grid_size <- 5
tuning_grid <-
  grid_latin_hypercube(
    trees(range = c(500, 2000))
    , tree_depth()
    , min_n()
    , loss_reduction()
    , sample_prop()
    , finalize(mtry(), recipe_to_use %>% prep() %>% juice())
    , learn_rate(range = c(-4, -1))
    , size = tuning_grid_size
  )
set.seed(123)
data_name_folds <- vfold_cv(data_import_prep, strata = outcome)
initial_wf <-
  workflow() %>%
  add_recipe(recipe_to_use) %>%
  add_model(model_spec)
###################################################################
tictoc::tic()
cores <- parallel::detectCores(logical = FALSE)
cl <- makePSOCKcluster(cores-1)
registerDoParallel(cl)
# estimated time for tune_race_anova is 3-5 mins
# with the above parallelization on a 6 core CPU
# sorry :-)
set.seed(123)
tune_race_tuned_grid <-
  finetune::tune_race_anova(
    initial_wf,
    resamples = data_name_folds,
    grid = tuning_grid,
    metrics = metric_set(mn_log_loss),
    control = finetune::control_race(verbose = TRUE)
  )
parallel::stopCluster(cl)
tictoc::toc()
#> 239.67 sec elapsed
###################################################################
sim_anneal_iterations <- 2
set.seed(123)
anneal_tuned_grid <-
  finetune::tune_sim_anneal(
    initial_wf,
    param_info = initial_wf %>% extract_parameter_set_dials() %>% finalize(data_import_prep),
    resamples = data_name_folds,
    initial = tune_race_tuned_grid,
    metrics = metric_set(mn_log_loss),
    iter = sim_anneal_iterations
  )
#> Optimizing mn_log_loss
#> Initial best: 0.00031
#> Error in `.f()`:
#> ! Values should be on [0, 1].
#> ℹ This is an internal error that was detected in the dials package.
#> Please report it at <https://github.com/tidymodels/dials/issues> with a reprex (<https://https://tidyverse.org/help/>) and the full backtrace.
#> Backtrace:
#> ▆
#> 1. ├─finetune::tune_sim_anneal(...)
#> 2. ├─finetune:::tune_sim_anneal.workflow(...)
#> 3. │ └─finetune:::tune_sim_anneal_workflow(...)
#> 4. │ ├─... %>% ...
#> 5. │ └─finetune:::new_in_neighborhood(...)
#> 6. │ └─finetune:::random_integer_neighbor(...)
#> 7. │ └─finetune:::sample_by_distance(...)
#> 8. │ └─finetune:::encode_set_backwards(candidates, pset)
#> 9. │ └─purrr::map2(pset$object, x, dials::encode_unit, direction = "backward")
#> 10. │ ├─dials (local) .f(.x[[i]], .y[[i]], ...)
#> 11. │ └─dials:::encode_unit.quant_param(.x[[i]], .y[[i]], ...)
#> 12. │ └─rlang::abort("Values should be on [0, 1].", .internal = TRUE)
#> 13. └─dplyr::mutate(., .config = paste0("iter", i), .parent = current_parent)
#> ✖ Optimization stopped prematurely; returning current results.
Created on 2022-11-02 with reprex v2.0.2
Hi @tadge-analytics, thanks for taking the time to report this and provide a reprex!
From what I can tell, dials is doing what it is supposed to be doing here. At the point it breaks, it gets handed a parameter object (mtry) and a vector of values to transform back from [0, 1], but those values are larger than 1.
library(dials)
#> Loading required package: scales
p1 <- structure(list(type = "integer",
range = list(lower = 1L, upper = 9L),
inclusive = c(lower = TRUE, upper = TRUE),
trans = NULL,
label = c(mtry = "# Randomly Selected Predictors"),
finalize = NULL),
class = c("quant_param", "param"))
x1 <- structure(c(1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75,
# [more 1.75 values]
1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75), assign = 1:4)
dials::encode_unit(p1, x1, direction = "backward")
#> Error in `dials::encode_unit()`:
#> ! Values should be on [0, 1].
#> ℹ This is an internal error that was detected in the dials package.
#> Please report it at <https://github.com/tidymodels/dials/issues> with a reprex (<https://https://tidyverse.org/help/>) and the full backtrace.
#> Backtrace:
#> ▆
#> 1. ├─dials::encode_unit(p1, x1, direction = "backward")
#> 2. └─dials:::encode_unit.quant_param(p1, x1, direction = "backward") at dials/R/encode_unit.R:23:2
#> 3. └─rlang::abort("Values should be on [0, 1].", .internal = TRUE) at dials/R/encode_unit.R:50:6
Created on 2022-11-04 with reprex v2.0.2
So I had a closer look at mtry, trying to find a reason why the values could be off.
I did notice that you finalize mtry with two different datasets in your reprex, thus leading to different ranges:
- For tuning_grid, the grid for the anova race, it's recipe_to_use %>% prep() %>% juice(), which leads to a range of [1, 31] for the parameter.
- For finetune::tune_sim_anneal() itself, you use data_import_prep, which leads to a range of [1, 9].

finetune::tune_sim_anneal() gets passed initial tune results which use the wider range, along with the param_info which uses the smaller range, and then breaks.
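To make the range mismatch concrete, here is a rough base R sketch. It assumes that unit encoding is a plain linear rescaling of the parameter range; the helper names encode_forward and encode_backward are hypothetical and are not dials functions. Under that assumption, the 1.75 seen in the failing call is what a value of 15 would encode to against the narrower [1, 9] range, although the thread does not say which grid value was actually involved.

```r
# Assumed linear rescaling between a parameter range and the unit interval.
# These helpers are illustrative only; they are not part of dials.
encode_forward <- function(value, lower, upper) {
  (value - lower) / (upper - lower)
}
encode_backward <- function(unit, lower, upper) {
  lower + unit * (upper - lower)
}

# A value inside the narrower [1, 9] range maps into [0, 1]
inside <- encode_forward(7, lower = 1, upper = 9)

# A value that is valid under [1, 31] but not under [1, 9] maps above 1
outside <- encode_forward(15, lower = 1, upper = 9)

# Round trip for the in-range value recovers the original
round_trip <- encode_backward(inside, lower = 1, upper = 9)

print(c(inside = inside, outside = outside, round_trip = round_trip))
```

Any candidate taken from the wider range that exceeds 9 ends up above 1 once it is encoded against [1, 9], which is consistent with the "Values should be on [0, 1]" error.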
If you use recipe_to_use %>% prep() %>% juice() to finalize mtry for the param_info arg of finetune::tune_sim_anneal(), it works.
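Why the two finalization inputs give different ranges can be sketched with dials alone. This is a minimal illustration, assuming two small hypothetical data frames: raw stands in for data_import_prep (predictors before dummy encoding) and expanded stands in for the prepped recipe output (one column per factor level). Neither is taken from the reprex.

```r
library(dials)

# Hypothetical stand-in for the raw data: one numeric and one factor predictor
raw <- data.frame(
  x = c(1, 2, 3, 4),
  g = factor(c("a", "b", "c", "a"))
)

# Hypothetical stand-in for the prepped recipe output: the factor is
# expanded into one indicator column per level
expanded <- data.frame(
  x   = c(1, 2, 3, 4),
  g_a = c(1, 0, 0, 1),
  g_b = c(0, 1, 0, 0),
  g_c = c(0, 0, 1, 0)
)

# finalize() sets the upper bound of mtry from the number of columns
range_raw <- range_get(finalize(mtry(), raw))
range_expanded <- range_get(finalize(mtry(), expanded))

print(range_raw$upper)
print(range_expanded$upper)
```

The upper bound follows the column count of whatever data is passed in, so the grid and param_info only agree when both are finalized against the same columns.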
@topepo is this (the situation described in the comment above) to be expected, or should this work? If it is expected, maybe we can catch that error more elegantly in finetune? See below for a smaller reprex.
library(tidymodels)
set.seed(1)
rf_spec <- rand_forest(mode = "regression", mtry = tune())
grid_with_bigger_range <- grid_latin_hypercube(mtry(range = c(1, 16)))
car_folds <- vfold_cv(car_prices, v = 2)
car_wflow <- workflow() %>%
  add_formula(Price ~ .) %>%
  add_model(rf_spec)
tune_res_with_bigger_range <- tune_grid(
  car_wflow,
  resamples = car_folds,
  grid = grid_with_bigger_range
)
parameter_set_with_smaller_range <- parameters(mtry(range = c(1, 5)))
finetune::tune_sim_anneal(
  car_wflow,
  param_info = parameter_set_with_smaller_range,
  resamples = car_folds,
  initial = tune_res_with_bigger_range,
  iter = 2
)
#> Optimizing rmse
#> Initial best: 2570.90000
#> Error in `.f()`:
#> ! Values should be on [0, 1].
#> ℹ This is an internal error that was detected in the dials package.
#> Please report it at <https://github.com/tidymodels/dials/issues> with a reprex (<https://https://tidyverse.org/help/>) and the full backtrace.
#> Backtrace:
#> ▆
#> 1. ├─finetune::tune_sim_anneal(...)
#> 2. ├─finetune:::tune_sim_anneal.workflow(...)
#> 3. │ └─finetune:::tune_sim_anneal_workflow(...)
#> 4. │ ├─... %>% ...
#> 5. │ └─finetune:::new_in_neighborhood(...)
#> 6. │ └─finetune:::random_integer_neighbor(...)
#> 7. │ └─finetune:::sample_by_distance(...)
#> 8. │ └─finetune:::encode_set_backwards(candidates, pset)
#> 9. │ └─purrr::map2(pset$object, x, dials::encode_unit, direction = "backward")
#> 10. │ ├─dials (local) .f(.x[[1L]], .y[[1L]], ...)
#> 11. │ └─dials:::encode_unit.quant_param(.x[[1L]], .y[[1L]], ...) at dials/R/encode_unit.R:23:2
#> 12. │ └─rlang::abort("Values should be on [0, 1].", .internal = TRUE) at dials/R/encode_unit.R:50:6
#> 13. └─dplyr::mutate(., .config = paste0("iter", i), .parent = current_parent)
#> ✖ Optimization stopped prematurely; returning current results.
#> # Tuning results
#> # 2-fold cross-validation
#> # A tibble: 2 × 5
#> splits id .metrics .notes .iter
#> <list> <chr> <list> <list> <int>
#> 1 <split [402/402]> Fold1 <tibble [6 × 5]> <tibble [0 × 3]> 0
#> 2 <split [402/402]> Fold2 <tibble [6 × 5]> <tibble [0 × 3]> 0
Created on 2022-11-04 with reprex v2.0.2
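As one idea of what catching this more elegantly could look like, here is a base R sketch of an up-front check. The function check_values_in_range is hypothetical and is not part of finetune or dials; it only illustrates comparing the values in the initial results against the range given in param_info before any encoding happens.

```r
# Hypothetical helper, not part of finetune or dials: fail early with an
# informative message when initial values fall outside the supplied range
check_values_in_range <- function(values, lower, upper, name = "parameter") {
  bad <- values[!is.na(values) & (values < lower | values > upper)]
  if (length(bad) > 0) {
    stop(
      sprintf(
        "Initial results contain values of '%s' outside the range [%s, %s] given in `param_info`: %s",
        name, lower, upper, paste(unique(bad), collapse = ", ")
      ),
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# Values within [1, 5] pass silently
ok <- check_values_in_range(c(2, 4, 5), lower = 1, upper = 5, name = "mtry")

# Values from a wider grid trigger the informative error
msg <- tryCatch(
  check_values_in_range(c(3, 12, 16), lower = 1, upper = 5, name = "mtry"),
  error = function(e) conditionMessage(e)
)
print(msg)
```

A check like this would point at the mismatched ranges directly instead of surfacing as an internal dials error.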
Thanks heaps @hfrick, I'm looking forward to trying this suggestion out, and yes, I suspect it will work correctly with the change you recommended. Looking back now, I can't remember why I used the latter finalize method instead of the former. I probably copied and pasted some code.
Thanks for that @hfrick, I think my error was even simpler. Basically, I was working from the following blog post: https://uliniemann.com/blog/2022-07-04-comparing-hyperparameter-tuning-strategies-with-tidymodels/
Inside I saw the following:
What I really needed to do was the following:
and then:
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.
Hi there @topepo,
I get the following error when I run:
finetune::tune_sim_anneal(
  initial_wf,
  param_info = initial_wf %>% extract_parameter_set_dials() %>% finalize(tsr_data),
  resamples = data_name_folds,
  initial = pre_existing_tuned_grid,
  metrics = metric_set(roc_auc, mn_log_loss),
  iter = sim_anneal_iterations
)