tidymodels / dials

Tools for creating tuning parameter values
https://dials.tidymodels.org/

Error in `.f()`: ! Values should be on [0, 1]. #258

Closed Tadge-Analytics closed 1 year ago

Tadge-Analytics commented 1 year ago

Hi there @topepo,

I get the following error when I run:

finetune::tune_sim_anneal(
  initial_wf,
  param_info = initial_wf %>% extract_parameter_set_dials() %>% finalize(tsr_data),
  resamples = data_name_folds,
  initial = pre_existing_tuned_grid,
  metrics = metric_set(roc_auc, mn_log_loss),
  iter = sim_anneal_iterations
)

[image not preserved]

Tadge-Analytics commented 1 year ago

I'm actually attempting to tune a series of models, and this error happens for only some of them. I can't see what the differentiating factor(s) between them would be. I'll keep looking, and I'm open to suggestions on where to look. P.S. It's an xgboost model.

Tadge-Analytics commented 1 year ago

Hi @topepo and @juliasilge, I have created a smaller reprex. I hope this makes it easier for you to get a sense of what's going on.

library(tidyverse)
library(tidymodels)
library(xgboost)
#> 
#> Attaching package: 'xgboost'
#> The following object is masked from 'package:dplyr':
#> 
#>     slice
library(doParallel)
#> Loading required package: foreach
#> 
#> Attaching package: 'foreach'
#> The following objects are masked from 'package:purrr':
#> 
#>     accumulate, when
#> Loading required package: iterators
#> Loading required package: parallel

options(tidymodels.dark = TRUE)

###################################################################

data_import_prep <- 
  read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv") %>% 
  select(peak_id, year, season, sex, age, citizenship, hired, success, died) %>% 
  mutate_if(is.character, factor) %>%
  mutate_if(is.logical, as.integer) %>% 
  mutate(outcome = if_else(died == "TRUE", "Yes", "No") %>% factor(levels = c("Yes", "No"))) %>% 
  select(-died)
#> Rows: 76519 Columns: 21
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (10): expedition_id, member_id, peak_id, peak_name, season, sex, citizen...
#> dbl  (5): year, age, highpoint_metres, death_height_metres, injury_height_me...
#> lgl  (6): hired, success, solo, oxygen_used, died, injured
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

###################################################################

model_spec <- 
  boost_tree(
    trees = tune()
    , tree_depth = tune()
    , min_n = tune()
    , loss_reduction = tune()
    , sample_size = tune() 
    , mtry = tune()
    , learn_rate = tune()
  ) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

recipe_to_use <- 
  recipe(outcome ~ ., data = data_import_prep) %>% 
  step_impute_median(age) %>%
  step_other(peak_id, citizenship) %>%
  step_novel(all_nominal_predictors()) %>%
  step_unknown(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors(), one_hot = TRUE) 

tuning_grid_size <- 5

tuning_grid <- 
  grid_latin_hypercube(
    trees(range = c(500, 2000))
    , tree_depth()
    , min_n()
    , loss_reduction()
    , sample_prop() 
    , finalize(mtry(), recipe_to_use %>% prep() %>% juice())
    , learn_rate(range = c(-4, -1))
    , size = tuning_grid_size
  )

set.seed(123)

data_name_folds <- vfold_cv(data_import_prep, strata = outcome)

initial_wf <-
  workflow() %>% 
  add_recipe(recipe_to_use) %>% 
  add_model(model_spec)

###################################################################

tictoc::tic()

cores <- parallel::detectCores(logical = FALSE)
cl <- makePSOCKcluster(cores-1)
registerDoParallel(cl)

# estimated time for tune_race_anova is 3-5 mins
# with the above parallelization on a 6 core CPU
# sorry :-)

set.seed(123)

tune_race_tuned_grid <- 
  finetune::tune_race_anova(
    initial_wf,
    resamples = data_name_folds,
    grid = tuning_grid,
    metrics = metric_set(mn_log_loss),
    control = finetune::control_race(verbose = TRUE)
  )

parallel::stopCluster(cl)

tictoc::toc()
#> 239.67 sec elapsed

###################################################################

sim_anneal_iterations <- 2

set.seed(123)

anneal_tuned_grid <- 
  finetune::tune_sim_anneal(
    initial_wf,
    param_info = initial_wf %>% extract_parameter_set_dials() %>% finalize(data_import_prep),
    resamples = data_name_folds,
    initial = tune_race_tuned_grid,
    metrics = metric_set(mn_log_loss),
    iter = sim_anneal_iterations)
#> Optimizing mn_log_loss
#> Initial best: 0.00031
#> Error in `.f()`:
#> ! Values should be on [0, 1].
#> ℹ This is an internal error that was detected in the dials package.
#>   Please report it at <https://github.com/tidymodels/dials/issues> with a reprex (<https://https://tidyverse.org/help/>) and the full backtrace.

#> Backtrace:
#>      ▆
#>   1. ├─finetune::tune_sim_anneal(...)
#>   2. ├─finetune:::tune_sim_anneal.workflow(...)
#>   3. │ └─finetune:::tune_sim_anneal_workflow(...)
#>   4. │   ├─... %>% ...
#>   5. │   └─finetune:::new_in_neighborhood(...)
#>   6. │     └─finetune:::random_integer_neighbor(...)
#>   7. │       └─finetune:::sample_by_distance(...)
#>   8. │         └─finetune:::encode_set_backwards(candidates, pset)
#>   9. │           └─purrr::map2(pset$object, x, dials::encode_unit, direction = "backward")
#>  10. │             ├─dials (local) .f(.x[[i]], .y[[i]], ...)
#>  11. │             └─dials:::encode_unit.quant_param(.x[[i]], .y[[i]], ...)
#>  12. │               └─rlang::abort("Values should be on [0, 1].", .internal = TRUE)
#>  13. └─dplyr::mutate(., .config = paste0("iter", i), .parent = current_parent)
#> ✖ Optimization stopped prematurely; returning current results.

Created on 2022-11-02 with reprex v2.0.2

hfrick commented 1 year ago

Hi @tadge-analytics, thanks for taking the time to report this and provide a reprex!

From what I can tell, dials is doing what it is supposed to be doing here. At the point where it breaks, it gets handed a parameter object (mtry) and a set of values to transform back from [0, 1], but those values are larger than 1.

library(dials)
#> Loading required package: scales

p1 <- structure(list(type = "integer",
                     range = list(lower = 1L, upper = 9L), 
                     inclusive = c(lower = TRUE, upper = TRUE), 
                     trans = NULL, 
                     label = c(mtry = "# Randomly Selected Predictors"), 
                     finalize = NULL), 
                class = c("quant_param", "param"))

x1 <- structure(c(1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 
                  # [more 1.75 values]
                  1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75), assign = 1:4)

dials::encode_unit(p1, x1, direction = "backward")
#> Error in `dials::encode_unit()`:
#> ! Values should be on [0, 1].
#> ℹ This is an internal error that was detected in the dials package.
#>   Please report it at <https://github.com/tidymodels/dials/issues> with a reprex (<https://https://tidyverse.org/help/>) and the full backtrace.

#> Backtrace:
#>     ▆
#>  1. ├─dials::encode_unit(p1, x1, direction = "backward")
#>  2. └─dials:::encode_unit.quant_param(p1, x1, direction = "backward") at dials/R/encode_unit.R:23:2
#>  3.   └─rlang::abort("Values should be on [0, 1].", .internal = TRUE) at dials/R/encode_unit.R:50:6

Created on 2022-11-04 with reprex v2.0.2
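
The 1.75 in the example above is consistent with a linear rescaling of the parameter range onto [0, 1]. The following is a minimal base R sketch, assuming the encoding is (value - lower) / (upper - lower). The helpers `to_unit()` and `from_unit()` are hypothetical stand-ins written for illustration, not the dials source, and the value 15 is just one value that happens to map to 1.75 for the range [1, 9].

```r
# Hypothetical helpers, assuming a linear rescaling between the parameter
# range and [0, 1]; this is a sketch, not the dials implementation.
to_unit <- function(value, lower, upper) {
  (value - lower) / (upper - lower)
}

from_unit <- function(unit_value, lower, upper) {
  # Mirror the kind of check that raises the error in this issue.
  if (any(unit_value < 0 | unit_value > 1, na.rm = TRUE)) {
    stop("Values should be on [0, 1].")
  }
  unit_value * (upper - lower) + lower
}

# A value inside the range used for encoding round-trips cleanly.
in_range <- to_unit(5, lower = 1, upper = 9)       # 0.5
from_unit(in_range, lower = 1, upper = 9)          # 5

# A value from a wider grid (for example mtry = 15), encoded against the
# narrower range [1, 9], falls outside [0, 1] and cannot be decoded.
out_of_range <- to_unit(15, lower = 1, upper = 9)  # 1.75
decode_attempt <- tryCatch(
  from_unit(out_of_range, lower = 1, upper = 9),
  error = function(e) conditionMessage(e)
)
```

Under that assumption, any candidate larger than the upper bound used for encoding ends up above 1, which matches the failure seen here.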

hfrick commented 1 year ago

So I had a closer look at mtry, trying to find a reason why the values could be off.

I did notice that you finalize mtry with two different datasets in your reprex, which leads to different ranges: the tuning grid uses recipe_to_use %>% prep() %>% juice(), while param_info uses data_import_prep.

finetune::tune_sim_anneal() gets passed initial tune results, which use the wider range, along with the param_info, which uses the smaller range, and then breaks.

If you use recipe_to_use %>% prep() %>% juice() to finalize mtry for the param_info arg of finetune::tune_sim_anneal(), it works.
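
To illustrate how the data passed to finalize() changes the range, here is a minimal sketch that uses `mtcars` as a stand-in dataset (an assumption for illustration; it is not part of the reprex). finalize() sets the upper bound of mtry() from the number of columns in the data it is given, so data with fewer columns yields a narrower range.

```r
library(dials)

# mtcars stands in for the two datasets; the column subset is arbitrary.
wide_data   <- mtcars          # 11 columns
narrow_data <- mtcars[, 1:5]   # 5 columns

mtry_wide   <- finalize(mtry(), wide_data)
mtry_narrow <- finalize(mtry(), narrow_data)

# The upper bound follows the number of columns in each dataset.
upper_wide   <- range_get(mtry_wide)$upper
upper_narrow <- range_get(mtry_narrow)$upper

c(wide = upper_wide, narrow = upper_narrow)
```

Finalizing both the grid and param_info with the same data keeps the two ranges aligned.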

hfrick commented 1 year ago

@topepo is this (i.e., the situation described in the comment above) to be expected, or should this work? If it's to be expected, maybe we can catch that error more elegantly in finetune? See below for a smaller reprex.

library(tidymodels)

set.seed(1)

rf_spec <- rand_forest(mode = "regression", mtry = tune())

grid_with_bigger_range <- grid_latin_hypercube(mtry(range = c(1, 16)))

car_folds <- vfold_cv(car_prices, v = 2)

car_wflow <- workflow() %>% 
  add_formula(Price ~ .) %>% 
  add_model(rf_spec)

tune_res_with_bigger_range <- tune_grid(
  car_wflow, 
  resamples = car_folds,
  grid = grid_with_bigger_range
)

parameter_set_with_smaller_range <- parameters(mtry(range = c(1, 5)))

finetune::tune_sim_anneal(
  car_wflow,
  param_info = parameter_set_with_smaller_range,
  resamples = car_folds,
  initial = tune_res_with_bigger_range,
  iter = 2
)
#> Optimizing rmse
#> Initial best: 2570.90000
#> Error in `.f()`:
#> ! Values should be on [0, 1].
#> ℹ This is an internal error that was detected in the dials package.
#>   Please report it at <https://github.com/tidymodels/dials/issues> with a reprex (<https://https://tidyverse.org/help/>) and the full backtrace.

#> Backtrace:
#>      ▆
#>   1. ├─finetune::tune_sim_anneal(...)
#>   2. ├─finetune:::tune_sim_anneal.workflow(...)
#>   3. │ └─finetune:::tune_sim_anneal_workflow(...)
#>   4. │   ├─... %>% ...
#>   5. │   └─finetune:::new_in_neighborhood(...)
#>   6. │     └─finetune:::random_integer_neighbor(...)
#>   7. │       └─finetune:::sample_by_distance(...)
#>   8. │         └─finetune:::encode_set_backwards(candidates, pset)
#>   9. │           └─purrr::map2(pset$object, x, dials::encode_unit, direction = "backward")
#>  10. │             ├─dials (local) .f(.x[[1L]], .y[[1L]], ...)
#>  11. │             └─dials:::encode_unit.quant_param(.x[[1L]], .y[[1L]], ...) at dials/R/encode_unit.R:23:2
#>  12. │               └─rlang::abort("Values should be on [0, 1].", .internal = TRUE) at dials/R/encode_unit.R:50:6
#>  13. └─dplyr::mutate(., .config = paste0("iter", i), .parent = current_parent)
#> ✖ Optimization stopped prematurely; returning current results.
#> # Tuning results
#> # 2-fold cross-validation 
#> # A tibble: 2 × 5
#>   splits            id    .metrics         .notes           .iter
#>   <list>            <chr> <list>           <list>           <int>
#> 1 <split [402/402]> Fold1 <tibble [6 × 5]> <tibble [0 × 3]>     0
#> 2 <split [402/402]> Fold2 <tibble [6 × 5]> <tibble [0 × 3]>     0

Created on 2022-11-04 with reprex v2.0.2
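
One way such a mismatch could be caught earlier is a range check before the search starts. The sketch below is hypothetical and written in base R: `check_initial_in_range()` and its arguments are invented for illustration and are not part of finetune or dials. It assumes the initial parameter values are available as a data frame and the allowed ranges as a named list of lower/upper pairs.

```r
# Hypothetical pre-check, not part of finetune: stop with a readable
# message when initial values fall outside the ranges in param_info.
check_initial_in_range <- function(initial_values, ranges) {
  for (param in names(ranges)) {
    values <- initial_values[[param]]
    lower <- ranges[[param]][["lower"]]
    upper <- ranges[[param]][["upper"]]
    outside <- values[!is.na(values) & (values < lower | values > upper)]
    if (length(outside) > 0) {
      stop(
        "Initial values for '", param, "' (",
        paste(sort(unique(outside)), collapse = ", "),
        ") fall outside the param_info range [", lower, ", ", upper, "].",
        call. = FALSE
      )
    }
  }
  invisible(TRUE)
}

# Made-up values mimicking the small reprex: a grid built on [1, 16]
# checked against a param_info restricted to [1, 5].
initial_values <- data.frame(mtry = c(2, 7, 16))
ranges <- list(mtry = c(lower = 1, upper = 5))

check_message <- tryCatch(
  check_initial_in_range(initial_values, ranges),
  error = function(e) conditionMessage(e)
)
check_message
```

A check along these lines would point at the offending parameter and values instead of surfacing an internal dials error.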

Tadge-Analytics commented 1 year ago

Thanks heaps @hfrick, I'm looking forward to trying this suggestion out, and yes, I suspect it will work correctly with the change you recommended. Looking back now, I can't remember why I used the latter finalise method instead of the former. I probably copied and pasted some code.

Tadge-Analytics commented 1 year ago

Thanks for that @hfrick, I think my error was even simpler. Basically, I was working from the following blog post: https://uliniemann.com/blog/2022-07-04-comparing-hyperparameter-tuning-strategies-with-tidymodels/

Inside I saw the following:

[image not preserved]

What I really needed to do was the following:

[image not preserved]

and then: [image not preserved]
