tidymodels / dials

Tools for creating tuning parameter values
https://dials.tidymodels.org/

FR: When `levels` arg (of `grid_regular`) is a vector, can documentation specify that the order of the vector elements implicitly corresponds to the order of the parameters in the parameter object being passed #105

Closed (ttzhou closed 4 years ago)

ttzhou commented 4 years ago

Reprex below:

library(tidymodels)
#> Registered S3 method overwritten by 'xts':
#>   method     from
#>   as.zoo.xts zoo
#> ── Attaching packages ────────────────────────────────────── tidymodels 0.1.0 ──
#> ✔ broom     0.5.4          ✔ recipes   0.1.9     
#> ✔ dials     0.0.5          ✔ rsample   0.0.5     
#> ✔ dplyr     0.8.5          ✔ tibble    3.0.0     
#> ✔ ggplot2   3.2.1          ✔ tune      0.0.1     
#> ✔ infer     0.5.1          ✔ workflows 0.1.0     
#> ✔ parsnip   0.0.5          ✔ yardstick 0.0.4.9000
#> ✔ purrr     0.3.3
#> ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
#> ✖ purrr::discard()    masks scales::discard()
#> ✖ dplyr::filter()     masks stats::filter()
#> ✖ dplyr::lag()        masks stats::lag()
#> ✖ ggplot2::margin()   masks dials::margin()
#> ✖ recipes::step()     masks stats::step()
#> ✖ recipes::yj_trans() masks scales::yj_trans()

bt_eng <-
  boost_tree("classification") %>%
  set_engine("xgboost") %>%
  set_args(
    mtry = tune(),
    sample_size = tune(),
    trees = tune(),
    tree_depth = tune(),
    min_n = tune(),
    loss_reduction = tune(),
    learn_rate = tune()
  )

bt_workflow <- workflow() %>%
  add_model(bt_eng)

bt_params <- bt_workflow %>%
  parameters() %>%
  update(
    mtry = mtry(c(1L, 1L)),
    sample_size = sample_prop(c(1, 1)),
    trees = trees(c(100L, 100L)),
    tree_depth = tree_depth(c(2L, 2L)),
    min_n = min_n(c(10L, 10L)),
    loss_reduction = loss_reduction(c(0, 0), trans = NULL),
    learn_rate = learn_rate(c(0.1, 0.2), trans = NULL)
  ) %>%
  finalize

bt_eng$args %>% names
#> [1] "mtry"           "trees"          "min_n"          "tree_depth"    
#> [5] "learn_rate"     "loss_reduction" "sample_size"

# Here I'm matching the order of the
# entries in `levels` arg to that
# of what was set in update above, not
# what matches the order of the args
# when running `bt_eng$args %>% names`
bt_params %>%
  grid_regular(
    levels = c(
      1, # supposed to represent mtry
      1, # supposed to represent sample_size
      1, # supposed to represent trees
      1, # supposed to represent tree_depth
      1, # supposed to represent min_n
      1, # supposed to represent loss_reduction
      2  # supposed to represent learn_rate
    )
  )
#> # A tibble: 2 x 7
#>    mtry trees min_n tree_depth learn_rate loss_reduction sample_size
#>   <int> <int> <int>      <int>      <dbl>          <dbl>       <dbl>
#> 1     1   100    10          2        0.1              0           1
#> 2     1   100    10          2        0.1              0           1

bt_params %>%
  grid_regular(
    levels = c(
      1, # mtry
      1, # trees
      1, # min_n
      1, # tree_depth
      2, # learn_rate
      1, # loss_reduction
      1  # sample_size
    )
  )
#> # A tibble: 2 x 7
#>    mtry trees min_n tree_depth learn_rate loss_reduction sample_size
#>   <int> <int> <int>      <int>      <dbl>          <dbl>       <dbl>
#> 1     1   100    10          2        0.1              0           1
#> 2     1   100    10          2        0.2              0           1

Note that the output of the first grid_regular call does have two rows, but they are duplicates, and that the columns are arranged in the order given by bt_eng$args %>% names.

If I pass the levels arg vector in the order matching bt_eng$args %>% names, then I get the expected grid (the last chunk of code).

Would it make sense to have a note in the documentation about this?

Alternatively, allow passing a named vector to levels, with the names indicating which parameter each integer applies to...

Thanks (and stay healthy).

EDIT: I took out the loading of fastpipe from the reprex; it makes no difference and just clutters more space.

ttzhou commented 4 years ago

To add: if the parameters are specified via a list (rather than by passing the workflow to parameters), then the order is maintained, i.e. the first call to grid_regular in the reprex above works as intended.

topepo commented 4 years ago

A named vector would be preferable to a list and should achieve the same goal.

ttzhou commented 4 years ago

> A named vector would be preferable to a list and should achieve the same goal.

Sorry, my comment was unclear. I meant:

bt_params <- list(
    mtry = mtry(c(1L, 1L)),
    sample_size = sample_prop(c(1, 1)),
    trees = trees(c(100L, 100L)),
    tree_depth = tree_depth(c(2L, 2L)),
    min_n = min_n(c(10L, 10L)),
    loss_reduction = loss_reduction(c(0, 0), trans = NULL),
    learn_rate = learn_rate(c(0.1, 0.2), trans = NULL)
  ) %>%
  parameters %>%
  finalize

rather than passing a list to the levels arg...

topepo commented 4 years ago

First, that's what update() does. Second, you would not put parameter values that are constant in the grid. Those would be fixed in the model specification (since they are not being tuned):

library(tidymodels)
#> ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────── tidymodels 0.1.0 ──
#> ✓ broom     0.5.4      ✓ recipes   0.1.10
#> ✓ dials     0.0.6      ✓ rsample   0.0.6 
#> ✓ dplyr     0.8.5      ✓ tibble    3.0.0 
#> ✓ ggplot2   3.3.0      ✓ tune      0.1.0 
#> ✓ infer     0.5.1      ✓ workflows 0.1.0 
#> ✓ parsnip   0.0.5      ✓ yardstick 0.0.5 
#> ✓ purrr     0.3.3
#> ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────── tidymodels_conflicts() ──
#> x purrr::discard()  masks scales::discard()
#> x dplyr::filter()   masks stats::filter()
#> x dplyr::lag()      masks stats::lag()
#> x ggplot2::margin() masks dials::margin()
#> x recipes::step()   masks stats::step()

bt_eng <-
  boost_tree("classification") %>%
  set_engine("xgboost") %>%
  set_args(
    mtry = 1,
    sample_size = 1,
    trees = 100,
    tree_depth = 2,
    min_n = 10,
    loss_reduction = 0,
    learn_rate = tune()
  )

bt_workflow <- workflow() %>%
  add_model(bt_eng)

bt_params <- bt_workflow %>%
  parameters() %>%
  update(
    learn_rate = learn_rate(c(0.1, 0.2), trans = NULL)
  ) %>%
  grid_regular()
bt_params
#> # A tibble: 3 x 1
#>   learn_rate
#>        <dbl>
#> 1       0.1 
#> 2       0.15
#> 3       0.2

Created on 2020-04-02 by the reprex package (v0.3.0)

ttzhou commented 4 years ago

Ah, I just chose a constant value for the example so we could see enough of the resulting grid. Also, I changed the issue name to align more closely with what I was trying to illustrate.

The use case I'm thinking of is specifying a fixed (non-random) number of values for each of the parameters to be tuned, where that number could vary between parameters, e.g. 2 for mtry and 3 for min_n.

I could achieve this using expand.grid, but it seems more in line with the tidymodels framework to use parameters -> grid_regular.
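
For comparison, the expand.grid route mentioned above might look like the following sketch. It uses only base R; the candidate values (two for mtry, three for min_n) are illustrative assumptions rather than values taken from the reprex.

```r
# Base R sketch: a fixed, per-parameter number of candidate values.
# The specific values below are illustrative assumptions.
manual_grid <- expand.grid(
  mtry  = c(1L, 5L),                                # 2 values
  min_n = as.integer(seq(10, 20, length.out = 3)),  # 3 values: 10, 15, 20
  KEEP.OUT.ATTRS = FALSE
)

nrow(manual_grid)  # 2 * 3 = 6 combinations
```

The drawback is that the candidate values are written out by hand instead of being derived from the ranges stored in the dials parameter objects.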

Here's a reprex with at least 2 values for each (using rand_forest instead so we have fewer parameters):

library(tidymodels)
#> Registered S3 method overwritten by 'xts':
#>   method     from
#>   as.zoo.xts zoo
#> ── Attaching packages ────────────────────────────────────── tidymodels 0.1.0 ──
#> ✔ broom     0.5.4          ✔ recipes   0.1.9     
#> ✔ dials     0.0.5          ✔ rsample   0.0.5     
#> ✔ dplyr     0.8.5          ✔ tibble    3.0.0     
#> ✔ ggplot2   3.2.1          ✔ tune      0.0.1     
#> ✔ infer     0.5.1          ✔ workflows 0.1.0     
#> ✔ parsnip   0.0.5          ✔ yardstick 0.0.4.9000
#> ✔ purrr     0.3.3
#> ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
#> ✖ purrr::discard()    masks scales::discard()
#> ✖ dplyr::filter()     masks stats::filter()
#> ✖ dplyr::lag()        masks stats::lag()
#> ✖ ggplot2::margin()   masks dials::margin()
#> ✖ recipes::step()     masks stats::step()
#> ✖ recipes::yj_trans() masks scales::yj_trans()

rf_eng <-
  rand_forest("classification") %>%
  set_engine("ranger") %>%
  set_args(
    trees = tune(),
    mtry = tune(),
    min_n = tune()
  )

rf_workflow <- workflow() %>% add_model(rf_eng)

# Here is where I want to set a range of fixed parameters
# I used tune() above as a placeholder
rf_params <- rf_workflow %>%
  parameters() %>%
  update(trees = trees(c(100L, 200L))) %>%
  update(mtry = mtry(c(1L, 5L))) %>%
  update(min_n = min_n(c(10L, 20L))) %>%
  finalize

rf_eng$args %>% names
#> [1] "mtry"  "trees" "min_n"

# Here I'm matching the order of the
# entries in `levels` arg to that
# of what was set in update above, not
# what matches the order of the args
# when running `rf_eng$args %>% names`
rf_params %>%
  grid_regular(
    levels = c(
      trees = 3, # supposed to represent trees
      mtry = 2, # supposed to represent mtry
      min_n = 2  # supposed to represent min_n
    )
  )
#> # A tibble: 12 x 3
#>     mtry trees min_n
#>    <int> <int> <int>
#>  1     1   100    10
#>  2     3   100    10
#>  3     5   100    10
#>  4     1   200    10
#>  5     3   200    10
#>  6     5   200    10
#>  7     1   100    20
#>  8     3   100    20
#>  9     5   100    20
#> 10     1   200    20
#> 11     3   200    20
#> 12     5   200    20

rf_params %>%
  grid_regular(
    levels = c(
      2, # mtry
      3, # trees
      2 # min_n
    )
  )
#> # A tibble: 12 x 3
#>     mtry trees min_n
#>    <int> <int> <int>
#>  1     1   100    10
#>  2     5   100    10
#>  3     1   150    10
#>  4     5   150    10
#>  5     1   200    10
#>  6     5   200    10
#>  7     1   100    20
#>  8     5   100    20
#>  9     1   150    20
#> 10     5   150    20
#> 11     1   200    20
#> 12     5   200    20

The first call has only two levels for trees, even though I intended it to have 3, as shown by the levels vector I passed (I wrote it as a named vector just for clarity's sake).

It's not a bug. It's just confusing (to me) when:

  1. I pass grid_regular a parameters object that was built via engine -> set_args -> parameters -> update, instead of constructing it from a list from scratch, and
  2. I specify a vector as the argument to levels,

because it is not obvious which parameter grid_regular assigns each integer in the levels vector to.
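
One way to sidestep the ambiguity, sketched here under assumptions, is to write the level counts as a named vector and reorder it by the id column of the parameters object before calling grid_regular. The parameter set below is built directly with dials as a stand-in for rf_params, and wanted_levels is a hypothetical name; the sketch passes an unnamed vector so that it does not rely on grid_regular reading the names.

```r
library(dials)

# Stand-in for rf_params, built from a list so the example is self-contained.
# The element order (mtry, trees, min_n) mimics the order seen in the reprex.
rf_params <- parameters(
  list(
    mtry  = mtry(c(1L, 5L)),
    trees = trees(c(100L, 200L)),
    min_n = min_n(c(10L, 20L))
  )
)

# Level counts written in whatever order is convenient (hypothetical name).
wanted_levels <- c(trees = 3, mtry = 2, min_n = 2)

# Every parameter id must have exactly one entry.
stopifnot(setequal(names(wanted_levels), rf_params$id))

# Reorder to match the parameters object, then drop the names so the
# vector is matched purely by position.
levels_in_order <- unname(wanted_levels[rf_params$id])

rf_grid <- grid_regular(rf_params, levels = levels_in_order)
nrow(rf_grid)  # 3 * 2 * 2 = 12 combinations
```

Indexing by rf_params$id keeps the mapping explicit, so the result does not depend on the order in which the model arguments happen to be stored.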

topepo commented 4 years ago

It's between a bug and a feature. It assumes that level values are ordered the same as they are in the parameters object (which is not reasonable) but doesn't ask for a named vector (i.e. the names are ignored).

Either way, we'll make some changes so that it can take a named vector. Keep in mind that the names should be based on the id values. In your case those are the same as the parameter names but would be different when something like this is used:

rf_eng <-
  rand_forest("classification") %>%
  set_engine("ranger") %>%
  set_args(
    trees = tune("number of trees"),
    mtry = tune(),
    min_n = tune()
  )
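
To illustrate the distinction between parameter names and id values, here is a small sketch using dials alone. It assumes that, when a parameter set is built from a named list, the list names become the id values (playing the role of tune("number of trees") above); the ranges and the names p and lv are made up for the example.

```r
library(dials)

# The list name plays the role of the id given to tune() (an assumption of
# this sketch); the ranges are illustrative.
p <- parameters(
  list(
    `number of trees` = trees(c(100L, 200L)),
    mtry = mtry(c(1L, 5L))
  )
)

p$name  # parameter names
p$id    # id values; "number of trees" stands in for "trees" here

# A levels vector keyed by id rather than by parameter name,
# reordered by position to match the parameters object.
lv <- c(mtry = 2, `number of trees` = 3)
grid <- grid_regular(p, levels = unname(lv[p$id]))
```

Keying the vector by id means a custom label passed to tune() has to be repeated exactly, including spaces.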

github-actions[bot] commented 3 years ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.