ttzhou closed this issue 4 years ago
To add: if the parameters are specified via a list (rather than passing the workflow to `parameters`), then the order is maintained, i.e. the first call to `grid_regular` in the reprex above now works as intended.
A named vector would be preferable to a list and should achieve the same goal.
Sorry, my comment was unclear. I meant:
bt_params <- list(
  mtry = mtry(c(1L, 1L)),
  sample_size = sample_prop(c(1, 1)),
  trees = trees(c(100L, 100L)),
  tree_depth = tree_depth(c(2L, 2L)),
  min_n = min_n(c(10L, 10L)),
  loss_reduction = loss_reduction(c(0, 0), trans = NULL),
  learn_rate = learn_rate(c(0.1, 0.2), trans = NULL)
) %>%
  parameters() %>%
  finalize()
rather than passing a list to the `levels` arg...
First, that's what `update()` does. Second, you would not put parameter values that are constant in the grid. Those would be fixed in the model specification (since they are not being tuned):
library(tidymodels)
#> ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────── tidymodels 0.1.0 ──
#> ✓ broom 0.5.4 ✓ recipes 0.1.10
#> ✓ dials 0.0.6 ✓ rsample 0.0.6
#> ✓ dplyr 0.8.5 ✓ tibble 3.0.0
#> ✓ ggplot2 3.3.0 ✓ tune 0.1.0
#> ✓ infer 0.5.1 ✓ workflows 0.1.0
#> ✓ parsnip 0.0.5 ✓ yardstick 0.0.5
#> ✓ purrr 0.3.3
#> ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────── tidymodels_conflicts() ──
#> x purrr::discard() masks scales::discard()
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag() masks stats::lag()
#> x ggplot2::margin() masks dials::margin()
#> x recipes::step() masks stats::step()
bt_eng <-
  boost_tree("classification") %>%
  set_engine("xgboost") %>%
  set_args(
    mtry = 1,
    sample_size = 1,
    trees = 100,
    tree_depth = 2,
    min_n = 10,
    loss_reduction = 0,
    learn_rate = tune()
  )
bt_workflow <- workflow() %>%
  add_model(bt_eng)
bt_params <- bt_workflow %>%
  parameters() %>%
  update(
    learn_rate = learn_rate(c(0.1, 0.2), trans = NULL)
  ) %>%
  grid_regular()
bt_params
#> # A tibble: 3 x 1
#> learn_rate
#> <dbl>
#> 1 0.1
#> 2 0.15
#> 3 0.2
Created on 2020-04-02 by the reprex package (v0.3.0)
Ah, I just chose a constant value for the example so we could see enough of the resulting grid. Also, I changed the issue name to align more closely to what I was trying to illustrate.
The use case I'm thinking of is setting a fixed (non-random) number of values for each of the parameters to be tuned, where that number could vary between parameters, e.g. 2 for `mtry` and 3 for `min_n`.
I could achieve this using `expand.grid`, but it seems more in line with the tidymodels framework to use `parameters` -> `grid_regular`.
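For comparison, the `expand.grid` route could look roughly like the sketch below. It is base R only; the names `n_levels`, `ranges`, `param_values` and `manual_grid` are made up for illustration, and the ranges simply mirror the ones used in the reprex that follows.

```r
# Base R sketch: a different number of evenly spaced values per parameter.
# `n_levels` and `ranges` are illustrative names, not part of any package.
n_levels <- c(trees = 3, mtry = 2, min_n = 2)
ranges <- list(trees = c(100, 200), mtry = c(1, 5), min_n = c(10, 20))

# Build one integer sequence per parameter, looked up by name rather than
# by position, so the order of `n_levels` does not matter.
param_values <- lapply(names(ranges), function(p) {
  as.integer(round(seq(ranges[[p]][1], ranges[[p]][2], length.out = n_levels[[p]])))
})
names(param_values) <- names(ranges)

# Full factorial combination of the per-parameter sequences.
manual_grid <- expand.grid(param_values)
```

This works, but it bypasses the range and type information stored in the `dials` parameter objects, which is why going through `parameters` -> `grid_regular` seems preferable.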
Here's a reprex with at least 2 values for each (using `rand_forest` instead so we have fewer parameters):
library(tidymodels)
#> Registered S3 method overwritten by 'xts':
#> method from
#> as.zoo.xts zoo
#> ── Attaching packages ────────────────────────────────────── tidymodels 0.1.0 ──
#> ✔ broom 0.5.4 ✔ recipes 0.1.9
#> ✔ dials 0.0.5 ✔ rsample 0.0.5
#> ✔ dplyr 0.8.5 ✔ tibble 3.0.0
#> ✔ ggplot2 3.2.1 ✔ tune 0.0.1
#> ✔ infer 0.5.1 ✔ workflows 0.1.0
#> ✔ parsnip 0.0.5 ✔ yardstick 0.0.4.9000
#> ✔ purrr 0.3.3
#> ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
#> ✖ purrr::discard() masks scales::discard()
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
#> ✖ ggplot2::margin() masks dials::margin()
#> ✖ recipes::step() masks stats::step()
#> ✖ recipes::yj_trans() masks scales::yj_trans()
rf_eng <-
  rand_forest("classification") %>%
  set_engine("ranger") %>%
  set_args(
    trees = tune(),
    mtry = tune(),
    min_n = tune()
  )
rf_workflow <- workflow() %>% add_model(rf_eng)
# Here is where I want to set a range of fixed parameters
# I used tune() above as a placeholder
rf_params <- rf_workflow %>%
  parameters() %>%
  update(trees = trees(c(100L, 200L))) %>%
  update(mtry = mtry(c(1L, 5L))) %>%
  update(min_n = min_n(c(10L, 20L))) %>%
  finalize()
rf_eng$args %>% names
#> [1] "mtry" "trees" "min_n"
# Here I'm matching the order of the
# entries in `levels` arg to that
# of what was set in update above, not
# what matches the order of the args
# when running `rf_eng$args %>% names`
rf_params %>%
  grid_regular(
    levels = c(
      trees = 3, # supposed to represent trees
      mtry = 2, # supposed to represent mtry
      min_n = 2 # supposed to represent min_n
    )
  )
#> # A tibble: 12 x 3
#> mtry trees min_n
#> <int> <int> <int>
#> 1 1 100 10
#> 2 3 100 10
#> 3 5 100 10
#> 4 1 200 10
#> 5 3 200 10
#> 6 5 200 10
#> 7 1 100 20
#> 8 3 100 20
#> 9 5 100 20
#> 10 1 200 20
#> 11 3 200 20
#> 12 5 200 20
rf_params %>%
  grid_regular(
    levels = c(
      2, # mtry
      3, # trees
      2 # min_n
    )
  )
#> # A tibble: 12 x 3
#> mtry trees min_n
#> <int> <int> <int>
#> 1 1 100 10
#> 2 5 100 10
#> 3 1 150 10
#> 4 5 150 10
#> 5 1 200 10
#> 6 5 200 10
#> 7 1 100 20
#> 8 5 100 20
#> 9 1 150 20
#> 10 5 150 20
#> 11 1 200 20
#> 12 5 200 20
The first call only has two levels for `trees`, even though I intended it to have 3, as seen in the `levels` vector passed (I wrote it as a named vector, just for clarity's sake).
It's not a bug. It's just confusing (to me) if a `parameters` object is passed to `grid_regular` via `engine` -> `set_args` -> `parameters` -> `update`, instead of being constructed from a list from scratch, and the way `grid_regular` interprets the `levels` vector makes it unclear which parameter each integer corresponds to.
It's between a bug and feature. It assumes that level values are ordered the same as they are in the parameters object (which is not reasonable) but doesn't ask for a named vector (i.e. the names are ignored).
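To make the mismatch concrete, here is a small base R sketch of positional versus name-based matching. It does not call `grid_regular`; `ids`, `requested`, `by_position` and `by_name` are illustrative names, and the id order is the one shown by `rf_eng$args %>% names` in the reprex above.

```r
# Order of the parameters as stored in the parameters object (from the reprex).
ids <- c("mtry", "trees", "min_n")

# What the user asked for, written as a named vector.
requested <- c(trees = 3, mtry = 2, min_n = 2)

# Positional matching: the names are ignored, so the first value (3) goes to
# the first id (mtry) and trees ends up with only 2 levels.
by_position <- setNames(unname(requested), ids)

# Name-based matching: reorder the requested values by id before using them.
by_name <- requested[ids]
```

`by_position` reproduces the first grid in the reprex (3 values of `mtry`, 2 of `trees`), while `by_name` is what the named vector was meant to express.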
Either way, we'll make some changes so that it can take a named vector. Keep in mind that the names should be based on the id values. In your case those are the same as the parameter names but would be different when something like this is used:
rf_eng <-
  rand_forest("classification") %>%
  set_engine("ranger") %>%
  set_args(
    trees = tune("number of trees"),
    mtry = tune(),
    min_n = tune()
  )
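As a rough sketch of how id-based names might be used, assuming a `dials` release in which `grid_regular()` accepts a named `levels` vector (the versions shown in this thread do not): the parameter set below is built directly from a named list, so the list names become the ids. The id `n_trees` and the object names are hypothetical and chosen only to show an id that differs from the parameter name.

```r
library(dials)

# Build the parameter set from a named list; the list names become the ids.
# `n_trees` is a made-up id that differs from the parameter name `trees`.
rf_params <- parameters(list(
  n_trees = trees(c(100L, 200L)),
  mtry = mtry(c(1L, 5L)),
  min_n = min_n(c(10L, 20L))
))

# Assumption: this dials version matches `levels` to parameters by id,
# so the order of the names here does not matter.
rf_grid <- grid_regular(
  rf_params,
  levels = c(min_n = 2, n_trees = 3, mtry = 2)
)
```

Note that the name used in `levels` is `n_trees` (the id), not `trees` (the parameter name).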
Reprex below:
Note that the output of `grid_regular` does have two levels, but they are duplicate rows, and that the arrangement of the columns corresponds to `bt_eng$args %>% names`.
If I pass the `levels` arg vector in the order matching `bt_eng$args %>% names`, then I get the expected grid (the last chunk of code).
Would it make sense to have a note in the documentation about this?
Alternatively, allow passing a named vector to `levels`, with names indicating which parameter to assign the integer to...
Thanks (and stay healthy).
EDIT: I took out the loading of `fastpipe` from the reprex; it makes no difference and just adds clutter.