Closed topepo closed 4 years ago
Nice idea!
Current prototype has results like the one below. Suggestions are welcome.
I might make the grid code optional. I'm also going to add an option suggested by @juliasilge to put comments in that explain why we do things (like center/scale, make dummy variables etc).
I'll make a branch with this code in a few days.
> template_xgboost(Species ~ ., data = iris)
xgb_recipe <-
  recipe(formula = Species ~ ., data = iris) %>%
  step_zv(all_predictors())

xgb_model <-
  boost_tree(trees = tune(), min_n = tune(), tree_depth = tune(),
    learn_rate = tune(), loss_reduction = tune(), sample_size = tune()) %>%
  set_mode("classification") %>%
  set_engine("xgboost")

xgb_workflow <-
  workflows::workflow() %>%
  workflows::add_recipe(xgb_recipe) %>%
  workflows::add_model(xgb_model)

set.seed(62147)
xgb_tune <-
  tune_grid(xgb_workflow, resamples = stop("add your rsample object"),
    grid = 20)
> template_xgboost(Sepal.Length ~ ., data = iris)
xgb_recipe <-
  recipe(formula = Sepal.Length ~ ., data = iris) %>%
  step_zv(all_predictors()) %>%
  step_novel(all_nominal(), -all_outcomes()) %>%
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE)

xgb_model <-
  boost_tree(trees = tune(), min_n = tune(), tree_depth = tune(),
    learn_rate = tune(), loss_reduction = tune(), sample_size = tune()) %>%
  set_mode("regression") %>%
  set_engine("xgboost")

xgb_workflow <-
  workflows::workflow() %>%
  workflows::add_recipe(xgb_recipe) %>%
  workflows::add_model(xgb_model)

set.seed(82702)
xgb_tune <-
  tune_grid(xgb_workflow, resamples = stop("add your rsample object"),
    grid = 20)
Can you say why the grid code would be optional?
I didn't want to assume that they will go straight to tuning (but maybe that's not a good assumption)
So maybe some options like tune = TRUE and verbose = TRUE? And when tune = FALSE it uses the model defaults, with no tuning?
Yes, I think that this is the plan 😄
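To make the tune = TRUE / tune = FALSE idea concrete, here is a minimal base-R sketch. The function scaffold_sketch() and its behavior are hypothetical and only illustrate the switch discussed above; it is not the code on the branch.

```r
# Hypothetical helper (not part of any package): prints a model
# specification either with tune() placeholders or with model defaults.
scaffold_sketch <- function(tune = TRUE) {
  # With tune = FALSE no arguments are written, so the defaults apply.
  args <- if (tune) "trees = tune(), min_n = tune()" else ""
  lines <- c(
    "xgb_model <-",
    paste0("  boost_tree(", args, ") %>%"),
    "  set_mode(\"classification\") %>%",
    "  set_engine(\"xgboost\")"
  )
  cat(lines, sep = "\n")
  # Return the lines invisibly so they can be inspected or reused.
  invisible(lines)
}

scaffold_sketch(tune = TRUE)
scaffold_sketch(tune = FALSE)
```

A verbose option could be handled the same way, by conditionally adding comment lines to the character vector before it is printed.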
There's a template branch. No tests or code for comments yet but it is a start. The rlang-y bits cause a huge number of false positives for global variables in R CMD check 🙄
Looks great so far! Two comments:
First, it looks like there are some naming mismatches in the current version:
library(tune)
template_xgboost(Species ~ ., data = iris)
#> xgb_recipe <-
#>   recipe(formula = Species ~ ., data = iris) %>%
#>   step_zv(all_predictors())
#>
#> xgb_model <-
#>   boost_tree(trees = tune(), min_n = tune(), tree_depth = tune(),
#>     learn_rate = tune(), loss_reduction = tune(), sample_size = tune()) %>%
#>   set_mode("classification") %>%
#>   set_engine("xgboost")
#>
#> xgb_wflw <-
#>   workflow() %>%
#>   add_recipe(xgb_rec) %>%
#>   add_model(xgb_mod)
#>
#> set.seed(51254)
#> xgb_tune <-
#>   tune_grid(xgb_workflow, resamples = stop("add your rsample object"),
#>     grid = 20)
Created on 2020-02-13 by the reprex package (v0.3.0)
The recipe is defined as xgb_recipe, but when it is added to the workflow it is called xgb_rec.
Second, a common mistake I still make is forgetting to turn the outcome variable into a factor. It seems like this edge case could easily be handled by adding a step_string2factor() to the recipe if the outcome is a string.
In general, I would argue that the scaffolding should be fairly minimal, but I feel this change would remove a common pain point without much controversy.
Good idea.
This may get ugly for data sets with a large number of string columns (but will be easier with an upcoming tidyselect version) but:
library(tidymodels)
#> ── Attaching packages ───────────────────────────────────────────────────── tidymodels 0.0.4 ──
#> ✓ broom 0.5.4 ✓ recipes 0.1.9
#> ✓ dials 0.0.4 ✓ rsample 0.0.5.9000
#> ✓ dplyr 0.8.4 ✓ tibble 2.1.3
#> ✓ ggplot2 3.2.1 ✓ tune 0.0.1
#> ✓ infer 0.5.1 ✓ workflows 0.1.0
#> ✓ parsnip 0.0.5 ✓ yardstick 0.0.4.9000
#> ✓ purrr 0.3.3
#> ── Conflicts ──────────────────────────────────────────────────────── tidymodels_conflicts() ──
#> x purrr::discard() masks scales::discard()
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag() masks stats::lag()
#> x ggplot2::margin() masks dials::margin()
#> x recipes::step() masks stats::step()
#> x recipes::yj_trans() masks scales::yj_trans()
library(tune)
data(hpc_data, package = "modeldata")
hpc_data_alt <-
  hpc_data %>%
  mutate_if(is.factor, as.character)
template_glmnet(class ~ ., data = hpc_data_alt, verbose = TRUE)
#> glmn_recipe <-
#>   recipe(formula = class ~ ., data = hpc_data_alt) %>%
#>   # For modeling, it is preferred to encode qualitative data as factors
#>   # (instead of character).
#>   step_string2factor(one_of(protocol, day, class)) %>%
#>   step_novel(all_nominal(), -all_outcomes()) %>%
#>   # This model requires the predictors to be numeric. The most common
#>   # method to convert qualitative predictors to numeric is to create binary
#>   # indicator variables (aka dummy variables) from these predictors.
#>   # However, for this model, binary indicator variables can be made for
#>   # each of the levels of the factors (known as 'one-hot encoding').
#>   step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>%
#>   # Regularization methods sum up functions of the model slope
#>   # coefficients. Because of this, the predictor variables should be on the
#>   # same scale. Before centering and scaling the numeric predictors, any
#>   # predictors with a single unique value are filtered out.
#>   step_zv(all_predictors()) %>%
#>   step_normalize(all_predictors(), -all_nominal())
#>
#> glmn_model <-
#>   multinom_reg(penalty = tune(), mixture = tune()) %>%
#>   set_mode("classification") %>%
#>   set_engine("glmnet")
#>
#> glmn_wflw <-
#>   workflow() %>%
#>   add_recipe(glmn_recipe) %>%
#>   add_model(glmn_model)
#>
#> glmn_grid <- expand.grid(penalty = 10^seq(-6, -1, length.out = 20), mixture = c(0.05,
#>     0.2, 0.4, 0.6, 0.8, 1))
#>
#> glmn_tune <-
#>   tune_grid(glmn_workflow, resamples = stop("add your rsample object"), grid = glmn_grid)
Created on 2020-02-13 by the reprex package (v0.3.0)
I love all the comments you included! The last line above refers to "glmn_workflow" while its definition names it "glmn_wflw". I'm a fan of writing things out in reference material, so the longer one is my preference. Similarly, when referring to kappa, I would spell it out rather than abbreviate it to "kap". I work with beginning grad students a lot, and all the little things we can do to lessen the mental load add up.
I think this is an excellent idea! I do agree that a grid shouldn't be included or, at the very least, should be optional. One suggestion would be to include a 5-fold CV rsample chunk and a fit_resamples() call. This would make it quick to get a somewhat "honest" baseline estimate of the model's performance, after which users can move on to tuning hyperparameters etc.
@dilsherdhillon The current code has an option for the grid.
For me there is too much bias in 5-fold. Anyway, it is easy for people just to plug in whatever resampling object that they have as it stands now.
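As a rough illustration of the resampled baseline discussed above, the sketch below fits an untuned model across folds with fit_resamples(). The data set (mtcars), the formula, and the linear model are assumptions chosen only to keep the example small; v = 5 follows the suggestion, and any other rsample object can be swapped in.

```r
library(tidymodels)

# Resampling object; v = 5 follows the suggestion above, but any rsample
# object (for example, 10-fold CV) could be used instead.
set.seed(123)
folds <- vfold_cv(mtcars, v = 5)

# An untuned workflow: a plain linear regression with a formula preprocessor.
lm_workflow <-
  workflow() %>%
  add_formula(mpg ~ wt + hp) %>%
  add_model(linear_reg() %>% set_engine("lm"))

# Fit the workflow on each resample and summarize held-out performance.
lm_resampled <- fit_resamples(lm_workflow, resamples = folds)
collect_metrics(lm_resampled)
```

Because nothing is tuned here, the metrics serve as a baseline to compare against later tune_grid() results.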
This is my suggestion: an addin that lets the user select the data, outcome, mode, model, engine, tuning parameters, resample type, grid type, and metrics, and then generates and inserts the code. In this version the recipe part is very simple; I still need step combinations for each model. We can also add more details and options.
Code from demo:
set.seed(42)

mtcars_split <- initial_split(mtcars, prop = 0.75)
mtcars_train <- training(mtcars_split)
mtcars_test <- testing(mtcars_split)

mtcars_recipe <-
  recipe(formula = mpg ~ ., data = mtcars_train)

mtcars_model <-
  boost_tree(
    mtry = tune(),
    trees = tune(),
    min_n = tune(),
    tree_depth = tune(),
    learn_rate = tune(),
    loss_reduction = tune(),
    sample_size = tune()
  ) %>%
  set_mode("regression") %>%
  set_engine("xgboost")

mtcars_wflow <-
  workflow() %>%
  add_model(mtcars_model) %>%
  add_recipe(mtcars_recipe)

mtcars_params <-
  mtcars_wflow %>%
  parameters() %>%
  update(
    `mtry` = finalize(mtry(), x = mtcars_train),
    `trees` = trees(c(1, 2000)),
    `min_n` = min_n(c(2, 40)),
    `tree_depth` = tree_depth(c(1, 15)),
    `learn_rate` = learn_rate(c(-10, -1)),
    `loss_reduction` = loss_reduction(c(-10, 1.5)),
    `sample_size` = sample_size(c(0.1, 1))
  )

mtcars_resamples <-
  vfold_cv(
    mtcars,
    v = 10,
    repeats = 1
  )

mtcars_grid <-
  grid_latin_hypercube(
    mtcars_params,
    size = 3,
    original = TRUE
  )

mtcars_metrics <-
  yardstick::metric_set(
    rmse,
    rsq,
    mae
  )

mtcars_grid_form <-
  tune_grid(
    mtcars_wflow,
    resamples = mtcars_resamples,
    grid = mtcars_grid,
    metrics = mtcars_metrics,
    control = control_grid(verbose = FALSE)
  )
I think that GUI is a great way to get started, especially for students just getting to know R.
Now in the usemodels package.
We should have some suggest() functions that, given some data information, will print out a scaffold for recipe and model object definitions. For example, suggest_glmnet(Sepal.Length ~ ., data = iris) might print to the console: