tidymodels / tune

Tools for tidy parameter tuning
https://tune.tidymodels.org

syntax recommender functions #167

Closed · topepo closed this 4 years ago

topepo commented 4 years ago

We should have some suggest() functions that, given some data information, will print out a scaffold for recipe and model object definitions.

For example, suggest_glmnet(Sepal.Length ~ ., data = iris) might print to the console:

rec <-
  recipe(Sepal.Length ~ ., data = iris) %>%
  step_novel(all_nominal(), -all_outcomes()) %>%
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>% 
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors())

mod <- 
  parsnip::linear_reg(penalty = tune(), mixture = tune()) %>% 
  set_mode("regression") %>% 
  set_engine("glmnet")

wflw <- 
  workflow() %>% 
  add_model(mod) %>% 
  add_recipe(rec)
BobMuenchen commented 4 years ago

Nice idea!

topepo commented 4 years ago

The current prototype has results like the ones below. Suggestions are welcome.

I might make the grid code optional. I'm also going to add an option, suggested by @juliasilge, to include comments that explain why we do things (like centering/scaling, making dummy variables, etc.).

I'll make a branch with this code in a few days.

> template_xgboost(Species ~ ., data = iris)
xgb_recipe <- 
  recipe(formula = Species ~ ., data = iris) %>% 
  step_zv(all_predictors()) 

xgb_model <- 
  boost_tree(trees = tune(), min_n = tune(), tree_depth = tune(), 
    learn_rate = tune(), loss_reduction = tune(), sample_size = tune()) %>% 
  set_mode("classification") %>% 
  set_engine("xgboost") 

xgb_workflow <-
  workflows::workflow() %>%
  workflows::add_recipe(xgb_recipe) %>%
  workflows::add_model(xgb_model) 

set.seed(62147)
xgb_tune <- 
  tune_grid(xgb_workflow, resamples = stop("add your rsample object"), 
    grid = 20) 

> template_xgboost(Sepal.Length ~ ., data = iris)
xgb_recipe <- 
  recipe(formula = Sepal.Length ~ ., data = iris) %>% 
  step_zv(all_predictors()) %>% 
  step_novel(all_nominal(), -all_outcomes()) %>% 
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) 

xgb_model <- 
  boost_tree(trees = tune(), min_n = tune(), tree_depth = tune(), 
    learn_rate = tune(), loss_reduction = tune(), sample_size = tune()) %>% 
  set_mode("regression") %>% 
  set_engine("xgboost") 

xgb_workflow <-
  workflows::workflow() %>%
  workflows::add_recipe(xgb_recipe) %>%
  workflows::add_model(xgb_model) 

set.seed(82702)
xgb_tune <- 
  tune_grid(xgb_workflow, resamples = stop("add your rsample object"), 
    grid = 20) 
juliasilge commented 4 years ago

Can you say why the grid code would be optional?

topepo commented 4 years ago

I didn't want to assume that users will go straight to tuning (but maybe that's not a good assumption).

juliasilge commented 4 years ago

So maybe some options like tune = TRUE and verbose = TRUE? And when tune = FALSE it uses the model defaults, with no tuning?
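Something like this, purely as a hypothetical sketch of the signature:

template_xgboost(
  Species ~ ., data = iris,
  tune = TRUE,    # FALSE: use the model defaults, with no tune() placeholders
  verbose = TRUE  # add explanatory comments to the generated code
)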

topepo commented 4 years ago

Yes, I think that this is the plan 😄

topepo commented 4 years ago

There's a template branch now. No tests or code for comments yet, but it is a start. The rlang-y bits cause a huge number of false positives for global variables in R CMD check 🙄
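(For those NOTEs, a common workaround, shown here only as an illustration with placeholder names, is to declare the flagged names up front:)

# Hypothetical sketch: declare names that R CMD check flags as
# "no visible binding for global variable" due to tidy evaluation.
utils::globalVariables(c("penalty", "mixture", "trees", "min_n"))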

EmilHvitfeldt commented 4 years ago

Looks great so far! Two comments:

Firstly, it looks like there are some naming mismatches in the current version:

library(tune)

template_xgboost(Species ~ ., data = iris)
#> xgb_recipe <- 
#>   recipe(formula = Species ~ ., data = iris) %>% 
#>   step_zv(all_predictors()) 
#> 
#> xgb_model <- 
#>   boost_tree(trees = tune(), min_n = tune(), tree_depth = tune(), 
#>     learn_rate = tune(), loss_reduction = tune(), sample_size = tune()) %>% 
#>   set_mode("classification") %>% 
#>   set_engine("xgboost") 
#> 
#> xgb_wflw <- 
#>   workflow() %>% 
#>   add_recipe(xgb_rec) %>% 
#>   add_model(xgb_mod) 
#> 
#> set.seed(51254)
#> xgb_tune <-
#>   tune_grid(xgb_workflow, resamples = stop("add your rsample object"), 
#>     grid = 20)

Created on 2020-02-13 by the reprex package (v0.3.0)

The recipe is created as xgb_recipe, but when it is added to the workflow it is referenced as xgb_rec (likewise xgb_model vs. xgb_mod).

Secondly, a common mistake I still make is forgetting to turn the outcome variable into a factor. This edge case could be handled easily by adding a step_string2factor() to the recipe when the outcome is a string.

In general, I would argue that the scaffolding should be fairly minimal, but this change would avoid a common pain point without much controversy.

topepo commented 4 years ago

Good idea.

This may get ugly for data sets with a large number of string columns (it will be easier with an upcoming tidyselect version), but:

library(tidymodels)
#> ── Attaching packages ───────────────────────────────────────────────────── tidymodels 0.0.4 ──
#> ✓ broom     0.5.4          ✓ recipes   0.1.9     
#> ✓ dials     0.0.4          ✓ rsample   0.0.5.9000
#> ✓ dplyr     0.8.4          ✓ tibble    2.1.3     
#> ✓ ggplot2   3.2.1          ✓ tune      0.0.1     
#> ✓ infer     0.5.1          ✓ workflows 0.1.0     
#> ✓ parsnip   0.0.5          ✓ yardstick 0.0.4.9000
#> ✓ purrr     0.3.3
#> ── Conflicts ──────────────────────────────────────────────────────── tidymodels_conflicts() ──
#> x purrr::discard()    masks scales::discard()
#> x dplyr::filter()     masks stats::filter()
#> x dplyr::lag()        masks stats::lag()
#> x ggplot2::margin()   masks dials::margin()
#> x recipes::step()     masks stats::step()
#> x recipes::yj_trans() masks scales::yj_trans()
library(tune)

data(hpc_data, package = "modeldata")

hpc_data_alt <- 
  hpc_data %>% 
  mutate_if(is.factor, as.character)

template_glmnet(class ~ ., data = hpc_data_alt, verbose = TRUE)
#> glmn_recipe <- 
#>   recipe(formula = class ~ ., data = hpc_data_alt) %>% 
#>   # For modeling, it is preferred to encode qualitative data as factors 
#>   # (instead of character). 
#>   step_string2factor(one_of("protocol", "day", "class")) %>% 
#>   step_novel(all_nominal(), -all_outcomes()) %>% 
#>   # This model requires the predictors to be numeric. The most common 
#>   # method to convert qualitative predictors to numeric is to create binary 
#>   # indicator variables (aka dummy variables) from these predictors. 
#>   # However, for this model, binary indicator variables can be made for 
#>   # each of the levels of the factors (known as 'one-hot encoding'). 
#>   step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>% 
#>   # Regularization methods sum up functions of the model slope 
#>   # coefficients. Because of this, the predictor variables should be on the 
#>   # same scale. Before centering and scaling the numeric predictors, any 
#>   # predictors with a single unique value are filtered out. 
#>   step_zv(all_predictors()) %>% 
#>   step_normalize(all_predictors(), -all_nominal()) 
#> 
#> glmn_model <- 
#>   multinom_reg(penalty = tune(), mixture = tune()) %>% 
#>   set_mode("classification") %>% 
#>   set_engine("glmnet") 
#> 
#> glmn_wflw <- 
#>   workflow() %>% 
#>   add_recipe(glmn_recipe) %>% 
#>   add_model(glmn_model) 
#> 
#> glmn_grid <- expand.grid(penalty = 10^seq(-6, -1, length.out = 20), mixture = c(0.05, 
#>     0.2, 0.4, 0.6, 0.8, 1)) 
#> 
#> glmn_tune <- 
#>   tune_grid(glmn_workflow, resamples = stop("add your rsample object"), grid = glmn_grid)

Created on 2020-02-13 by the reprex package (v0.3.0)

BobMuenchen commented 4 years ago

I love all the comments you included! The last line above refers to "glmn_workflow" while its definition names it "glmn_wflw". I'm a fan of writing things out in reference material, so the longer name is my preference. Similarly, when referring to kappa, I would spell it out rather than abbreviate it as "kap". I work with beginning grad students a lot, and all the little things we can do to lessen the mental load add up.

dilsherdhillon commented 4 years ago

I think this is an excellent idea! I do agree that a grid shouldn't be included, or at the very least should be optional. One suggestion would be to include a 5-fold CV rsample chunk and a fit_resamples() call, as sketched below. This would enable quickly getting a somewhat "honest" baseline estimate of the model's performance, after which users can move on to tuning hyperparameters.
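For illustration, that chunk might look something like this (object names are hypothetical, reusing the xgb_workflow from the template above):

xgb_folds <- vfold_cv(iris, v = 5)

xgb_initial_res <-
  fit_resamples(xgb_workflow, resamples = xgb_folds)

collect_metrics(xgb_initial_res)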

topepo commented 4 years ago

@dilsherdhillon The current code has an option for the grid.

For me, there is too much bias in 5-fold CV. In any case, it is easy for people to plug in whatever resampling object they have as it stands now; see the example below.
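For example, swapping the stop() placeholder in the generated code for an actual rsample object:

xgb_tune <- 
  tune_grid(xgb_workflow, resamples = vfold_cv(iris, v = 10), 
    grid = 20)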

milosvil commented 4 years ago

This is my suggestion: an addin that lets the user select the data, outcome, mode, model, engine, tuning parameters, resample type, grid type, and metrics, then generates and inserts the code. In this version the recipe part is very simple; I need step combinations for each model. We can also add more details and options.

[GIF: template_rec, an animated demo of the addin]

Code from demo:

set.seed(42)
mtcars_split <- initial_split(mtcars, prop = 0.75)
mtcars_train <- training(mtcars_split)
mtcars_test <- testing(mtcars_split)

mtcars_recipe <-
  recipe(formula = mpg ~ ., data = mtcars_train)

mtcars_model <-
  boost_tree(
    mtry = tune(),
    trees = tune(),
    min_n = tune(),
    tree_depth = tune(),
    learn_rate = tune(),
    loss_reduction = tune(),
    sample_size = tune()
  ) %>%
  set_mode("regression") %>%
  set_engine("xgboost")

mtcars_wflow <-
  workflow() %>%
  add_model(mtcars_model) %>%
  add_recipe(mtcars_recipe)

mtcars_params <-
  mtcars_wflow %>%
  parameters() %>%
  update(
    `mtry` = finalize(mtry(), x = mtcars_train),
    `trees` = trees(c(1, 2000)),
    `min_n` = min_n(c(2, 40)),
    `tree_depth` = tree_depth(c(1, 15)),
    `learn_rate` = learn_rate(c(-10, -1)),
    `loss_reduction` = loss_reduction(c(-10, 1.5)),
    `sample_size` = sample_size(c(0.1, 1))
  )

mtcars_resamples <-
  vfold_cv(
    mtcars_train,  # resample the training data rather than the full data set
    v = 10,
    repeats = 1
  )

mtcars_grid <-
  grid_latin_hypercube(
    mtcars_params,
    size = 3,
    original = TRUE
  )

mtcars_metrics <-
  yardstick::metric_set(
    rmse,
    rsq,
    mae
  )

mtcars_grid_form <-
  tune_grid(
    mtcars_wflow,
    resamples = mtcars_resamples,
    grid = mtcars_grid,
    metrics = mtcars_metrics,
    control = control_grid(verbose = FALSE)
  ) 
BobMuenchen commented 4 years ago

I think a GUI is a great way to get started, especially for students just getting to know R.

topepo commented 4 years ago

This is now in the usemodels package.
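For anyone finding this later, a minimal sketch of the call there (assuming the released usemodels API):

library(usemodels)
use_xgboost(Species ~ ., data = iris)

This prints a recipe/model/workflow/tuning scaffold to the console, much like the prototypes above.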

github-actions[bot] commented 3 years ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.