Adding an engine argument to model_spec functions (with default)

tidymodels / parsnip

A tidy unified interface to models

https://parsnip.tidymodels.org

Other

600 stars 89 forks source link

Adding an engine argument to model_spec functions (with default) #513

Closed dgrtwo closed 3 years ago

dgrtwo commented 3 years ago

Could you consider having an engine = argument to each model_spec function, rather than requiring set_engine() afterward? For instance:

# Instead of requiring:
logistic_reg() %>%
  set_engine("glm")

# Allow
logistic_reg(engine = "glm")

This is more compact, particularly since it becomes more straightforward to pass a model spec as an argument (e.g. to set_model()) without giving it its own two lines of code for assignment. (Besides being shorter, not having to create an object means not having to give it a name like lin_mod that might need to change in two places if you change the model spec).

I'd also find it a more natural workflow if these engines had defaults, like "lm" for linear_reg(). parsnip is already opinionated about the model if set_engine() is not called, like choosing "ranger" for rand_forest(), though it gives a warning about that. (Some like nearest_neighbor() or prophet() have only one engine implemented, but still give the warning, which seems especially harsh).

None of this precludes set_engine() still keeping its current functionality for the sake of reverse compatibility, ability to modify an existing recipe. and setting extra arguments (Similarly, in ggplot2 you can do either ggplot(data, aes(...)) or ggplot(data) + aes(...)).

topepo commented 3 years ago

I'll start a branch for this to test out. I like it but there are some additional things that we need to think about.

For example, there are going to be model definitions in parsnip where all of the engines are in separate packages. We are about to add generalized additive models and that is probably how it will happen.

In that case, we could default engine = "mgvc" without error in the model function but almost everything else would error (like translate()) without the other package loaded. So, we'll need to see if that type of model function should not have a default or error trap the supporting functions well (again, if the other package is not loaded).

I very much like the idea of a one-line path to an lm() model spec (and so on). I'd love to be able to let the data assign the model mode, but we tried really hard to do that and it can't happen (we need the code temp[late before model fit time and that is mode-specific).

dgrtwo commented 3 years ago

So, we'll need to see if that type of model function should not have a default or error trap the supporting functions well (again, if the other package is not loaded).

Is there a reason not to throw an error immediately?

I'm not sure this presents a different problem than the set_engine() syntax (they could still do set_engine("mgvc"), and then run into errors later). Yes it's explicit rather than setting a default, but just because the user types set_engine("mgvc") doesn't mean they know that this relies on an external dependency and confirmed that they had the package installed.

I actually don't think I understand why linear_reg() %>% set_engine("stan") doesn't throw an error if "rstanarm" is not installed. Is the theory that they could install it in between setting the engine and applying it? Or that different packages could be used for training vs predicting (is that common)?

topepo commented 3 years ago

Is there a reason not to throw an error immediately?

We could defer errors so that every function could have a default engine. We would throw errors when the missing data about the engine is absent (see next item)

I'm not sure this presents a different problem than the set_engine() syntax (they could still do set_engine("mgvc"), and then run into errors later).

Things like translate() need to lookup information about the model/engine comb. If there is no data there, then it fails. We should do it gracefully.

I actually don't think I understand why linear_reg() %>% set_engine("stan") doesn't throw an error if "rstanarm" is not installed.

That package is needed only at fit-time. We try to avoid having links to any links to code from an engine dependency (rlang::call2() is 🔥 ).

There is an S3 method that lists the required packages for each model, but it is not invoked until the last minute.

dgrtwo commented 3 years ago

Things like translate() need to lookup information about the model/engine comb. If there is no data there, then it fails. We should do it gracefully.

Is that any different than if someone forgets the engine, e,g, gam() %>% fit(...) without set_engine("mgcv")?

Or, for that matter, if they do set_engine("mgcv") when they don't have it installed?

topepo commented 3 years ago

It's even more fault tolerant than I thought it was:

# remove.packages("kernlab")
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip

# S'all good
mod <- 
  svm_rbf(cost = 1) %>% 
  set_mode("regression") %>% 
  set_engine("kernlab")

translate(mod)
#> Radial Basis Function Support Vector Machine Specification (regression)
#> 
#> Main Arguments:
#>   cost = 1
#> 
#> Computational engine: kernlab 
#> 
#> Model fit template:
#> kernlab::ksvm(x = missing_arg(), data = missing_arg(), C = 1, 
#>     kernel = "rbfdot")

update(mod, cost = 2)
#> Radial Basis Function Support Vector Machine Specification (regression)
#> 
#> Main Arguments:
#>   cost = 2
#> 
#> Computational engine: kernlab

# Nope
mod %>% fit(mpg ~ ., data = mtcars)
#> Error: This engine requires some package installs: 'kernlab'

^{Created on 2021-06-14 by the reprex package (v2.0.0)}

Even with no set_engine():

# pak::pak("kernlab")
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip

svm_rbf(cost = 1) %>% 
  set_mode("regression") %>% 
  fit(mpg ~ ., data = mtcars)
#> Warning: Engine set to `kernlab`.
#> parsnip model object
#> 
#> Fit time:  11ms 
#> Support Vector Machine object of class "ksvm" 
#> 
#> SV type: eps-svr  (regression) 
#>  parameter : epsilon = 0.1  cost C = 1 
#> 
#> Gaussian Radial Basis kernel function. 
#>  Hyperparameter : sigma =  0.0783677791883467 
#> 
#> Number of Support Vectors : 29 
#> 
#> Objective Function Value : -8.4001 
#> Training error : 0.132159

^{Created on 2021-06-14 by the reprex package (v2.0.0)}

juliasilge commented 3 years ago

Closed in #515

github-actions[bot] commented 3 years ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.