Closed dgrtwo closed 3 years ago
I'll start a branch for this to test out. I like it but there are some additional things that we need to think about.
For example, there are going to be model definitions in parsnip
where all of the engines are in separate packages. We are about to add generalized additive models and that is probably how it will happen.
In that case, we could default engine = "mgvc"
without error in the model function but almost everything else would error (like translate()
) without the other package loaded. So, we'll need to see if that type of model function should not have a default or error trap the supporting functions well (again, if the other package is not loaded).
I very much like the idea of a one-line path to an lm()
model spec (and so on). I'd love to be able to let the data assign the model mode, but we tried really hard to do that and it can't happen (we need the code temp[late before model fit time and that is mode-specific).
So, we'll need to see if that type of model function should not have a default or error trap the supporting functions well (again, if the other package is not loaded).
Is there a reason not to throw an error immediately?
I'm not sure this presents a different problem than the set_engine()
syntax (they could still do set_engine("mgvc")
, and then run into errors later). Yes it's explicit rather than setting a default, but just because the user types set_engine("mgvc")
doesn't mean they know that this relies on an external dependency and confirmed that they had the package installed.
I actually don't think I understand why linear_reg() %>% set_engine("stan")
doesn't throw an error if "rstanarm" is not installed. Is the theory that they could install it in between setting the engine and applying it? Or that different packages could be used for training vs predicting (is that common)?
Is there a reason not to throw an error immediately?
We could defer errors so that every function could have a default engine. We would throw errors when the missing data about the engine is absent (see next item)
I'm not sure this presents a different problem than the set_engine() syntax (they could still do
set_engine("mgvc")
, and then run into errors later).
Things like translate()
need to lookup information about the model/engine comb. If there is no data there, then it fails. We should do it gracefully.
I actually don't think I understand why
linear_reg() %>% set_engine("stan")
doesn't throw an error if "rstanarm" is not installed.
That package is needed only at fit-time. We try to avoid having links to any links to code from an engine dependency (rlang::call2()
is 🔥 ).
There is an S3 method that lists the required packages for each model, but it is not invoked until the last minute.
Things like translate() need to lookup information about the model/engine comb. If there is no data there, then it fails. We should do it gracefully.
Is that any different than if someone forgets the engine, e,g, gam() %>% fit(...)
without set_engine("mgcv")
?
Or, for that matter, if they do set_engine("mgcv")
when they don't have it installed?
It's even more fault tolerant than I thought it was:
# remove.packages("kernlab")
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
# S'all good
mod <-
svm_rbf(cost = 1) %>%
set_mode("regression") %>%
set_engine("kernlab")
translate(mod)
#> Radial Basis Function Support Vector Machine Specification (regression)
#>
#> Main Arguments:
#> cost = 1
#>
#> Computational engine: kernlab
#>
#> Model fit template:
#> kernlab::ksvm(x = missing_arg(), data = missing_arg(), C = 1,
#> kernel = "rbfdot")
update(mod, cost = 2)
#> Radial Basis Function Support Vector Machine Specification (regression)
#>
#> Main Arguments:
#> cost = 2
#>
#> Computational engine: kernlab
# Nope
mod %>% fit(mpg ~ ., data = mtcars)
#> Error: This engine requires some package installs: 'kernlab'
Created on 2021-06-14 by the reprex package (v2.0.0)
Even with no set_engine()
:
# pak::pak("kernlab")
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
svm_rbf(cost = 1) %>%
set_mode("regression") %>%
fit(mpg ~ ., data = mtcars)
#> Warning: Engine set to `kernlab`.
#> parsnip model object
#>
#> Fit time: 11ms
#> Support Vector Machine object of class "ksvm"
#>
#> SV type: eps-svr (regression)
#> parameter : epsilon = 0.1 cost C = 1
#>
#> Gaussian Radial Basis kernel function.
#> Hyperparameter : sigma = 0.0783677791883467
#>
#> Number of Support Vectors : 29
#>
#> Objective Function Value : -8.4001
#> Training error : 0.132159
Created on 2021-06-14 by the reprex package (v2.0.0)
Closed in #515
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.
Could you consider having an
engine =
argument to each model_spec function, rather than requiringset_engine()
afterward? For instance:This is more compact, particularly since it becomes more straightforward to pass a model spec as an argument (e.g. to
set_model()
) without giving it its own two lines of code for assignment. (Besides being shorter, not having to create an object means not having to give it a name likelin_mod
that might need to change in two places if you change the model spec).I'd also find it a more natural workflow if these engines had defaults, like
"lm"
forlinear_reg()
. parsnip is already opinionated about the model ifset_engine()
is not called, like choosing"ranger"
forrand_forest()
, though it gives a warning about that. (Some likenearest_neighbor()
orprophet()
have only one engine implemented, but still give the warning, which seems especially harsh).None of this precludes
set_engine()
still keeping its current functionality for the sake of reverse compatibility, ability to modify an existing recipe. and setting extra arguments (Similarly, in ggplot2 you can do eitherggplot(data, aes(...))
orggplot(data) + aes(...)
).