tidymodels / parsnip

A tidy unified interface to models
https://parsnip.tidymodels.org
Other
564 stars 78 forks source link

Idea: move `check_args()` up into `new_model_spec()` #1098

Open EmilHvitfeldt opened 3 months ago

EmilHvitfeldt commented 3 months ago

Adding a call to check_args() on this line https://github.com/tidymodels/parsnip/blob/main/R/misc.R#L350 produces the following results. The only questionable use-case is the last one, which people might use.

Still looking at other places where this would fail

library(parsnip)
library(discrim)

discrim_linear(penalty = -2)
#> Error in `new_model_spec()`:
#> ! The amount of regularization, `penalty`, should be `>= 0`.

discrim_linear(penalty = tune()) %>% 
  set_args(penalty = 1)
#> Linear Discriminant Model Specification (classification)
#> 
#> Main Arguments:
#>   penalty = 1
#> 
#> Computational engine: MASS

discrim_linear(penalty = tune())
#> Linear Discriminant Model Specification (classification)
#> 
#> Main Arguments:
#>   penalty = tune()
#> 
#> Computational engine: MASS

discrim_linear(penalty = 1) %>% 
  set_args(penalty = -2)
#> Error in `new_model_spec()`:
#> ! The amount of regularization, `penalty`, should be `>= 0`.

discrim_linear(penalty = -2) %>% 
  set_args(penalty = 1)
#> Error in `new_model_spec()`:
#> ! The amount of regularization, `penalty`, should be `>= 0`.
EmilHvitfeldt commented 3 months ago

right now the only blocker we have found for this, is dependent on how we handle https://github.com/tidymodels/parsnip/issues/1099

simonpcouch commented 3 months ago

Very much agreed on the linked issue that we can do a better job checking arguments in check_args().

I think it's important to note here that the pattern in check_args() methods is to eval_tidy() each argument:

https://github.com/tidymodels/parsnip/blob/dc9a66e2424d2aed3d5c6663f1a997fdb5fac0ec/R/boost_tree.R#L167-L169

If we're able to eval_tidy() each argument to model specification functions as soon as they're inputted, then there's no reason we should be capturing them as quosures in the first place. The reason we capture them as quosures is so that we can evaluate them in an environment that we construct, the eval_env, at fit() time:

https://github.com/tidymodels/parsnip/blob/dc9a66e2424d2aed3d5c6663f1a997fdb5fac0ec/R/fit.R#L156-L158

This environment supports defining arguments in relation to the data to be fit()ted to, like:

library(parsnip)

boost_tree(mtry = ncol(.x()) / 2, mode = "regression") %>% fit(mpg ~ ., mtcars)
#> parsnip model object
#> 
#> ##### xgb.Booster
#> raw: 21.6 Kb 
#> call:
#>   xgboost::xgb.train(params = list(eta = 0.3, max_depth = 6, gamma = 0, 
#>     colsample_bytree = 1, colsample_bynode = 0.5, min_child_weight = 1, 
#>     subsample = 1), data = x$data, nrounds = 15, watchlist = x$watchlist, 
#>     verbose = 0, nthread = 1, objective = "reg:squarederror")
#> params (as set within xgb.train):
#>   eta = "0.3", max_depth = "6", gamma = "0", colsample_bytree = "1", colsample_bynode = "0.5", min_child_weight = "1", subsample = "1", nthread = "1", objective = "reg:squarederror", validate_parameters = "TRUE"
#> xgb.attributes:
#>   niter
#> callbacks:
#>   cb.evaluation.log()
#> # of features: 10 
#> niter: 15
#> nfeatures : 10 
#> evaluation_log:
#>      iter training_rmse
#>     <num>         <num>
#>         1    14.9313149
#>         2    10.9841618
#> ---                    
#>        14     0.6109443
#>        15     0.5196238

...where .x() is a data descriptor. As of now, check_args() knows that it's unable to evaluate such an expression:

boost_tree(mtry = ncol(.x()) / 2, mode = "regression") %>% check_args()
#> Error in `descr_env$.x()`:
#> ! Descriptor context not set

Created on 2024-04-03 with reprex v2.1.0

Calling check_args() will not be able to give the same output when called inside of new_model_spec() because it doesn't yet have access to the data--the fact that the behavior might differ between these two contexts, to me, signals that we ought not to check_args() until we have access to the data at fit().

There are definitely improvements to be made to argument checking, as in the loophole that you pointed out in the linked issue. I think that's a good place to direct focus for now. :)