simonpcouch opened 2 months ago
Ultimately, `recipes::juice()` and `hardhat::recompose()` are what handle running recipes preprocessors via `fit.action_recipe()` -> `mold()` -> `run_mold()` -> `mold_recipe_default_process()`, and they can avoid `model.matrix()` while they do so because recipes don't allow for tricky formulae. Carving out a path in `fit.action_formula()` to detect sparse tibbles and then go through a different preprocessor code path (recipe or variables) would only be possible if the formula were to pass `recipes:::inline_check()`.
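To make the distinction concrete, here's a minimal sketch of the two `hardhat::mold()` code paths (using `mtcars` as a stand-in dataset): the formula blueprint routes through `model.matrix()`, while the recipe blueprint never touches it.

```r
library(hardhat)
library(recipes)

mtcars_tbl <- tibble::as_tibble(mtcars)

# Formula path: mold() relies on model.matrix(), so inline transformations
# like poly() are allowed, but sparse inputs would be densified.
molded_formula <- mold(mpg ~ poly(disp, 2) + cyl, mtcars_tbl)

# Recipe path: no model.matrix() involved; transformations live in steps
# rather than in the formula, which is what makes avoiding densification
# possible.
rec <- recipe(mpg ~ disp + cyl, data = mtcars_tbl) |>
  step_poly(disp, degree = 2)
molded_recipe <- mold(rec, mtcars_tbl)
```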
In order of most to least preferred, my thoughts on approaches we could take for `fit()`ting workflows with formula preprocessors and sparse tibbles:
1) Maybe over-opinionated: Warn on `add_formula()` with a sparse tibble, recommending that people use `add_recipe()` instead to preserve sparsity (see the sketch after this list). This is a relatively easy switch in most cases; it's least convenient when people need to translate some tricky formulae into recipe steps, which isn't too bad.
2) Inconsistent, somewhat hacky: Carve out a path in `fit.action_formula()` to detect sparse tibbles and then go through a recipes code path if the formula passes `recipes:::inline_check()`, otherwise reverting to 1) or 3) or erroring.
3) High-effort and bug-prone: Rewrite a version of `model.matrix()` that can handle all the same kinds of tricky formulae, but that can do so with sparse vectors.
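Here's a hypothetical sketch of what the check in 1) could look like. The helper name `check_sparse_formula()` and the message wording are made up for illustration, and I'm assuming `sparsevctrs::is_sparse_vector()` as the relevant predicate:

```r
# Hypothetical helper for approach 1): warn when a formula preprocessor
# meets sparse columns, pointing users toward add_recipe().
check_sparse_formula <- function(data) {
  is_sparse <- vapply(data, sparsevctrs::is_sparse_vector, logical(1))
  if (any(is_sparse)) {
    cli::cli_warn(
      c(
        "`add_formula()` will densify sparse columns via `model.matrix()`.",
        "i" = "Use `add_recipe()` instead to preserve sparsity."
      )
    )
  }
}
```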
Quick note: if you install the latest dev version of {sparsevctrs} and run `withr::local_options("sparsevctrs.verbose_materialize" = 3)`, you now get an error if a sparse vector is force-materialized by something other than {sparsevctrs} itself.
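A quick reprex-style sketch of that behavior (assuming the dev version; exactly which operations trigger materialization depends on which sparse-aware methods exist):

```r
withr::local_options("sparsevctrs.verbose_materialize" = 3)

# A length-10 sparse double with a single nonzero entry.
x <- sparsevctrs::sparse_double(values = 3, positions = 2, length = 10)

# Arithmetic has no sparse-aware method here, so the vector is
# materialized to a dense double; at verbosity 3, that errors.
x + 1
```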
Notes from chatting with Emil on this:
parsnip experiences the same issue with its `fit()` interface: the sparse data is passed to `model.matrix()`, which is somewhat slow but, even more importantly, introduces a substantial slowdown with many engines when data that would have been better represented as sparse is instead dense. So, we need to address the analogous problem in parsnip.
Approach 1) (with 3) in the long term) is adaptable to both situations and is relatively future-proof. We would raise a condition if users pass sparse data to `fit.model_spec()` (as in, with a formula) to tell them to transition to `fit_xy()` if they'd like to make use of sparsity.
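For concreteness, a sketch of the two parsnip interfaces, assuming a hypothetical sparse tibble `sparse_df` with predictors `x1` and `x2` and outcome `y`:

```r
spec <- parsnip::boost_tree(mode = "regression") |>
  parsnip::set_engine("xgboost")

# fit() takes a formula, which routes through model.matrix() and
# densifies the data before it reaches the engine:
parsnip::fit(spec, y ~ x1 + x2, data = sparse_df)

# fit_xy() takes x and y directly and can hand sparse data to the
# engine as-is:
parsnip::fit_xy(spec, x = sparse_df[c("x1", "x2")], y = sparse_df$y)
```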
Users will only pass sparse data to `fit.workflow()` or `fit.model_spec()` if they do so intentionally; this is "opt-in," and tidymodels doesn't do so by itself. If recipes, at some point in the future, introduces default behavior to coerce to sparsity in smart situations, users are already on the happy path, because they're using a recipe preprocessor with a workflow, and workflows always uses parsnip's `fit_xy()` interface (unless users have a model formula, in which case... there's a tricky formula and `model.matrix()` will be needed!).
Emil and I chatted and are on the same page here!
I originally proposed in 1) that we would warn in that case. I think a more tidyverse-style approach would be to error and then reference an option (or `control_*()` argument) that users can set to say "I know I opted in to use sparse data but am choosing an interface that will coerce to dense and see a slowdown. Let me do it!"
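A hypothetical sketch of that error-plus-escape-hatch pattern; the option name `sparsevctrs.allow_densify` and the helper are made up for illustration (a `control_*()` argument would work the same way):

```r
check_densification <- function(data) {
  has_sparse <- any(vapply(data, sparsevctrs::is_sparse_vector, logical(1)))
  opted_in <- isTRUE(getOption("sparsevctrs.allow_densify", default = FALSE))
  if (has_sparse && !opted_in) {
    cli::cli_abort(
      c(
        "`data` contains sparse columns, but this interface will coerce them to dense.",
        "i" = "Use `fit_xy()` to preserve sparsity.",
        "i" = "Or set `options(sparsevctrs.allow_densify = TRUE)` to opt into densification."
      )
    )
  }
}
```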
Related to #239; just a place to keep notes on the thought process for supporting sparse tibbles with formula preprocessors. In #245, we see:
Created on 2024-09-13 with reprex v2.1.1
In the formula preprocessor `fit()` evaluation, the data type conversions don't actually take a ton of time. It's just that, with `add_formula()`, the `x` passed to `parsnip::xgb_train()` is a dense base matrix, whereas it's a `dgCMatrix` when passed with `add_recipe()`, and xgboost is much slower when data that ought to be sparse is dense.
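A sketch of how that difference could be observed, again assuming a hypothetical sparse tibble `sparse_df` with outcome `y`:

```r
spec <- parsnip::boost_tree(mode = "regression") |>
  parsnip::set_engine("xgboost")

# Formula path: the x reaching parsnip::xgb_train() is a dense base matrix.
wf_formula <- workflows::workflow(y ~ ., spec)

# Recipe path: the x reaching parsnip::xgb_train() stays a dgCMatrix.
rec <- recipes::recipe(y ~ ., data = sparse_df)
wf_recipe <- workflows::workflow(rec, spec)

bench::mark(
  formula = parsnip::fit(wf_formula, data = sparse_df),
  recipe  = parsnip::fit(wf_recipe, data = sparse_df),
  check = FALSE
)
```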