tidymodels / workflows

Modeling Workflows
https://workflows.tidymodels.org/

address `fit()` slowdown with sparse tibble and formula preprocessor #246

Open simonpcouch opened 2 months ago

simonpcouch commented 2 months ago

Related to #239—just a place to keep notes on the thought process for supporting sparse tibbles with formula preprocessors. In #245, we see:

library(tidymodels)

sparse_hotel_rates <- function() {
  # 99.2% sparse
  hotel_rates <- modeldata::hotel_rates

  prefix_colnames <- function(x, prefix) {
    colnames(x) <- paste(colnames(x), prefix, sep = "_")
    x
  }

  dummies_country <- hardhat::fct_encode_one_hot(hotel_rates$country)
  dummies_company <- hardhat::fct_encode_one_hot(hotel_rates$company)
  dummies_agent <- hardhat::fct_encode_one_hot(hotel_rates$agent)

  res <- cbind(
    hotel_rates["avg_price_per_room"],
    prefix_colnames(dummies_country, "country"),
    prefix_colnames(dummies_company, "company"),
    prefix_colnames(dummies_agent, "agent")
  )

  res <- as.matrix(res)
  Matrix::Matrix(res, sparse = TRUE)
}

hotel_data <- sparse_hotel_rates()
hotel_data <- sparsevctrs::coerce_to_sparse_tibble(hotel_data)

spec <- boost_tree() %>%
  set_mode("regression") %>%
  set_engine("xgboost")

form <- avg_price_per_room ~ .
rec <- recipe(form, data = hotel_data)

wflow <- workflow(spec = spec)

system.time({fit(wflow %>% add_recipe(rec), hotel_data)})
#>    user  system elapsed 
#>   0.255   0.014   0.269
system.time({fit(wflow %>% add_formula(form), hotel_data)})
#>    user  system elapsed 
#>   3.847   0.039   3.905

Created on 2024-09-13 with reprex v2.1.1

In the formula preprocessor fit() evaluation, the data type conversions don't actually take a ton of time:

[Screenshot: profiling output for the formula-preprocessor fit(), 2024-09-13]

It's just that, with add_formula(), the x passed to parsnip::xgb_train() is a dense matrix, whereas with add_recipe() it's a dgCMatrix, and xgboost is much slower when data that ought to be sparse is handed to it dense.
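To see this from the workflow objects themselves, one option (a sketch continuing from the reprex above, assuming workflows' extract_mold() extractor and the $predictors element it returns) is to inspect what each preprocessor produces before parsnip ever sees it:

# fit once with each preprocessor
fit_rec  <- fit(wflow %>% add_recipe(rec), hotel_data)
fit_form <- fit(wflow %>% add_formula(form), hotel_data)

# the recipe path can hand a sparse representation to parsnip; the formula
# path goes through model.matrix() and comes out dense
class(extract_mold(fit_rec)$predictors)
class(extract_mold(fit_form)$predictors)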

simonpcouch commented 2 months ago

Ultimately, recipes::juice() and hardhat::recompose() are what handle running recipe preprocessors via fit.action_recipe() -> mold() -> run_mold() -> mold_recipe_default_process(), and they can avoid model.matrix() while doing so because recipes don't allow for tricky formulae. Carving out a path in fit.action_formula() to detect sparse tibbles and then go through a different preprocessor code path (recipe or variables) would only be possible if the formula passed recipes:::inline_check().
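For context on the "tricky formulae" constraint, here's a toy illustration (not from the issue; the exact error wording varies across recipes versions) of the check that makes the recipes shortcut possible:

library(recipes)

# plain symbols on the right-hand side are fine
recipe(mpg ~ cyl + disp, data = mtcars)

# in-line functions are rejected, which is what lets the recipe path skip
# model.matrix() entirely
try(recipe(mpg ~ log(disp), data = mtcars))
# errors: recipes don't allow in-line functions in the formula; use steps instead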

simonpcouch commented 2 months ago

In order of most to least preferred, my thoughts on approaches we could take for fit()ting workflows with formula preprocessors and sparse tibbles:

1) Maybe over-opinionated: Warn on add_formula() with a sparse tibble, recommending that people use add_recipe() instead to preserve sparsity (see the sketch after this list). This is a relatively easy switch in most cases; it's least convenient when people need to translate some tricky formulae into recipe steps, which isn't too bad.
2) Inconsistent, somewhat hacky: Carve out a path in fit.action_formula() to detect sparse tibbles and then go through a recipes code path if the formula passes recipes:::inline_check(), otherwise reverting to 1) or 3) or erroring.
3) High-effort and bug-prone: Re-write a version of model.matrix() that can handle all the same kinds of tricky formulae, but that can do so with sparse vectors.
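A rough sketch of what 1) could look like inside fit.action_formula() -- the helper name is made up, and I'm assuming a per-column predicate like sparsevctrs::is_sparse_vector() for the detection:

# hypothetical helper, not actual workflows code
check_sparse_data_formula <- function(data, call = rlang::caller_env()) {
  has_sparse <- any(vapply(data, sparsevctrs::is_sparse_vector, logical(1)))
  if (has_sparse) {
    cli::cli_warn(
      c(
        "!" = "The formula preprocessor will materialize sparse columns as dense.",
        "i" = "Use {.fn add_recipe} or {.fn add_variables} to preserve sparsity."
      ),
      call = call
    )
  }
}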

EmilHvitfeldt commented 2 months ago

Quick note: if you install the latest dev version of {sparsevctrs} and run withr::local_options("sparsevctrs.verbose_materialize" = 3), you now get an error if a sparse vector is forcibly materialized by something other than {sparsevctrs} itself.
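Applied to the reprex above, that should surface exactly where the formula path forces materialization (a sketch; output not shown):

# requires the dev version of {sparsevctrs} described above
withr::local_options("sparsevctrs.verbose_materialize" = 3)
fit(wflow %>% add_formula(form), hotel_data)
# errors at the point where a sparse column is forcibly materialized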

simonpcouch commented 2 months ago

Notes from chatting with Emil on this:

parsnip experiences the same issue with its fit() interface: the sparse data is passed to model.matrix(), which is somewhat slow itself but, even more importantly, hands many engines a dense matrix when the data would have been better represented as sparse, introducing a substantial slowdown. So, we need to address the analogous problem in parsnip.

Approach 1) (with 3) in the long term) is adaptable to both situations and is relatively future-proof. We would raise a condition if users pass sparse data to fit.model_spec() (as in, with a formula) to tell them to transition to fit_xy() if they'd like to make use of sparsity.
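Concretely, the condition would point users from the formula interface to something like this (a sketch reusing the objects from the reprex above; fit_xy() can keep the sparse representation, while fit() with a formula goes through model.matrix() and densifies):

outcome    <- "avg_price_per_room"
predictors <- setdiff(names(hotel_data), outcome)

# preserves sparsity end to end
spec %>% fit_xy(x = hotel_data[predictors], y = hotel_data[[outcome]])

# densifies via model.matrix()
spec %>% fit(avg_price_per_room ~ ., data = hotel_data)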

Users will only pass sparse data to fit.workflow() or fit.model_spec() if they do so intentionally; this is "opt-in," and tidymodels doesn't do it by itself. If recipes, at some point in the future, introduces default behavior to coerce to sparsity in smart situations, those users are already on the happy path: they're using a recipe preprocessor with a workflow, and workflows always uses parsnip's fit_xy() interface (unless they supply a model formula, in which case there's a tricky formula and model.matrix() will be needed anyway!).

Emil and I chatted and are on the same page here!


I originally proposed in 1) that we would warn in that case. I think a more tidyverse-style approach would be to error and then reference an option (or control_*() argument) that users can set to say "I know I opted in to use sparse data but am choosing an interface that will coerce it to dense and see a slowdown. Let me do it!"
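Something like the following shape, where the option name and the sparsity check are hypothetical placeholders:

# hypothetical: inside the formula preprocessor path
allow_dense <- isTRUE(getOption("workflows.densify_sparse_data", FALSE))

if (has_sparse_columns(data) && !allow_dense) {
  cli::cli_abort(
    c(
      "x" = "The formula preprocessor would coerce sparse data to dense.",
      "i" = "Set {.code options(workflows.densify_sparse_data = TRUE)} to proceed anyway, or use {.fn add_recipe} to preserve sparsity."
    )
  )
}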