tidymodels / bonsai

parsnip wrappers for tree-based models
https://bonsai.tidymodels.org

Feature idea - Linking hyperparameters during CV #49

Open dfsnow opened 2 years ago

dfsnow commented 2 years ago

Problem

Within LightGBM, num_leaves is capped at 2 ^ max_depth. For example, if num_leaves is set to 1000 and max_depth is set to 5, then LightGBM will likely end up creating a full-depth tree with 32 (2 ^ 5) leaves per iteration.
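The cap works out numerically like this (a plain base-R illustration of the arithmetic, not bonsai/parsnip or LightGBM code):

```r
# With max_depth = 5, a binary tree has at most 2^5 = 32 leaves, so a
# num_leaves setting of 1000 is effectively truncated to 32.
max_depth  <- 5
num_leaves <- 1000

effective_leaves <- min(num_leaves, 2^max_depth)
effective_leaves
#> [1] 32
```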

{bonsai} / {parsnip} have no knowledge of the relationship between these parameters. As a result, during cross-validation, Bayesian optimization and other CV search methods will spend a significant amount of time exploring meaningless hyperparameter space where num_leaves > 2 ^ max_depth. This results in longer CV times, especially for large models with many parameters.

Idea

One potential solution is to explicitly link num_leaves and max_depth specifically for the LightGBM model spec. I implemented this link in my treesnip fork by essentially adding two engine arguments:

  1. link_max_depth - Boolean. When FALSE, max_depth is equal to whatever is passed via the engine/model arg. When TRUE, max_depth is equal to floor(log2(num_leaves)) + link_max_depth_add.
  2. link_max_depth_add - Integer. Value added to max_depth. For example, if link_max_depth is TRUE, num_leaves is 1000, and link_max_depth_add is 2, then max_depth = floor(log2(1000)) + 2, or 11.
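The linking rule in the two arguments above can be sketched as a small helper (a hypothetical function whose name and arguments mirror the proposed engine arguments; this is not the actual treesnip-fork implementation):

```r
# Hypothetical sketch of the proposed link: derive max_depth from
# num_leaves, plus an optional additive slack term.
linked_max_depth <- function(num_leaves, link_max_depth_add = 0L) {
  floor(log2(num_leaves)) + link_max_depth_add
}

# The worked example from the issue: num_leaves = 1000, add = 2
linked_max_depth(1000, link_max_depth_add = 2)
#> [1] 11
```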

This would improve cross-validation times by restricting the hyperparameter space that needs to be explored, while leaving the default options untouched. Ideally, it could even be generalized (within {parsnip}) to other model types that have intrinsically linked hyperparameters. However, I'm not sure if this fits with the tidymodels way of doing things. If it's totally out-of-scope, then feel free to close this issue.

simonpcouch commented 2 years ago

Thanks for the issue! This seems worth looking into, and it may have applications beyond this extension package—will chat about this with @topepo and get back to you sooner rather than later. :)

topepo commented 2 years ago

I think that the best way to handle this is to make methods for the grid_*() functions for workflows and model specifications. That's really the only time that we could intercept the parameters and add a constraint (for a specific model).

We'll discuss this.

dfsnow commented 2 years ago

Appreciate the attention on this issue! Let me know if there's any way I can assist (debugging, testing, PR, etc.).

I'll add that the grid_*() functions don't really have this issue, since you can manually filter or create a hyperparameter grid with such constraints built-in prior to CV. IMO this issue is more applicable to tune_bayes(), since you can't filter/intercept the hyperparameters chosen by the sub-model.
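The manual filtering mentioned above can be done with a few lines of base R before passing the grid to the tuning function (a minimal sketch; the parameter names and candidate values are just the LightGBM parameters discussed in this thread, not output from dials):

```r
# Build a full candidate grid, then drop combinations that violate
# the num_leaves <= 2^max_depth constraint before running CV.
grid <- expand.grid(
  max_depth  = c(3, 5, 7),
  num_leaves = c(8, 32, 128, 512)
)

grid <- grid[grid$num_leaves <= 2^grid$max_depth, ]
nrow(grid)
#> [1] 6
```

With a precomputed grid like this, `tune_grid()` never visits the meaningless region; the point above is that `tune_bayes()` offers no equivalent hook, since its candidates are proposed internally.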

That said, general methods for doing this linking/filtering with grid_*() functions would still be incredibly useful.