tidymodels / bonsai

parsnip wrappers for tree-based models
https://bonsai.tidymodels.org

Method for passing linear_tree or other params that should be routed to lgb.Dataset rather than lgb.train #77

Closed joranE closed 5 months ago

joranE commented 7 months ago

The current engine implementation for lightgbm sends additional engine arguments to the function lightgbm::lgb.train. However, one of the dataset parameters is really a model option, linear_tree, which fits linear regression models in the leaf nodes rather than constant models in each leaf.

Currently, the tidymodels engine argument mechanics don't allow for specifying where additional engine arguments should be directed. One could simply add a single fixed argument for linear_tree as I did in this crude example for experimentation, but that's not really a long term solution, if there are other lightgbm arguments that people would like to access in a similar manner.

Would it be possible to more fully expose engine arguments for lightgbm in a way that distinguishes between those intended for lgb.train() vs. lgb.Dataset(), or is an argument by argument method the best we can do?

jameslamb commented 7 months ago

I'm not a maintainer here so I can't answer all of these questions, but I do work on LightGBM, so I wanted to note: any Dataset-relevant parameters passed through the keyword argument params to lgb.train() are also set on the Dataset that's passed into it.

https://github.com/microsoft/LightGBM/blob/28536a0a5d0c46598fbc930a63cec72399a969e4/R-package/R/lgb.train.R#L157-L158

Here's an example of that pattern from LightGBM's R tests:

https://github.com/microsoft/LightGBM/blob/28536a0a5d0c46598fbc930a63cec72399a969e4/R-package/tests/testthat/test_basic.R#L2519-L2534
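In case it's useful without clicking through, the shape of that test is roughly the following (a minimal sketch on mtcars, not the linked test verbatim):

``` r
library(lightgbm)

X <- as.matrix(mtcars[, -1L])
y <- mtcars$mpg

# an unconstructed Dataset with no params of its own
dtrain <- lgb.Dataset(data = X, label = y)

# Dataset-relevant params (here, max_bin) passed through lgb.train()'s
# `params` get applied to the Dataset when it is constructed
bst <- lgb.train(
    data = dtrain
    , params = list(objective = "regression", max_bin = 15L)
    , nrounds = 5L
    , verbose = -1L
)

# the Dataset now carries the max_bin value it was constructed with
dtrain$get_params()
```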

simonpcouch commented 7 months ago

Ah, thanks for the issue. Some notes-to-self as I think through this...

train_lightgbm() currently supports passing any of lgb.train()'s params arguments through ... and sorting them internally. My temptation here, as long as we can differentiate between lgb.train() params and lgb.Dataset() params, is to have lgb.Dataset() params passed through ... and sorted accordingly as well.

Some arguments appear in the signatures of both lgb.Dataset() and lgb.train(). The shared arguments other than params are data, colnames, and categorical_feature; those are created inside of train_lightgbm() and should be protected.
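That sorting idea could look something like this minimal sketch (the parameter-name vectors here are made-up placeholders, not lightgbm's real parameter lists):

``` r
# split user-supplied dots into lgb.train() params vs. lgb.Dataset() params,
# given vectors of known parameter names for each destination
sort_engine_args <- function(dots, train_param_names, dataset_param_names) {
  list(
    train   = dots[names(dots) %in% train_param_names],
    dataset = dots[names(dots) %in% dataset_param_names]
  )
}

dots <- list(learning_rate = 0.1, linear_tree = TRUE, min_data_in_bin = 5L)
sorted <- sort_engine_args(
  dots,
  train_param_names   = c("learning_rate", "lambda_l2"),
  dataset_param_names = c("linear_tree", "min_data_in_bin", "max_bin")
)
names(sorted$dataset)
#> "linear_tree" "min_data_in_bin"
```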

simonpcouch commented 7 months ago

Ah, thanks @jameslamb! You beat me to it!

Will poke at those links.

simonpcouch commented 7 months ago
Quickly making sure I'm interpreting "should also be set" correctly; feel free to ignore.

``` r
library(lightgbm)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following object is masked from 'package:lightgbm':
#>
#>     slice
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union

data("penguins", package = "modeldata")
penguins <- na.omit(penguins)

peng <- penguins %>%
  mutate(across(where(is.character), ~as.factor(.x))) %>%
  mutate(across(where(is.factor), ~as.integer(.x) - 1))

peng_y <- peng$bill_length_mm
peng_m <- peng %>% select(-bill_length_mm) %>% as.matrix()

peng_linear_tree_no <- lgb.Dataset(
  data = peng_m,
  label = peng_y,
  params = list(feature_pre_filter = FALSE, linear_tree = TRUE),
  categorical_feature = c(1L, 2L, 6L)
)

peng_linear_tree_yes <- lgb.Dataset(
  data = peng_m,
  label = peng_y,
  params = list(feature_pre_filter = FALSE, linear_tree = FALSE),
  categorical_feature = c(1L, 2L, 6L)
)

# pass linear_tree to lgb.Dataset
lgbm_fit_linear_tree_dataset <- lightgbm::lgb.train(
  data = peng_linear_tree_yes,
  params = list(objective = "regression"),
  verbose = -1
)

# pass linear_tree to lgb.train
lgbm_fit_linear_tree_train <- lightgbm::lgb.train(
  data = peng_linear_tree_no,
  params = list(objective = "regression", linear_tree = TRUE),
  verbose = -1
)

# pass linear_tree to neither (default is false)
lgbm_fit_linear_tree_none <- lightgbm::lgb.train(
  data = peng_linear_tree_no,
  params = list(objective = "regression"),
  verbose = -1
)

preds_dataset <- predict(lgbm_fit_linear_tree_dataset, peng_m)
preds_train <- predict(lgbm_fit_linear_tree_train, peng_m)
preds_none <- predict(lgbm_fit_linear_tree_none, peng_m)

all.equal(preds_dataset, preds_train)
#> [1] "Mean relative difference: 0.009559404"
all.equal(preds_train, preds_none)
#> [1] TRUE
```

Created on 2024-04-09 with [reprex v2.1.0](https://reprex.tidyverse.org)

Okay, given https://github.com/tidymodels/bonsai/issues/77#issuecomment-2045897282, this should definitely be doable. @jameslamb, are lightgbm:::.PARAMETER_ALIASES() and lightgbm:::.DATASET_PARAMETERS() the best sources of information for possible named arguments to params for lgb.train() and lgb.Dataset(), respectively? We can inline those results in bonsai.
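For anyone following along, those helpers can be inspected directly; they're internal API and may change between lightgbm releases, and the return shapes suggested in the comments below are assumptions rather than documented behavior:

``` r
library(lightgbm)

# assumed: the names of parameters that affect Dataset construction
dataset_params <- lightgbm:::.DATASET_PARAMETERS()
head(dataset_params)

# assumed: a mapping from parameter names to their accepted aliases
aliases <- lightgbm:::.PARAMETER_ALIASES()
head(names(aliases))
```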

jameslamb commented 6 months ago

Sorry for the delay, have been traveling for the last week.

re: lightgbm:::.PARAMETER_ALIASES() and lightgbm:::.DATASET_PARAMETERS() being the best sources of information for possible named arguments to params for lgb.train() and lgb.Dataset(), respectively?

That's a complicated question to answer. Let's back up.

Why does {bonsai} need to know the difference between parameters that affect LightGBM Dataset construction and those that affect training with lgb.train()? I don't understand what "the tidymodels engine argument mechanics don't allow for specifying where additional engine arguments should be directed" means. Apologies if this is an annoying question... I don't have any experience with {bonsai} or {tidymodels} other than trying to help when LightGBM questions come up here.

simonpcouch commented 6 months ago

Sorry for the delay, have been traveling for the last week.

No worries at all! Hope the time away brought you what you needed.☀️

Apologize if this is an annoying question... I don't have any experience with {bonsai} or {tidymodels} other than trying to help when LightGBM questions come up here.

Totally makes sense. All good.

Why does {bonsai} need to know the difference between parameters that affect LightGBM Dataset construction and those that affect training with lgb.train()?

It might not. My previous mental model of the params arguments to lgb.train() and lgb.Dataset() was that they should take only the arguments listed under the "Parameters" H1 and the "Dataset Parameters" H2 sections of the linked docs, respectively. Sounds like the reality is that any argument that can go to lgb.train()'s params should go into lgb.Dataset()'s params argument as well. Is it also true that any element of lgb.Dataset()'s params can/should go into lgb.train()'s params argument? In that case, there's no need for bonsai to differentiate.

The following might be more detail than you need, but in case you'd appreciate more context:

The issue here is that bonsai constructs the calls to lgb.Dataset() and lgb.train() for the user. Some of the arguments to tidymodels' functions "overlap" with arguments to lightgbm's functions, albeit via different names and input types; those arguments may only be passed using the tidymodels interface, which is consistent across "engines." (Engines are, loosely, just R packages, and could be lightgbm, xgboost, mboost, spark, etc. in this case.) Engine arguments that don't overlap with tidymodels arguments can be passed exactly as they'd be passed to the engine (i.e. most lightgbm arguments have the same names and values in both the tidymodels and lightgbm interfaces). The tricky bit in this case is that some elements of the params argument overlap with tidymodels arguments and some don't, so we've opted to allow tidymodels users to pass elements of params (previously thought to be passed only to lgb.train()) directly through the ellipses ..., and then tidymodels takes care of merging the tidymodels arguments with the lightgbm arguments into one params argument.

So, if it's possible to supply all lgb.train(param) arguments to lgb.Dataset(param) and vice versa, then a response to my most recent comment is unneeded.

jameslamb commented 6 months ago

Thanks for that excellent explanation! I think I can help explain.

Short Answer

As long as {bonsai} never tries to take the same lightgbm Dataset object and re-use it across multiple calls to lgb.train() with different params, it can safely pass all params through both lgb.Dataset() and lgb.train().

If it does want to create a Dataset once and use it across multiple lgb.train() calls with different params, then it might want to consider passing free_raw_data = FALSE to lgb.Dataset().

Details

Sounds like the reality is that any argument that can go to lgb.train()'s params should go into lgb.Dataset()'s params argument as well.

"can", not "should". If you pass a non-Dataset parameter like learning_rate to lgb.Dataset(), {lightgbm} will just ignore it.

library(lightgbm)
data(iris)

X_mat <- as.matrix(iris[, -5L])
label <- as.numeric(iris$Species) - 1L

# passing a non-dataset param to Dataset isn't a problem...
dtrain <- lightgbm::lgb.Dataset(
    data = X_mat
    , label = label
    , params = list(
        learning_rate = 0.1
        , min_data_in_bin = 5L
    )
)
dtrain$construct()

# ... LightGBM will just ignore it
#     (notice that learning_rate is filtered out)
dtrain$get_params()
# $min_data_in_bin
# [1] 5

Similarly, if you pass a not-yet-constructed Dataset into lgb.train(), along with a mix of params that control the Dataset and the boosting process, LightGBM will split them up for you:

# important: note that I'm calling lgb.Dataset() here but not the $construct() method
dtrain <- lightgbm::lgb.Dataset(
    data = X_mat
    , label = label
)

# pass the Dataset into lgb.train(), along with a mix of Dataset and non-Dataset 'params'
bst <- lightgbm::lgb.train(
    data = dtrain
    , params = list(
        learning_rate = 0.1
        , min_data_in_bin = 5L
    )
    , nrounds = 5L
    , obj = "regression"  # (a binary objective would error here; this label has three classes)
)

# the Dataset has been constructed and only the Dataset-relevant params
# are stored on it
dtrain$get_params()
# $min_data_in_bin
# [1] 5

So {bonsai} can just pass everything collected from ... through to the params keyword argument in both lgb.Dataset() and lgb.train(), and rely on {lightgbm} to sort them out. I think that's a better division of responsibilities... it should be {lightgbm}'s job to keep track of which parameters affect the Dataset and which affect the boosting process. And I'd prefer to not export lists like .DATASET_PARAMETERS() if it can be avoided.
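As a sketch of that division of responsibilities (train_sketch() below is a hypothetical stand-in, not bonsai's actual train_lightgbm()):

``` r
library(lightgbm)

# hypothetical wrapper: collect everything from `...` into one params list
# and hand the same list to both lgb.Dataset() and lgb.train(), relying on
# {lightgbm} to route each parameter to the right place
train_sketch <- function(x, y, nrounds = 10L, ...) {
    params <- list(...)
    dtrain <- lightgbm::lgb.Dataset(data = x, label = y, params = params)
    lightgbm::lgb.train(
        data = dtrain
        , params = params
        , nrounds = nrounds
        , verbose = -1L
    )
}

X <- as.matrix(mtcars[, -1L])
fit <- train_sketch(
    X
    , mtcars$mpg
    , objective = "regression"  # boosting parameter
    , linear_tree = TRUE        # Dataset parameter
    , min_data_in_leaf = 5L     # boosting parameter
)
```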

if it's possible to supply all lgb.train(param) arguments to lgb.Dataset(param) and vice versa

It is... but that can fail in cases where you call lgb.Dataset() once and then want to use the resulting Dataset object across multiple calls to lgb.train() with different parameters.

This is why above I was mentioning "not yet constructed" Datasets being passed into lgb.train(). Some of the parameters can't be changed once a Dataset has been constructed. I'll give you an example.

# bin each feature into just 3 bins
dtrain <- lightgbm::lgb.Dataset(
    data = X_mat
    , label = label
    , params = list(
        max_bin = 3L
    )
)
dtrain$construct()

# pass that to lgb.train(), but oops a different value of max_bin made it into params
bst <- lightgbm::lgb.train(
    data = dtrain
    , params = list(
        max_bin = 255L
    )
)

# [LightGBM] [Fatal] Cannot change max_bin after constructed Dataset handle.
# Error in doTryCatch(return(expr), name, parentenv, handler) : 
#  Cannot change max_bin after constructed Dataset handle.

This error message is saying "LightGBM can't do what you're asking. You have this Dataset where you've already grouped the raw data into 3 histogram bins per feature, now you're saying you want 255 histogram bins per feature, but the raw data isn't available any more".

If you want to always be able to change the parameters, even after construction, you can pass free_raw_data = FALSE to lgb.Dataset(). That tells {lightgbm} to hold a copy of the raw data (in this case, an R matrix) in memory as an attribute on the Dataset R6 object. That's expensive memory-wise, but in exchange you can change parameters like this via calls to lgb.train() alone.

# bin each feature into just 3 bins
dtrain <- lightgbm::lgb.Dataset(
    data = X_mat
    , label = label
    , params = list(
        max_bin = 3L
    )
    , free_raw_data = FALSE
)
dtrain$construct()

# pass that to lgb.train(), but oops a different value of max_bin made it into params
bst <- lightgbm::lgb.train(
    data = dtrain
    , params = list(
        max_bin = 255L
    )
)

# no problem, inside lgb.train(), {lightgbm} changed the Dataset's parameters and re-constructed it
dtrain$get_params()
# $max_bin
# [1] 255