I'm not a maintainer here so can't answer all these questions, but I do work on LightGBM, so I wanted to note: any Dataset-relevant parameters passed through the keyword argument params to lgb.train() should also be set on the Dataset that's passed into it. Here's an example of that pattern from LightGBM's R tests:
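(The link to that test didn't survive here; the following is a rough, self-contained sketch of the same pattern, illustrative rather than the actual test code:)
library(lightgbm)
data(mtcars)
X <- as.matrix(mtcars[, -1L])
y <- mtcars$mpg
# Dataset-relevant parameters (here, max_bin) go to BOTH lgb.Dataset()
# and lgb.train(); non-Dataset parameters riding along are harmless
shared_params <- list(max_bin = 17L, learning_rate = 0.05)
dtrain <- lightgbm::lgb.Dataset(
  data = X
  , label = y
  , params = shared_params
)
bst <- lightgbm::lgb.train(
  params = shared_params
  , data = dtrain
  , nrounds = 5L
  , obj = "regression"
)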
Ah, thanks for the issue. Some notes-to-self as I think through this...
train_lightgbm() currently supports passing any lgb.train(params) arguments through ... and sorting them internally. My temptation here, as long as we can differentiate between lgb.train(params) and lgb.Dataset(params) arguments, is to also have lgb.Dataset(params) arguments passed through ... and sorted accordingly.
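(Purely to illustrate the idea, a hypothetical sorting step might look roughly like this; the names below are made up, not train_lightgbm()'s actual internals:)
# hypothetical: split engine args collected from `...` into Dataset-specific
# params and everything else, using a known list of Dataset parameter names
dataset_arg_names <- c("max_bin", "min_data_in_bin", "linear_tree") # illustrative, not exhaustive
dots <- list(max_bin = 100L, learning_rate = 0.1)
dataset_params <- dots[names(dots) %in% dataset_arg_names]
train_params <- dots[!names(dots) %in% dataset_arg_names]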
Some other arguments appear in the signatures of both lgb.Dataset() and lgb.train(). The remaining shared arguments (other than params) that appear in the signatures of lgb.Dataset() and lgb.train() are data, colnames, and categorical_feature; those are created inside of train_lightgbm() and should be protected.
Ah, thanks @jameslamb! You beat me to it!
Will poke at those links.
Okay, given https://github.com/tidymodels/bonsai/issues/77#issuecomment-2045897282, this should definitely be doable. @jameslamb, are lightgbm:::.PARAMETER_ALIASES() and lightgbm:::.DATASET_PARAMETERS() the best sources of information for possible named arguments to params for lgb.train() and lgb.Dataset(), respectively? We can inline those results in bonsai.
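(For reference, those helpers can be inspected directly. They're internal and unexported, so their exact return shapes may vary across {lightgbm} versions:)
# peek at the unexported helpers; subject to change between releases
str(head(lightgbm:::.PARAMETER_ALIASES(), 3L))
str(lightgbm:::.DATASET_PARAMETERS())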
Sorry for the delay, have been traveling for the last week.
re: are lightgbm:::.PARAMETER_ALIASES() and lightgbm:::.DATASET_PARAMETERS() the best sources of information for possible named arguments to params for lgb.train() and lgb.Dataset(), respectively?
That's a complicated question to answer. Let's back up.
Why does {bonsai} need to know the difference between parameters that affect LightGBM Dataset construction and those that affect training with lgb.train()? I don't understand what "the tidymodels engine argument mechanics don't allow for specifying where additional engine arguments should be directed" means. Apologies if this is an annoying question... I don't have any experience with {bonsai} or {tidymodels} other than trying to help when LightGBM questions come up here.
Sorry for the delay, have been traveling for the last week.
No worries at all! Hope the time away brought you what you needed.☀️
Apologize if this is an annoying question... I don't have any experience with {bonsai} or {tidymodels} other than trying to help when LightGBM questions come up here.
Totally makes sense. All good.
Why does {bonsai} need to know the difference between parameters that affect LightGBM Dataset construction and those that affect training with lgb.train()?
It might not. My previous mental model of the params arguments to lgb.train() and lgb.Dataset() was that they should take only the arguments listed under the "Parameters" H1 and "Dataset Parameters" H2 sections linked in their docs, respectively. Sounds like the reality is that any argument that can go to lgb.train()'s params should go into lgb.Dataset()'s params argument as well. Is it also true that any lgb.Dataset() params argument can/should go into lgb.train()'s params argument? In that case, there's no need for bonsai to differentiate.
The following might be more detail than you need, but in case you'd appreciate more context:
The issue here is that bonsai takes care of constructing the calls to lgb.Dataset() and lgb.train() for the user. Some of the arguments to tidymodels' functions "overlap" with arguments to lightgbm's functions, albeit via different names and input types; those arguments may only be passed using the tidymodels interface, which is consistent across "engines." (Engines are, loosely, just R packages, and could be lightgbm, xgboost, mboost, spark, etc. in this case.) Arguments from engines that don't overlap with arguments from tidymodels can be passed exactly as they'd be passed to the engine (i.e. most lightgbm arguments have the same names and values in both the tidymodels and lightgbm interfaces).
The tricky bit in this case is that some arguments in the params argument overlap and some don't, so we've opted to allow tidymodels users to pass elements of the params argument (previously thought only to be passed to lgb.train()) directly through the ellipsis ..., and tidymodels then takes care of merging the tidymodels arguments with the lightgbm arguments into one params argument, roughly as sketched below.
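(To make that merging concrete, here's a loose sketch with made-up names; this is not bonsai's actual code:)
# hypothetical: args mapped from the tidymodels interface...
tidymodels_params <- list(learning_rate = 0.1, num_iterations = 50L)
# ...and raw engine args the user passed through `...`
engine_params <- list(min_data_in_bin = 5L, linear_tree = TRUE)
# merge into the single `params` list handed off to {lightgbm};
# modifyList() lets the engine-supplied values win on name conflicts
params <- utils::modifyList(tidymodels_params, engine_params)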
So, if it's possible to supply all lgb.train() params arguments to lgb.Dataset() and vice versa, then a response to my most recent comment is unneeded.
Thanks for that excellent explanation! I think I can help explain.
As long as {bonsai} never tries to take the same lightgbm::Dataset object and re-use it across multiple calls to lgb.train() with different params, it can safely pass all params through both lgb.Dataset() and lgb.train().
If it does want to create a Dataset one time and use it across multiple lgb.train() calls with different params, then it might want to consider passing free_raw_data = FALSE to lgb.Dataset().
Sounds like the reality is that any argument that can go to lgb.train()'s params should go into lgb.Dataset()'s params argument as well.
"can", not "should". If you pass a non-Dataset parameter like learning_rate
to lgb.Dataset()
, {lightgbm}
will just ignore it.
library(lightgbm)
data(iris)
X_mat <- as.matrix(iris[, -5L])
# use a binary 0/1 label so obj = "binary" works in the training example below
label <- as.numeric(iris$Species == "setosa")
# passing a non-dataset param to Dataset isn't a problem...
dtrain <- lightgbm::lgb.Dataset(
  data = X_mat
  , label = label
  , params = list(
    learning_rate = 0.1
    , min_data_in_bin = 5L
  )
)
dtrain$construct()
# ... LightGBM will just ignore it
# (notice that learning_rate is filtered out)
dtrain$get_params()
# $min_data_in_bin
# [1] 5
Similarly, if you pass a not-yet-constructed Dataset into lgb.train(), along with a mix of params that control the Dataset and the boosting process, LightGBM will split them up for you.
# important: note that I'm calling lgb.Dataset() here but not the $construct() method
dtrain <- lightgbm::lgb.Dataset(
  data = X_mat
  , label = label
)
# pass the Dataset into lgb.train(), along with a mix of Dataset and non-Dataset 'params'
bst <- lightgbm::lgb.train(
  data = dtrain
  , params = list(
    learning_rate = 0.1
    , min_data_in_bin = 5L
  )
  , nrounds = 5L
  , obj = "binary"
)
# the Dataset has been constructed and only the Dataset-relevant params
# are stored on it
dtrain$get_params()
# $min_data_in_bin
# [1] 5
So {bonsai} can just pass everything collected from ... through to the params keyword argument in both lgb.Dataset() and lgb.train(), and rely on {lightgbm} to sort them out. I think that's a better division of responsibilities... it should be {lightgbm}'s job to keep track of which parameters affect the Dataset or the boosting process. And I'd prefer not to export those lists like .DATASET_PARAMS if it can be avoided.
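(A minimal sketch of that flow, reusing X_mat and label from the examples above:)
# one combined list, passed verbatim to both functions; {lightgbm}
# keeps whichever entries are relevant in each place
all_params <- list(learning_rate = 0.1, min_data_in_bin = 5L)
dtrain <- lightgbm::lgb.Dataset(
  data = X_mat
  , label = label
  , params = all_params
)
bst <- lightgbm::lgb.train(
  params = all_params
  , data = dtrain
  , nrounds = 5L
  , obj = "binary"
)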
if it's possible to supply all lgb.train(param) arguments to lgb.Dataset(param) and vice versa
It is... but that can fail in cases where you call lgb.Dataset() once and then want to use the resulting Dataset object across multiple calls to lgb.train() with different parameters. This is why I mentioned "not yet constructed" Datasets being passed into lgb.train() above. Some of the parameters can't be changed once a Dataset has been constructed. I'll give you an example.
# bin each feature into just 3 bins
dtrain <- lightgbm::lgb.Dataset(
  data = X_mat
  , label = label
  , params = list(
    max_bin = 3L
  )
)
dtrain$construct()
# pass that to lgb.train(), but oops a different value of max_bin made it into params
bst <- lightgbm::lgb.train(
  data = dtrain
  , params = list(
    max_bin = 255L
  )
)
# [LightGBM] [Fatal] Cannot change max_bin after constructed Dataset handle.
# Error in doTryCatch(return(expr), name, parentenv, handler) :
# Cannot change max_bin after constructed Dataset handle.
This error message is saying "LightGBM can't do what you're asking. You have this Dataset where you've already grouped the raw data into 3 histogram bins per feature, now you're saying you want 255 histogram bins per feature, but the raw data isn't available any more".
If you want to always be able to change the parameters, even after construction, you can pass free_raw_data = FALSE to lgb.Dataset(). That tells {lightgbm} to hold a copy of the raw data (in this case, an R matrix) in memory, as an attribute on the Dataset R6 object. That's expensive memory-wise, but in exchange you're able to change the parameters like this via just calls to lgb.train().
# bin each feature into just 3 bins
dtrain <- lightgbm::lgb.Dataset(
  data = X_mat
  , label = label
  , params = list(
    max_bin = 3L
  )
  , free_raw_data = FALSE
)
dtrain$construct()
# pass that to lgb.train(), but oops a different value of max_bin made it into params
bst <- lightgbm::lgb.train(
  data = dtrain
  , params = list(
    max_bin = 255L
  )
)
# no problem, inside lgb.train(), {lightgbm} changed the Dataset's parameters and re-constructed it
dtrain$get_params()
# $max_bin
# [1] 255
The current engine implementation for lightgbm sends additional engine arguments to the function lightgbm::lgb.train(). However, one of the dataset parameters is really a model option, linear_tree, which fits linear regression models in the leaf nodes rather than constant models in each leaf.
Currently, the tidymodels engine argument mechanics don't allow for specifying where additional engine arguments should be directed. One could simply add a single fixed argument for linear_tree, as I did in this crude example for experimentation, but that's not really a long-term solution if there are other lightgbm arguments that people would like to access in a similar manner.
Would it be possible to more fully expose engine arguments for lightgbm in a way that distinguishes between those intended for lgb.train() vs. lgb.Dataset(), or is an argument-by-argument method the best we can do?
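For illustration, given the pattern discussed above, exposing linear_tree could reduce to passing it through params to both calls. This is a sketch under the assumption that params flow through unchanged and that the installed {lightgbm} build supports linear trees:
library(lightgbm)
data(mtcars)
X <- as.matrix(mtcars[, -1L])
y <- mtcars$mpg
# linear_tree affects Dataset construction, so it rides along in `params`
# to both lgb.Dataset() and lgb.train()
params <- list(linear_tree = TRUE, learning_rate = 0.1)
dtrain <- lightgbm::lgb.Dataset(
  data = X
  , label = y
  , params = params
)
bst <- lightgbm::lgb.train(
  params = params
  , data = dtrain
  , nrounds = 5L
  , obj = "regression"
)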