pat-s opened this issue 4 years ago
@pat-s, thank you very much for your review and drawing attention to the adjustments of the mlr3learnerstemplate. I will have a closer look and implement them accordingly in the next week(s).
Regarding your suggestions:
You should declare the ParamSet during construction of the learner
The ParamSet is already declared in the superclass "LightGBM" in its `initialize` function (line 222). The `lgbparams` function referenced there returns this ParamSet. The superclass is then instantiated in both learners (Classifier class and Regressor class). I do not know if this makes sense to you; however, I tried to avoid redundant code, and it basically works.
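For illustration, the described pattern looks roughly like the following sketch (abbreviated and simplified, not the package's actual code; the two parameters shown are just examples):

```r
# Sketch of the described pattern: lgbparams() builds the ParamSet once,
# and the shared superclass declares it in its initialize().
lgbparams = function() {
  paradox::ParamSet$new(list(
    paradox::ParamInt$new("num_iterations", lower = 1L, default = 100L, tags = "train"),
    paradox::ParamDbl$new("learning_rate", lower = 0, default = 0.1, tags = "train")
  ))
}

LightGBM = R6::R6Class("LightGBM",
  public = list(
    param_set = NULL,
    initialize = function() {
      self$param_set = lgbparams()  # shared by the Classifier and Regressor learners
    }
  )
)
```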
Looks like the first does an internal tuning whereas the latter leaves this to the user? These two should get their own class and result in two learners in the end - so that the user can decide which one to use.
The function `lgb.cv` is only needed for getting the best number of training rounds (`nrounds` argument) in conjunction with the argument `early_stopping_rounds`. Since it has no `predict` method, I currently do not see a necessity to separate it into an extra learner. Instead, the resulting number of training rounds `best_iter` from the `lgb.cv` model is directly passed to `lgb.train` for training the final model (and users do not have to manually extract this value from the CV model and pass it to another learner).
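In terms of the plain {lightgbm} API, the described flow corresponds roughly to the following sketch (illustrative data, not the learner's actual code):

```r
library(lightgbm)

# Toy regression data, just for illustration
dtrain = lgb.Dataset(data = as.matrix(mtcars[, -1]), label = mtcars$mpg)
pars = list(objective = "regression", learning_rate = 0.1)

# lgb.cv() is only used to find the best number of boosting rounds ...
cv = lgb.cv(params = pars, data = dtrain, nrounds = 1000L, nfold = 5L,
            early_stopping_rounds = 10L, verbose = -1L)

# ... which is then passed on directly to lgb.train() for the final model.
model = lgb.train(params = pars, data = dtrain, nrounds = cv$best_iter)
```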
However, in the current implementation users have full control over the cross-validation with the following 'switches' in the superclass:

- `nrounds_by_cv`: whether or not to use cross-validation for finding the best number of iterations (default: TRUE)
- `cv_folds`: the number of CV folds (default: 5)

If `nrounds_by_cv` is set to `FALSE`, a user can train a final model with a given number of boosting iterations (… `xgb.train`'s watchlist).

TODOs:
@pat-s I revised the package accordingly and also removed the superclass. Unfortunately, a new error is now introduced during unit tests when calling the "expect_learner" function:

`Check on matches$matches isn't true. Must have at least 1 rows, but has 0 rows`

Any idea where this message comes from?
Thanks. Looks way cleaner already - we're getting there :)
I've looked a bit deeper into the source code and ... oh dear, if there were a prize for implementing an algorithm in an extra complicated way, they could compete with xgboost 😅
Any idea where this message comes from?
The reason is that you had an `_` instead of a `.` in `man = "mlr3learners.lightgbm::mlr_learners_classif.lightgbm"`.
In addition, two example tasks fail the classif autotest because `n` cannot be found. This should be solvable here:

If `config` and `train` pars are passed within one argument to `invoke()`, there is no real need to distinguish between both in the ParamSet.

Does setting `pars[["early_stopping_round"]] = NULL` have any influence compared to not doing it? If not, I'd prefer to remove the line.

In addition two example tasks fail the classif autotest because `n` cannot be found.
I have re-added the definition of 'n' here: https://github.com/mlr3learners/mlr3learners.lightgbm/blob/development/R/LearnerClassifLightGBM.R#L619 We definitely have to discuss this: when using multiclass classification, the parameter "num_class" needs to be set; normally, this can be done by the user, but the mlr3 autotests will fail if it is not set there automatically.
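One possible direction (a sketch only, relying on the mlr3 Task API; `pars` stands for the collected parameter values and is a hypothetical name here):

```r
# Sketch: inside the learner's train step, derive num_class from the task
# for multiclass problems if the user has not set it explicitly.
if (length(task$class_names) > 2L && is.null(pars[["num_class"]])) {
  pars[["num_class"]] = length(task$class_names)
}
```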
To answer your questions:
Can you explain the "custom_eval" parameter?
This is the implementation of `lgb.train`'s `eval` parameter, which also allows providing custom metrics. I plan to ship some custom metrics with this package which are currently not available by default with the learner. One custom metric, "rmsle", is already included. It can be provided to the learner in the following way:
```r
learner$param_set$values = mlr3misc::insert_named(
  learner$param_set$values,
  list(
    "early_stopping_round" = 10,
    "learning_rate" = 0.1,
    "num_iterations" = 100,
    "custom_eval" = mlr3learners.lightgbm::lgb_rmsle
  )
)
```
Why is {MLmetrics} needed?
For custom metrics, such as the "rmsle" metric and future custom metrics.
If `config` and `train` pars are passed within one argument to `invoke()`, there is no real need to distinguish between both in the ParamSet.
To avoid confusion, I renamed `config` to `args`, since most of them are function arguments to `lgb.train` and not natively included in the `params` argument. During training, the ParamSet is provided to the `params` argument. Therefore, the `args` have to be removed from the actual training parameters, which is done here by filtering the `args`: https://github.com/mlr3learners/mlr3learners.lightgbm/blob/development/R/LearnerClassifLightGBM.R#L688
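Conceptually the filtering boils down to something like this sketch (the names in `args` are illustrative, not the package's exact list):

```r
# Sketch: keep lgb.train()'s own function arguments (and learner-level
# switches) out of the list that goes into lgb.train(params = ...).
args = c("nrounds_by_cv", "cv_folds", "custom_eval")   # illustrative names
pars = self$param_set$get_values(tags = "train")
train_pars = pars[setdiff(names(pars), args)]
```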
Does setting `pars[["early_stopping_round"]] = NULL` have any influence compared to not doing it? If not, I'd prefer to remove the line.
Yes, it has influence. When using `nrounds_by_cv`, `early_stopping_round` can be used. Subsequently, the resulting number of iterations is automatically passed to `lgb.train` to train a final model (for prediction etc.). If `early_stopping_round` were not reset, an error would be thrown. In this scenario, it wouldn't even make sense to introduce another early stopping in the final training.
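In other words, the final training step inside the learner roughly does the following (a sketch; `cv`, `pars` and `dtrain` are assumed from the surrounding explanation):

```r
# Sketch of the final training step: fix the number of rounds found by
# lgb.cv() and drop early stopping from the params (leaving it in would
# otherwise raise the error mentioned above).
nrounds = cv$best_iter
pars[["early_stopping_round"]] = NULL
model = lightgbm::lgb.train(params = pars, data = dtrain, nrounds = nrounds)
```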
Please comment on all excluded parameters in the ParamTest, explaining why they are excluded
I did this directly in the code. If I understand this test correctly, I think that for lightgbm it would make more sense to check for the actual learning parameters, which are provided to `lgb.train`'s `params` argument, instead of checking for `lgb.train`'s function arguments. Some of them are already included with the params (although with different aliases); others are not necessarily needed.
when using multiclass-classification, the parameter "num_class" needs to be set - normally, this can be done by the user. But mlr3 autotests will fail if not automatically set there.
If it only has an effect for the multiclass tests, you can set it during construction for the tests. Otherwise, exclude the multiclass tests and run them standalone outside the "autotest".
Metrics
Does it make sense to ship some metrics with a learner? I'd prefer to add these to mlr3measures, but in the end this is your decision. If the measures are optional, {MLmetrics} could/should go into "Suggests" instead of "Imports" so that it is not auto-installed when installing the learner.
Yes, it has influence. When using nrounds_by_cv, early_stopping_round can be used. Subsequently, the resulting number of iterations is automatically passed to lgb.train to train a final model (for prediction etc.). If not resetting early_stopping_round, an error would be thrown. In this scenario, it wouldn't even make sense to introduce another early stopping in the final training.
OK 🙂
I did this directly in the code. If I understand this test correctly, I think that for lightgbm it would make more sense to check for the actual learning parameters, which are provided to lgb.train's params argument instead of checking for lgb.train's function arguments. Some of them are already included with the params (although with different aliases), others are not necessarily needed.
I don't have an overview of which functions of the upstream package contribute to the overall availability of parameters for the mlr3learner. However, the ParamTest should just make it obvious which ones are missing from all of these functions and why. Users checking the ParamTest would expect to find the information there on why a param is not included in the learner.
Also, if the param is available in the learner, why would it be excluded in the ParamTest then (?) 😄 Did you maybe misread it as "included" by chance?
If it only has an effect for the multiclass tests, you can set it during construction for the tests.
Do the response variables of all multiclass tests have the same number of levels?
Does it make sense to ship some metrics with a learner?
These metrics are required for the internal early stopping. Unfortunately, mlr3measures won't work there.
Did you maybe misread it as "included" by chance?
:) I think you are right. I adjusted the tests accordingly; however, they now fail for the parameters `data` and `eval`, although these arguments are being specified in the learner's function call of "lightgbm.train" (https://github.com/mlr3learners/mlr3learners.lightgbm/blob/master/R/LearnerClassifLightGBM.R#L734)...
Do the response variables of all multiclass tests have the same number of levels?
IDK, you would have to look that up in the mlr3 test helpers
These metrics are required for the internal early stopping. Unfortunately, mlr3measures won't work there.
Can the measures you use from the {MLmetrics} package be included into mlr3measures? You don't need to do this yourself but can ask for it in the repo. It shouldn't be a huge task and would make mlr3measures even better.
I adjusted the tests accordingly, however, they now fail for the parameters data and eval although these arguments are being specified in the learner's function call of "lightgbm.train
The latest paramtests fail because lightgbm was not installed? Haven't looked through previous runs though.
Can the measures you use from the MLMetrics package be included into mlr3measures?
RMSLE is already available via mlr3measures. The reason to integrate custom metrics here is more related to the required function design than to the availability of the measure.
In order to work with `lgb.train`, the function requires the `dtrain` argument, an `lgb.Dataset` object, which is used to access the ground truth labels (e.g. https://github.com/mlr3learners/mlr3learners.lightgbm/blob/master/R/eval_rmsle.R#L10). So I think it would be unavoidable to write these custom metric wrapper functions.
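For reference, such a wrapper has to follow roughly the shape below (a sketch of an RMSLE eval function; the package's actual `lgb_rmsle()` may differ in detail):

```r
# Sketch of a custom eval function for lgb.train()'s eval argument:
# lightgbm calls it with the predictions and the lgb.Dataset, from which
# the ground-truth labels are extracted.
lgb_rmsle = function(preds, dtrain) {
  label = lightgbm::getinfo(dtrain, "label")
  value = MLmetrics::RMSLE(y_pred = preds, y_true = label)
  list(name = "rmsle", value = value, higher_better = FALSE)
}
```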
However, I could try to replace the call to the `MLmetrics` package with a call to the `mlr3measures` package (e.g. using the measure `regr.rmsle`). Whether this succeeds depends on whether I can access the measure calculation with mlr3measures in the same manner as `MLmetrics::RMSLE(y_pred = preds, y_true = label)`.
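If `mlr3measures::rmsle(truth, response)` behaves equivalently (an untested assumption on my side), the swap inside the wrapper would be small:

```r
# Hypothetical replacement inside the custom eval wrapper:
value = mlr3measures::rmsle(truth = label, response = preds)
# instead of:
# value = MLmetrics::RMSLE(y_pred = preds, y_true = label)
```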
If I did not overlook it, the second metric, the area under the precision-recall curve (PR-AUC), is not yet available via `mlr3measures`, so it could make sense to add it there due to its strengths for evaluating binary classifiers on imbalanced datasets (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4349800/).
The latest paramtests fail because lightgbm was not installed?
No, they fail when running them manually with lightgbm installed. Do they check all the arguments of the underlying function (here `lgb.train`) or only the `params` argument (https://lightgbm.readthedocs.io/en/latest/R/reference/lgb.train.html)?
If all the measures stuff is so complicated and time-consuming in the end, I'd say just leave it like it is now and focus on more important things :) But it's up to you of course.
No, they fail when running them manually with lightgbm installed. Do they check all the arguments of the underlying function (here lgb.train) or only the parameters-argument
All top-level arguments of a function, so `eval` and `data` should be checked.
Hi @kapsner,
we've made some upstream adjustments to all learners which I would like you to be aware of:

- `.onLoad()` and `.onUnload()` function. You can c/p zzz.R from the mlr3learnerstemplate
- `man` slot in the constructor of the learner
- `train_internal` and `predict_internal` are now private methods with the names `.train()` and `.predict()` (see the sketch after this list)
- `man-roxygen`
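For orientation, the updated learner layout looks roughly like the skeleton below (a simplified sketch, not the full template; the ParamSet and type declarations are placeholders):

```r
# Sketch of the updated layout: ParamSet and man slot are set during
# construction, and .train()/.predict() are private methods.
LearnerClassifLightGBM = R6::R6Class("LearnerClassifLightGBM",
  inherit = mlr3::LearnerClassif,
  public = list(
    initialize = function() {
      super$initialize(
        id = "classif.lightgbm",
        param_set = paradox::ParamSet$new(),   # placeholder: declare params here
        predict_types = c("response", "prob"),
        feature_types = c("numeric", "integer"),
        packages = "lightgbm",
        man = "mlr3learners.lightgbm::mlr_learners_classif.lightgbm"
      )
    }
  ),
  private = list(
    .train = function(task) {
      # fit the lightgbm model here and return it
    },
    .predict = function(task) {
      # build and return a PredictionClassif here
    }
  )
)
```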
In addition I had a quick look over your code base. Note that this is not an extensive review:

- `lgb.cv` and `lgb.train`? Looks like the first does an internal tuning whereas the latter leaves this to the user? These two should get their own class and result in two learners in the end, so that the user can decide which one to use.