pat-s opened this issue 4 years ago
@pat-s, thank you very much for your review and drawing attention to the adjustments of the mlr3learnerstemplate. I will have a closer look and implement them accordingly in the next week(s).
Regarding your suggestions:
You should declare the ParamSet during construction of the learner
The ParamSet is already declared in the superclass "LightGBM" in its `initialize` function (line 222). The `lgbparams` function referenced there returns this ParamSet. The superclass is then instantiated in both learners (Classifier class and Regressor class). I do not know if this makes sense to you; however, I tried to avoid redundant code, and it basically works.
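For illustration, the described pattern looks roughly like the following sketch (abbreviated and simplified, not the package's actual code; the two parameters shown are just examples):

```r
# Sketch of the described pattern: lgbparams() builds the ParamSet once,
# and the shared superclass declares it in its initialize().
lgbparams = function() {
  paradox::ParamSet$new(list(
    paradox::ParamInt$new("num_iterations", lower = 1L, default = 100L, tags = "train"),
    paradox::ParamDbl$new("learning_rate", lower = 0, default = 0.1, tags = "train")
  ))
}

LightGBM = R6::R6Class("LightGBM",
  public = list(
    param_set = NULL,
    initialize = function() {
      self$param_set = lgbparams()  # shared by the Classifier and Regressor learners
    }
  )
)
```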
Looks like the first does an internal tuning whereas the latter leaves this to the user? These two should get their own class and result in two learners in the end - so that the user can decide which one to use.
The function `lgb.cv` is only needed for getting the best number of training rounds (`nrounds` argument) in conjunction with the argument `early_stopping_rounds`. Since it has no `predict` method, I currently do not see a necessity to separate it into an extra learner. Instead, the resulting number of training rounds `best_iter` from the `lgb.cv` model is directly passed to `lgb.train` for training the final model (and users do not have to manually extract this value from the CV model and pass it to another learner).
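In terms of the plain {lightgbm} API, the described flow corresponds roughly to the following sketch (illustrative data, not the learner's actual code):

```r
library(lightgbm)

# Toy regression data, just for illustration
dtrain = lgb.Dataset(data = as.matrix(mtcars[, -1]), label = mtcars$mpg)
pars = list(objective = "regression", learning_rate = 0.1)

# lgb.cv() is only used to find the best number of boosting rounds ...
cv = lgb.cv(params = pars, data = dtrain, nrounds = 1000L, nfold = 5L,
            early_stopping_rounds = 10L, verbose = -1L)

# ... which is then passed on directly to lgb.train() for the final model.
model = lgb.train(params = pars, data = dtrain, nrounds = cv$best_iter)
```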
However, in the current implementation users have full control over the cross-validation with the following 'switches' in the superclass:

- `nrounds_by_cv`: whether or not to use cross-validation for finding the best number of iterations (default: TRUE)
- `cv_folds`: the number of CV folds (default: 5)

If `nrounds_by_cv` is set to `FALSE`, a user can train a final model with a given number of boosting iterations (… `xgb.train`'s watchlist).

TODOs:
@pat-s I revised the package accordingly and also removed the superclass. Unfortunately, a new error is now introduced during unit tests when calling the "expect_learner" function:

`Check on matches$matches isn't true. Must have at least 1 rows, but has 0 rows`

Any idea where this message comes from?
Thanks. Looks way cleaner already - we're getting there :)
I've looked a bit deeper into the source code and ... oh dear, if there were a prize for implementing an algorithm in an extra complicated way, they could compete with xgboost 😅
Any idea where this message comes from?
The reason is that you had an `_` instead of a `.` in `man = "mlr3learners.lightgbm::mlr_learners_classif.lightgbm"`.
In addition, two example tasks fail the classif autotest because `n` cannot be found. This should be solvable here:

If `config` and `train` pars are passed within one argument to `invoke()`, there is no real need to distinguish between both in the ParamSet.

Does setting `pars[["early_stopping_round"]] = NULL` have any influence compared to not doing it? If not, I'd prefer to remove the line.

In addition two example tasks fail the classif autotest because `n` cannot be found.
I have re-added the definition of 'n' here: https://github.com/mlr3learners/mlr3learners.lightgbm/blob/development/R/LearnerClassifLightGBM.R#L619 We definitely have to discuss this: when using multiclass classification, the parameter "num_class" needs to be set; normally, this can be done by the user, but the mlr3 autotests will fail if it is not set there automatically.
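One possible direction (a sketch only, relying on the mlr3 Task API; `pars` stands for the collected parameter values and is a hypothetical name here):

```r
# Sketch: inside the learner's train step, derive num_class from the task
# for multiclass problems if the user has not set it explicitly.
if (length(task$class_names) > 2L && is.null(pars[["num_class"]])) {
  pars[["num_class"]] = length(task$class_names)
}
```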
To answer your questions:
Can you explain the "custom_eval" parameter?
This is the implementation of `lgb.train`'s `eval` parameter, which also allows providing custom metrics. I plan to ship some custom metrics with this package which are currently not available by default with the learner. One custom metric, "rmsle", is already included. It can be provided to the learner in the following way:
```r
learner$param_set$values = mlr3misc::insert_named(
  learner$param_set$values,
  list(
    "early_stopping_round" = 10,
    "learning_rate" = 0.1,
    "num_iterations" = 100,
    "custom_eval" = mlr3learners.lightgbm::lgb_rmsle
  )
)
```
Why is {MLmetrics} needed?
For custom metrics, such as the "rmsle" metric and future custom metrics.
If `config` and `train` pars are passed within one argument to `invoke()`, there is no real need to distinguish between both in the ParamSet.
To avoid confusion, I renamed `config` to `args`, since most of them are function arguments to `lgb.train` and not natively included in the `params` argument. During training, the ParamSet is provided to the `params` argument. Therefore, the `args` have to be removed from the actual training parameters, which is done here by filtering the `args`: https://github.com/mlr3learners/mlr3learners.lightgbm/blob/development/R/LearnerClassifLightGBM.R#L688
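Conceptually the filtering boils down to something like this sketch (the names in `args` are illustrative, not the package's exact list):

```r
# Sketch: keep lgb.train()'s own function arguments (and learner-level
# switches) out of the list that goes into lgb.train(params = ...).
args = c("nrounds_by_cv", "cv_folds", "custom_eval")   # illustrative names
pars = self$param_set$get_values(tags = "train")
train_pars = pars[setdiff(names(pars), args)]
```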
Does setting `pars[["early_stopping_round"]] = NULL` have any influence compared to not doing it? If not, I'd prefer to remove the line.
Yes, it has influence. When using `nrounds_by_cv`, `early_stopping_round` can be used. Subsequently, the resulting number of iterations is automatically passed to `lgb.train` to train a final model (for prediction etc.). If `early_stopping_round` were not reset, an error would be thrown. In this scenario, it wouldn't even make sense to introduce another early stopping in the final training.
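In other words, the final training step inside the learner roughly does the following (a sketch; `cv`, `pars` and `dtrain` are assumed from the surrounding explanation):

```r
# Sketch of the final training step: fix the number of rounds found by
# lgb.cv() and drop early stopping from the params (leaving it in would
# otherwise raise the error mentioned above).
nrounds = cv$best_iter
pars[["early_stopping_round"]] = NULL
model = lightgbm::lgb.train(params = pars, data = dtrain, nrounds = nrounds)
```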
Please comment on all excluded parameters in the ParamTest, explaining why they are excluded
I did this directly in the code. If I understand this test correctly, I think that for lightgbm it would make more sense to check for the actual learning parameters, which are provided to `lgb.train`'s `params` argument, instead of checking for `lgb.train`'s function arguments. Some of them are already included with the params (although with different aliases); others are not necessarily needed.
when using multiclass-classification, the parameter "num_class" needs to be set - normally, this can be done by the user. But mlr3 autotests will fail if not automatically set there.
If it only has an effect for the multiclass tests, you can set it during construction for the tests. Otherwise, exclude the multiclass tests and run them standalone outside the "autotest".
Metrics
Does it make sense to ship some metrics with a learner? I'd prefer to add these to mlr3measures, but in the end this is your decision. If the measures are optional, {MLmetrics} could/should go into "Suggests" instead of "Imports" so that it is not auto-installed when installing the learner.
Yes, it has influence. When using nrounds_by_cv, early_stopping_round can be used. Subsequently, the resulting number of iterations is automatically passed to lgb.train to train a final model (for prediction etc.). If not resetting early_stopping_round, an error would be thrown. In this scenario, it wouldn't even make sense to introduce another early stopping in the final training.
OK 🙂
I did this directly in the code. If I understand this test correctly, I think that for lightgbm it would make more sense to check for the actual learning parameters, which are provided to lgb.train's params argument instead of checking for lgb.train's function arguments. Some of them are already included with the params (although with different aliases), others are not necessarily needed.
I don't have an overview of which functions of the upstream package contribute to the overall availability of parameters for the mlr3learner. However, the ParamTest should just make it obvious which ones are missing from all of these functions and why. Users checking the ParamTest would expect to find the information there on why a param is not included in the learner.
Also, if the param is available in the learner, why would it be excluded in the ParamTest then (?) 😄 Did you maybe misread it as "included" by chance?
If it only has an effect for the multiclass tests, you can set it during construction for the tests.
Do the response variables of all multiclass tests have the same number of levels?
Does it make sense to ship some metrics with a learner?
These metrics are required for the internal early stopping. Unfortunately, mlr3measures won't work there.
Did you maybe misread it as "included" by chance?
:) I think you are right. I adjusted the tests accordingly; however, they now fail for the parameters `data` and `eval`, although these arguments are being specified in the learner's function call of "lightgbm.train" (https://github.com/mlr3learners/mlr3learners.lightgbm/blob/master/R/LearnerClassifLightGBM.R#L734)...
Do the response variables of all multiclass tests have the same number of levels?
IDK, you would have to look that up in the mlr3 test helpers
These metrics are required for the internal early stopping. Unfortunately, mlr3measures won't work there.
Can the measures you use from the {MLmetrics} package be included into mlr3measures? You don't need to do this yourself but can ask for it in the repo. It shouldn't be a huge task and would make mlr3measures even better.
I adjusted the tests accordingly, however, they now fail for the parameters data and eval although these arguments are being specified in the learner's function call of "lightgbm.train
The latest paramtests fail because lightgbm was not installed? Haven't looked through previous runs though.
Can the measures you use from the MLMetrics package be included into mlr3measures?
RMSLE is already available via mlr3measures. The reason to integrate custom metrics here is more related to the required function design than to the availability of the measure.
In order to work with `lgb.train`, the function requires the `dtrain` argument, an `lgb.Dataset` object, which is used to access the ground truth labels (e.g. https://github.com/mlr3learners/mlr3learners.lightgbm/blob/master/R/eval_rmsle.R#L10). So I think it would be unavoidable to write these custom metric wrapper functions.
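For reference, such a wrapper has to follow roughly the shape below (a sketch of an RMSLE eval function; the package's actual `lgb_rmsle()` may differ in detail):

```r
# Sketch of a custom eval function for lgb.train()'s eval argument:
# lightgbm calls it with the predictions and the lgb.Dataset, from which
# the ground-truth labels are extracted.
lgb_rmsle = function(preds, dtrain) {
  label = lightgbm::getinfo(dtrain, "label")
  value = MLmetrics::RMSLE(y_pred = preds, y_true = label)
  list(name = "rmsle", value = value, higher_better = FALSE)
}
```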
However, I could try to replace the call to the `MLmetrics` package with a call to the `mlr3measures` package (e.g. using the measure `regr.rmsle`). Whether this succeeds depends on whether I can access the measure calculation with mlr3measures in the same manner as `MLmetrics::RMSLE(y_pred = preds, y_true = label)`.
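If `mlr3measures::rmsle(truth, response)` behaves equivalently (an untested assumption on my side), the swap inside the wrapper would be small:

```r
# Hypothetical replacement inside the custom eval wrapper:
value = mlr3measures::rmsle(truth = label, response = preds)
# instead of:
# value = MLmetrics::RMSLE(y_pred = preds, y_true = label)
```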
If I did not overlook it, the second metric, the area under the precision-recall curve (PR-AUC), is not yet available via `mlr3measures`, so it could make sense to add it there due to its strengths for evaluating binary classifiers on imbalanced datasets (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4349800/).
The latest paramtests fail because lightgbm was not installed?
No, they fail when running them manually with lightgbm installed. Do they check all the arguments of the underlying function (here `lgb.train`) or only the `params` argument (https://lightgbm.readthedocs.io/en/latest/R/reference/lgb.train.html)?
If all the measures stuff is so complicated and time-consuming in the end, I'd say just leave it like it is now and focus on more important things :) But it's up to you of course.
No, they fail when running them manually with lightgbm installed. Do they check all the arguments of the underlying function (here lgb.train) or only the parameters-argument
All top-level arguments of a function, so `eval` and `data` should be checked.
Hi @kapsner,
we've made some upstream adjustments to all learners which I would like you to be aware of:

- `.onLoad()` and `.onUnload()` function. You can c/p zzz.R from the mlr3learnerstemplate
- `man` slot in the constructor of the learner
- `train_internal` and `predict_internal` are now private methods with the names `.train()` and `.predict()` (see the sketch after this list)
- `man-roxygen`
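For orientation, the updated learner layout looks roughly like the skeleton below (a simplified sketch, not the full template; the ParamSet and type declarations are placeholders):

```r
# Sketch of the updated layout: ParamSet and man slot are set during
# construction, and .train()/.predict() are private methods.
LearnerClassifLightGBM = R6::R6Class("LearnerClassifLightGBM",
  inherit = mlr3::LearnerClassif,
  public = list(
    initialize = function() {
      super$initialize(
        id = "classif.lightgbm",
        param_set = paradox::ParamSet$new(),   # placeholder: declare params here
        predict_types = c("response", "prob"),
        feature_types = c("numeric", "integer"),
        packages = "lightgbm",
        man = "mlr3learners.lightgbm::mlr_learners_classif.lightgbm"
      )
    }
  ),
  private = list(
    .train = function(task) {
      # fit the lightgbm model here and return it
    },
    .predict = function(task) {
      # build and return a PredictionClassif here
    }
  )
)
```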
In addition I had a quick look over your code base. Note that this is not an extensive review:

- `lgb.cv` and `lgb.train`? Looks like the first does an internal tuning whereas the latter leaves this to the user? These two should get their own class and result in two learners in the end, so that the user can decide which one to use.