[RFC] [R-package] Replace "info" interface in lgb.Dataset with keyword arguments

jameslamb commented 3 years ago

Summary

The following changes should be made to lgb.Dataset() in the R package.

"deprecated" = "supported, but raises a warning if used".

In release 3.3.0 (#4310)

[x] deprecate keyword argument info in lgb.Dataset()
- https://github.com/microsoft/LightGBM/pull/4573
[x] add keyword arguments group, weight, init_score, and label to lgb.Dataset()
- https://github.com/microsoft/LightGBM/pull/4573
[x] deprecate passing anything through ... in lgb.Dataset() (https://github.com/microsoft/LightGBM/issues/4226#issuecomment-829473584)
- https://github.com/microsoft/LightGBM/pull/4573
[x] deprecate Dataset$getinfo() (with a warning that it will be renamed to get_field())
[x] deprecate Dataset$setinfo() (with a warning that it will be renamed to set_field())
[x] add method Dataset$get_field(field_name) to Dataset, matching the Python package
- https://github.com/microsoft/LightGBM/blob/8a90ea3f267a81a529e3f069cc13e0f6320e7989/python-package/lightgbm/basic.py#L1939
[x] add method Dataset$set_field(field_name, data) to Dataset, matching the Python package
- https://github.com/microsoft/LightGBM/blob/8a90ea3f267a81a529e3f069cc13e0f6320e7989/python-package/lightgbm/basic.py#L1890

In release 4.0.0

[x] remove ... from lgb.Dataset()
- #4874
[x] remove all deprecation warnings added for release 3.3.0
- several issues linked to this one
[x] remove Dataset$getinfo()
- #4864
[x] remove info from lgb.Dataset()
- #4866
[x] remove Dataset$setinfo()
- #4854

Motivation

reduces maintenance burden by making the R package more closely resemble the Python package
improves usability, especially for users working in IDEs like RStudio
- elevating properties to keyword arguments allows for tab-completion and inline documentation references (https://github.com/microsoft/LightGBM/issues/4226#issuecomment-826523570, https://github.com/microsoft/LightGBM/issues/4226#issuecomment-827145419)
reduces the volume of deprecation warnings for users of version 3.3.0 (since weight, init_score, etc. will match keyword args and not be part of ...)
reduces the risk of bugs by simplifying the interface
- for example, would allow the removal of this logic: https://github.com/microsoft/LightGBM/blob/8a90ea3f267a81a529e3f069cc13e0f6320e7989/R-package/R/lgb.Dataset.R#L51-L64
- and would remove the need to worry about problems like "what happens if you provide init_score as an argument passed through ... and a different init_score in the info list?"
adding deprecation warnings now, plus support for the pattern we want to support from 4.0.0 onwards, gives users time (probably on the order of months) to change their code before the breaking changes in 4.0.0 are released
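
To make the ambiguity mentioned above concrete, here is a minimal sketch in plain R of a function that accepts the same field through keyword arguments, a deprecated info list, and a deprecated `...`. `collect_fields()` is a hypothetical helper and not part of lightgbm. The precedence rule it uses (explicit keyword argument first, then info, then `...`) is an assumption chosen for illustration, not a description of what lgb.Dataset() actually does.

```r
# Hypothetical helper (NOT a lightgbm function): gathers the four metadata
# fields from keyword arguments, a deprecated 'info' list, and deprecated '...'.
collect_fields <- function(label = NULL, weight = NULL, group = NULL,
                           init_score = NULL, info = list(), ...) {
    field_names <- c("label", "weight", "group", "init_score")

    # keep only the keyword arguments that were actually supplied
    fields <- Filter(
        Negate(is.null)
        , list(label = label, weight = weight, group = group, init_score = init_score)
    )

    extra <- list(...)
    # "deprecated" = still supported, but raises a warning if used
    if (length(info) > 0L) {
        warning("'info' is deprecated, use keyword arguments instead")
    }
    if (length(extra) > 0L) {
        warning("passing arguments through '...' is deprecated")
    }

    # assumed precedence: keyword argument, then 'info', then '...'
    legacy <- c(info, extra)
    for (key in intersect(names(legacy), field_names)) {
        if (is.null(fields[[key]])) {
            fields[[key]] <- legacy[[key]]
        }
    }
    return(fields)
}

# the same field supplied twice: the keyword argument is the one that is kept
res <- suppressWarnings(
    collect_fields(
        init_score = c(0.1, 0.2)
        , info = list(init_score = c(9.0, 9.0), weight = c(1.0, 2.0))
    )
)
str(res)
```

Any such rule has to be documented and tested for every field. Once info and `...` are removed, each field can arrive through only one route and the rule is no longer needed.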

Description

LightGBM training involves some preprocessing like bucketing continuous features into histograms and filtering out unsplittable features. That work is done one time before training begins, in the construction of a Dataset object.

In addition to the raw data (i.e. features) used, LightGBM Dataset objects can also contain the following:

label = an array of values for the target (e.g. 0s and 1s for binary classification)
weight = an array of sample weights, used to tell LightGBM that some samples should be considered more important during training
group = a vector of integers, describing how samples should be grouped together into "query results" (only relevant in the learning-to-rank task)
init_score = a matrix of per-sample initial scores to boost from. This can be used, for example, to start the boosting process from predictions created by another model.
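
As a rough illustration of the proposed interface, the sketch below passes some of these fields as keyword arguments and then reads and updates them with get_field() and set_field(). It assumes lightgbm 3.3.0 or later is installed. The data are synthetic and made up purely for the example, and group is left out because it only applies to learning-to-rank.

```r
library(lightgbm)

# synthetic data, invented for this example only
set.seed(42L)
n <- 100L
X <- matrix(rnorm(n * 4L), nrow = n)   # raw features
y <- as.numeric(X[, 1L] > 0.0)         # binary target
w <- runif(n, min = 0.5, max = 1.5)    # per-sample weights

# fields are passed as keyword arguments instead of info = list(...)
dtrain <- lgb.Dataset(
    data = X
    , params = list(verbose = -1L)
    , label = y
    , weight = w
    , init_score = rep(0.0, n)
)

# construct the Dataset before reading fields back from it
dtrain$construct()

# read a field back, then overwrite another one
labels <- dtrain$get_field(field_name = "label")
dtrain$set_field(field_name = "weight", data = rep(1.0, n))
```

Because label, weight, group, and init_score are named arguments, they appear in tab-completion and in the output of ?lgb.Dataset.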

References

implementation of Dataset class on the Python side: https://github.com/microsoft/LightGBM/blob/8a90ea3f267a81a529e3f069cc13e0f6320e7989/python-package/lightgbm/basic.py#L1122-L1128

Other Notes

Sorry I didn't write this up sooner. Didn't really think of it until I started working on adding deprecation warnings for uses of ... (e.g. in #4522).

@Laurae2 and I have already talked about this privately, but I would still like to open this as a Request for Comment (RFC) to give everyone who's interested a chance to voice their opinions.

Laurae2 commented 3 years ago

Agree with all the proposed changes. Not only will this make the package easier to maintain, it will also make it easier for users to work with. 👍

jameslamb commented 2 years ago

This work is now complete. See the list of linked pull requests above for details.

Thanks very much @StrikerRUS for thorough reviews of so many PRs!

StrikerRUS commented 2 years ago

@jameslamb Thanks a lot for splitting the work into multiple small PRs! It was a pleasure to review them.

