microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

Enhancing the Flexibility of Linear Models in Leaf Nodes of Boosted Linear Trees #6630

Open ToddMeng opened 3 months ago

ToddMeng commented 3 months ago

Summary

Enhancing the Flexibility of Linear Models in Leaf Nodes of Boosted Linear Trees

Motivation

Linear trees are a practical technique: they can improve model performance, simplify model structure, and make the model easier to interpret. When working with linear models, users often need to impose custom constraints to improve interpretability and to incorporate prior knowledge. Examples include restricting all regression coefficients to be positive, fixing the monotonic direction of each variable, and limiting the linear regression to a selected subset of features.

Description

As a regular user of this library, I am grateful to all the developers and maintainers, whose work has greatly facilitated ours. After reviewing the documentation and the linear_tree_learner.cpp code (link: https://github.com/microsoft/LightGBM/blob/master/src/treelearner/linear_tree_learner.cpp), I found that the linear model component offers no options beyond the ridge regression parameters. In particular, it does not support the constraints mentioned above, such as fixing the signs of regression coefficients, or restricting the linear regression to a subset of features.
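For reference, this is roughly what the current configuration surface looks like from Python. `linear_tree` and `linear_lambda` are the documented LightGBM parameters; the values are placeholders, and nothing here can constrain coefficient signs or choose which features enter the leaf models.

```python
# Parameters LightGBM documents for linear trees today (values are illustrative).
params = {
    "objective": "regression",
    "linear_tree": True,    # fit a linear model in each leaf instead of a constant
    "linear_lambda": 0.1,   # L2 (ridge) penalty on the leaf coefficients
    "verbose": -1,
}

# With lightgbm installed, these would be passed to training, e.g.:
#   booster = lightgbm.train(params, lightgbm.Dataset(X, label=y))
print(params)
```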

References

The functionality that sklearn offers for its linear models could serve as a reference. Alternatively, an interface could be provided that lets users customize the linear model fitted in each leaf. Either would make linear tree models more flexible and practical.
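As one concrete reference point, sklearn's `LinearRegression` accepts `positive=True` to force non-negative coefficients. Below is a minimal pure-Python sketch of what such a sign constraint does to a single leaf's fit. The function name `fit_nonneg_least_squares` and the toy data are my own assumptions for illustration; this is not LightGBM's solver, only coordinate descent with a clip at zero.

```python
def fit_nonneg_least_squares(X, y, lam=0.0, n_iter=200):
    """Least squares with an optional L2 penalty and all coefficients >= 0.

    Hypothetical helper: coordinate descent, clipping each coefficient at zero.
    """
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    intercept = sum(y) / n
    for _ in range(n_iter):
        for j in range(p):
            num = den = 0.0
            for row, target in zip(X, y):
                # Prediction from the intercept and every column except j.
                others = intercept + sum(coef[k] * row[k] for k in range(p) if k != j)
                num += row[j] * (target - others)
                den += row[j] ** 2
            # Ridge-shrunk update, then projection onto coef[j] >= 0.
            coef[j] = max(0.0, num / (den + lam)) if den + lam > 0 else 0.0
        # Unpenalised intercept: mean of what the coefficients leave unexplained.
        intercept = sum(
            t - sum(c * v for c, v in zip(coef, row)) for row, t in zip(X, y)
        ) / n
    return intercept, coef


# Toy leaf data generated as y = 2*x0 - 1*x1, so an unconstrained fit
# would give the second feature a negative coefficient.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [0.0, 2.0, -1.0, 1.0]

intercept, coef = fit_nonneg_least_squares(X, y)
print(intercept, coef)
```

With the constraint active, the second feature cannot take its negative coefficient, so it drops out of the leaf model and the intercept absorbs the shift.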

jaguerrerod commented 2 months ago

Related to this, I think adding the option to include some predictors in all linear models, in addition to the predictors used in the splits to reach the leaf, is important. I have datasets containing data from several population segments, and I am not interested in including the variables that define the segments in the model itself. However, I would like to include an adjustment in the prediction using the segment flags in the linear model fitted to each leaf. My leaves have more than 20K observations, so including this segment adjustment does not pose an overfitting problem. This option could be set through a parameter, 'features_forced_to_leaf_linear_model', as an array of feature indices or feature names. I think this wouldn't be complex to implement, but I don't have the necessary C++ skills to do it.
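To make the idea concrete, here is a small sketch of how the feature set of one leaf's linear model could be assembled. `features_forced_to_leaf_linear_model` is only the parameter name proposed above and does not exist in LightGBM; `leaf_linear_features` is a hypothetical helper, and feature indices are assumed rather than names.

```python
def leaf_linear_features(branch_features, forced_features):
    """Feature indices for one leaf's linear model (hypothetical helper).

    Keeps the features split on along the path to the leaf, then appends the
    forced ones, dropping duplicates while preserving first-seen order.
    """
    selected = []
    seen = set()
    for idx in list(branch_features) + list(forced_features):
        if idx not in seen:
            seen.add(idx)
            selected.append(idx)
    return selected


# A leaf reached through splits on features 3 and 1 (feature 3 was split on twice).
# Features 7 and 1 stand in for the forced segment flags.
features_forced_to_leaf_linear_model = [7, 1]
features = leaf_linear_features([3, 1, 3], features_forced_to_leaf_linear_model)
print(features)
```

The per-leaf regression would then be fitted on these columns only, so the segment flags adjust the prediction in every leaf without ever having been used as split variables.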