microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

Investigate possibility of borrowing some features/ideas from Explainable Boosted Machines #3905

Closed JoshuaC3 closed 3 years ago

JoshuaC3 commented 3 years ago

Summary

Borrow ideas from InterpretML's Explainable Boosting Machine (EBM) to make LGBM more interpretable, as well as more directly comparable to EBM.

Description

I am sure you are somewhat aware of InterpretML's Explainable Boosting Machine - also a Microsoft innovation!

I have been a long-time user of LGBM and am a recent fan of EBM. Clearly they are similar in some ways, and each has its strengths and weaknesses, so choosing either involves trade-offs. That said, I have been experimenting with some settings in LGBM to make the final models behave more like EBMs. The reasons for this are twofold: 1) increase LGBM interpretability, and 2) allow more direct comparisons of results/functionality.

Some tips, tricks and findings are as follows:

Setting n_estimators high, learning_rate low, and num_leaves low begins to mimic some of the behaviours of the EBM: iteratively building LOTS of VERY shallow trees, VERY slowly. I also add interaction constraints so that each tree is univariate (pairwise interactions could be added where needed). This allows the model to learn incrementally small amounts from each feature, rather like the EBM - essentially replicating a model of the form:

lgb = F0(X0) + F1(X1) + ... + Fn(Xn)

which is effectively a GLM/GAM.

from lightgbm import LGBMRegressor

lgb = LGBMRegressor(
    n_estimators=5000,   # large: many boosting rounds
    learning_rate=0.01,  # small: learn slowly
    num_leaves=4,        # shallow trees
    # one constraint group per feature, so every tree is univariate
    interaction_constraints=[[i] for i in range(X.shape[1])],
    # to allow pairwise interactions instead:
    # interaction_constraints=[[i, j] for i in range(X.shape[1]) for j in range(X.shape[1]) if i != j],
    # feature_contri=[10 for i in range(X.shape[1])],  # use to reduce single-feature dependence?
    # monotone_constraints=[1 for i in range(X.shape[1] - 1)] + [-1],  # encode expert knowledge; I used this in my temperature vs gas use case (temp -> gas monotonically decreasing)
)

To me, however, there seem to be two relatively simple features that could be added to facilitate this further: 1) the addition of a constant term, and 2) schedulable feature cycling.

The Constant Term

This would allow the model to remove the extra/redundant initial "value"* from the first set of splits on the first feature. This is the same as shifting your y-variable (e.g. lgb.fit(X, y - C)), but you do not know what C is until after an initial run.
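For illustration, the shift described above can be sketched in plain NumPy. Here C is simply taken to be the target mean (the constant EBMs use); the y values are made up for the example:

```python
import numpy as np

# hypothetical target values
y = np.array([3.0, 5.0, 7.0, 9.0])

C = y.mean()                  # one natural choice for the constant term
y_shifted = y - C             # train on deviations around zero, i.e. lgb.fit(X, y - C)
y_recovered = y_shifted + C   # add C back onto predictions afterwards
```

The point is that the boosted trees then only have to model deviations around zero, rather than spending their first splits encoding the overall level of y.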

Schedulable Feature Cycling

This could be done in many ways, but essentially all that is needed is the ability to cycle through each feature in turn. With the current model params, feature selection is random. Combined with the above issue, this means that any features disproportionately picked at random in the first few trees would have an artificially inflated "value"*.
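To make the idea concrete, here is a minimal NumPy sketch of round-robin univariate boosting with regression stumps. This is not LightGBM's implementation - just an illustration of what cyclic feature selection would do, using a hypothetical one-split stump learner:

```python
import numpy as np

def fit_stump(x, r):
    """Fit a one-split regression stump to residuals r on a single feature x.
    Returns (threshold, left_value, right_value) minimizing squared error."""
    order = np.argsort(x)
    xs, rs = x[order], r[order]
    best = (xs[0] - 1.0, rs.mean(), rs.mean())  # fallback: no useful split
    best_sse = ((rs - rs.mean()) ** 2).sum()
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # cannot split between identical values
        left, right = rs[:i], rs[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse = sse
            best = ((xs[i - 1] + xs[i]) / 2.0, left.mean(), right.mean())
    return best

def predict_stump(stump, x):
    thr, left_value, right_value = stump
    return np.where(x <= thr, left_value, right_value)

def cyclic_boost(X, y, n_rounds=200, lr=0.1):
    """Round-robin boosting: visit features in a fixed cycle, one stump each."""
    pred = np.full(len(y), y.mean())  # constant term C = mean(y)
    stumps = []
    for t in range(n_rounds):
        j = t % X.shape[1]            # cycle through features in turn, not at random
        stump = fit_stump(X[:, j], y - pred)
        pred = pred + lr * predict_stump(stump, X[:, j])
        stumps.append((j, stump))
    return stumps, pred
```

Because every feature gets the same number of boosting rounds, no feature's early "value" is inflated just by being sampled first - which is exactly the EBM-like behaviour being requested.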

Hopefully it is clear how this would help improve the understanding of how an LGBM model is behaving.

This is aimed at being an ongoing discussion, so please chime in. Any questions, please ask!!

Motivation

1) Increase LGBM interpretability. 2) Allow more direct comparisons of results/functionality. 3) Borrow other ideas from EBMs. 4) Open a discussion on how to do this.

References

*"value" as defined in the table generated by: tree = lgb.booster_.trees_to_dataframe()

InterpretML: A Unified Framework for Machine Learning Interpretability
InterpretML: A toolkit for understanding machine learning models
InterpretML's Explainable Boosting Machine

StrikerRUS commented 3 years ago

Closed in favor of being in #2302. We decided to keep all feature requests in one place.

Welcome to contribute this feature! Please re-open this issue (or post a comment if you are not a topic starter) if you are actively working on implementing this feature.

JoshuaC3 commented 3 years ago

The Constant Term This would allow the model to remove the extra/redundant initial "value"* from the first set of splits on the first feature. This is the same as shifting your y-variable (e.g. lgb.fit(X, y - C)), but you do not know what C is until after an initial run.

This constant term is simply the mean of y in EBMs. Is this just setting init_score to be the mean of y for all observations?

JoshuaC3 commented 3 years ago

Schedulable Feature Cycling This could be done in many ways, but essentially, all that is needed is the ability to cycle through each feature in-turn. With the current model params this is done at random. Combined with the above issue, this means that any features disproportionately picked at random in the first few trees would have an artificially inflated "value"*.

@StrikerRUS this is conceptually a very easy thing to implement, though I imagine it sits at the internal (C++) level. It would simply require a parameter, let's say feature_sampling, with the options random or cyclic.

Can I add this as a separate feature request and then add it under the New features section? I feel it would be incredibly easy to implement, so putting it under New Algorithm makes it seem too big a task.

StrikerRUS commented 3 years ago

@JoshuaC3

This constant term is simply the mean of y in EBMs.

For mean value I think boost_from_average param should help. For any other custom values init_score can be used, as you correctly mentioned.

Can I add this as a separate feature request and then add it under the New features section? I feel it would be incredibly easy to implement so being under New Algorithm makes it seem a bit too big a task.

Sure! As you are quite familiar with EBM, feel free to split this big issue into multiple smaller, self-contained feature requests that will not require a lot of effort in terms of writing new code. I believe it will help involve more people in the process of improving LightGBM.

JoshuaC3 commented 3 years ago

@StrikerRUS Thanks! I will do this.

In Python, how do you return the init_score after the model is trained? I cannot find any functions or attributes for it. Thanks!!

StrikerRUS commented 3 years ago

Do you mean something like this?

import numpy as np
import lightgbm as lgb

from sklearn.datasets import load_boston

X, y = load_boston(return_X_y=True)
lgb_data = lgb.Dataset(X, y, init_score=np.ones_like(y))
lgb_data.get_init_score()

In LightGBM, init_score is tied to the Dataset object.

JoshuaC3 commented 3 years ago

Precisely that, thanks. I was looking at the Booster class and the sklearn API.

I will add "Expose get_init_score on the Booster class" as a feature request, and also propose exposing it in the sklearn API. I think I might be able to do this myself, as it looks like Python all the way up.