Closed JoshuaC3 closed 3 years ago
Closed in favor of #2302. We decided to keep all feature requests in one place.
Welcome to contribute this feature! Please re-open this issue (or post a comment if you are not a topic starter) if you are actively working on implementing this feature.
The Constant Term This would allow the model to remove the extra/redundant initial "value"* from the first set of splits on the first feature. This is the same as shifting your y-variable (e.g. lgb.fit(X, y - C)), but until after an initial run you do not know what C is.
This constant term is simply the mean of y in EBMs. Is this just setting init_score to be the mean of y for all observations?
Schedulable Feature Cycling This could be done in many ways, but essentially, all that is needed is the ability to cycle through each feature in turn. With the current model params this is done at random. Combined with the above issue, this means that any features disproportionately picked at random in the first few trees would have an artificially inflated "value"*.
@StrikerRUS this is conceptually a very easy thing to implement, though I imagine it would sit at the internal level (C code level). It would simply require a parameter, let's say feature_sampling, with the options random or cyclic.
Can I add this as a separate feature request and then add it under the New features section? I feel it would be incredibly easy to implement so being under New Algorithm makes it seem a bit too big a task.
@JoshuaC3
This constant term is simply the mean of y in EBMs.
For the mean value I think the boost_from_average param should help. For any other custom values init_score can be used, as you correctly mentioned.
Can I add this as a separate feature request and then add it under the New features section? I feel it would be incredibly easy to implement so being under New Algorithm makes it seem a bit too big a task.
Sure! As you are quite familiar with EBM, feel free to split this big issue into multiple smaller, self-contained feature requests that will not require a lot of effort in terms of writing new code. I believe it will help involve more people in the improvement process of LightGBM.
@StrikerRUS Thanks! I will do this.
In Python, how do you return the init_score after the model is trained? I cannot find any functions or attributes for it. Thanks!!
Do you mean something like this?
import numpy as np
import lightgbm as lgb
from sklearn.datasets import make_regression

# load_boston was removed in scikit-learn 1.2; any regression data works here
X, y = make_regression(n_samples=100, n_features=5, random_state=0)
lgb_data = lgb.Dataset(X, y, init_score=np.ones_like(y))
lgb_data.get_init_score()
In LightGBM, init_score is tied to the Dataset object.
Precisely that. Thanks. I was looking at the Booster class and via the sklearn API.
I will add "Expose get_init_score on the Booster class" as a feature request, and also expose it in the sklearn API. I think I might be able to do this, as it looks like Python all the way up.
Summary
Borrow ideas from InterpretML's Explainable Boosting Machine to make LGBM more interpretable as well as more comparable to their EBM.
Description
I am sure you are somewhat aware of InterpretML's Explainable Boosting Machine - also a Microsoft innovation!
I have been a long-time user of LGBM and am now a recent fan of EBM. Clearly, they are similar in some ways, and each has its strong points and weaknesses; there are trade-offs to choosing either. That said, I have been playing around with some settings in LGBM to make the final models behave more like EBMs. The reasons for this are twofold: 1) increase LGBM interpretability; 2) allow more direct comparisons of results/functionality.
Some tips, tricks and findings are as follows:
Setting n_estimators high, learning_rate low, and num_leaves low begins to mimic some of the behaviours of the EBM. That is, iteratively building LOTS of VERY shallow trees, VERY slowly. I add feature constraints so that each tree is univariate. Pairwise interactions could also be added where needed. This allows the model to learn incrementally small amounts from each feature, rather like the EBM. Essentially, replicating a model of the form lgb = F0(X0) + F1(X1) + ... + Fn(Xn), which is effectively a GLM/GAM.
To me, however, there seem to be two relatively simple features that could be added to facilitate this further: 1) the addition of a Constant Term; 2) Schedulable Feature Cycling.
The Constant Term
This would allow the model to remove the extra/redundant initial "value"* from the first set of splits on the first feature. This is the same as shifting your y-variable (e.g. lgb.fit(X, y - C)), but until after an initial run you do not know what C is.
Schedulable Feature Cycling
This could be done in many ways, but essentially, all that is needed is the ability to cycle through each feature in turn. With the current model params this is done at random. Combined with the above, this means that any features disproportionately picked at random in the first few trees would have an artificially inflated "value"*.
Hopefully it is clear how this would help improve the understanding of how an LGBM model is behaving.
This is aimed at being an ongoing discussion, so please chime in. Any questions, please ask!!
Motivation
1) Increase LGBM interpretability. 2) Allow more direct comparisons of results/functionality. 3) Borrow other ideas from EBMs. 4) Open a discussion on how to do this.
References
*"value" as defined in the table generated by tree = lgb.booster_.trees_to_dataframe().
InterpretML: A Unified Framework for Machine Learning Interpretability
InterpretML: A toolkit for understanding machine learning models
InterpretML's Explainable Boosting Machine