interpretml / interpret

Fit interpretable models. Explain blackbox machine learning.
https://interpret.ml/docs
MIT License
6.26k stars 726 forks

EBM inside LightGBM? #22

Closed julioasotodv closed 5 years ago

julioasotodv commented 5 years ago

Hi!

Maybe this is a silly question (mostly because codebases are obviously different), but what do you think about integrating this library with LightGBM, since it is also a top-notch gradient boosting library by Microsoft?

Now that the project is small, this should be easier than trying to do the same later. In the long term, maintenance costs should go down IMHO...

BTW great job with this library. I absolutely love your approach.

interpret-ml commented 5 years ago

Hi @julioasotodv,

It's not a silly question at all. From the outside it sounds like we share a lot in common with other gradient boosting algorithms, but internally things are quite different. For interpretability, we train each feature independently from the others in a round robin fashion, whereas traditional gradient boosting needs to consider multiple features at the same time. This seemingly minor difference causes our algorithms to be quite different in terms of how they are implemented and optimized, so it isn't clear how much overlap there would be between the projects.
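The round robin idea above can be sketched in a few lines. This is a toy illustration with made-up bin counts and learning rate, not InterpretML's actual implementation: each boosting step updates exactly one feature's contribution table.

```python
import numpy as np

def cyclic_boost(X, y, n_rounds=200, lr=0.1, n_bins=8):
    """Toy cyclic boosting: each step updates exactly one feature's
    additive contribution table, visiting features round robin."""
    n, d = X.shape
    # per-feature quantile bin edges and additive contribution tables
    edges = [np.quantile(X[:, j], np.linspace(0, 1, n_bins + 1)[1:-1])
             for j in range(d)]
    binned = [np.digitize(X[:, j], edges[j]) for j in range(d)]
    tables = [np.zeros(n_bins) for _ in range(d)]
    pred = np.zeros(n)
    for r in range(n_rounds):
        j = r % d                       # round robin over features
        resid = y - pred
        # mean residual per bin, shrunk by the learning rate
        upd = np.array([resid[binned[j] == k].mean()
                        if np.any(binned[j] == k) else 0.0
                        for k in range(n_bins)])
        tables[j] += lr * upd
        pred += lr * upd[binned[j]]
    return tables, pred
```

Because each step touches only one feature, the fitted `tables` are directly plottable per-feature shape functions, which is the source of the interpretability.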

-InterpretML Team

zhangyingmath commented 2 months ago

I am very interested in a similar question, not from code maintenance/dev but from the algorithm and performance:

  1. Algorithmically, how is GA2M different from LightGBM restricted to max_depth=2? (I guess I can read the paper, but I just wonder if there is an "executive summary" in layman's terms.)
  2. Same question, but on predictive power. I have seen comparisons on certain problems of EBM against a full-complexity LightGBM model, but not one limited to max_depth=2. If EBM is better, how is that achieved?

paulbkoch commented 2 months ago

Hi @zhangyingmath -- The problem with using LightGBM at max_depth=2 is that without further restrictions you can't control how many pairs are generated, so by default it will end up with a huge number. There are a number of strategies you can employ to avoid this, but they have disadvantages. EBMs are designed from the ground up to fit mains, and then progressively fit higher dimensions (not limited to just pairs). The FAST algorithm, which quickly identifies high-value interactions, is a key ingredient that LightGBM lacks for this. We expose a public "measure_interactions" function which you could use with LightGBM to achieve a similar effect (https://interpret.ml/docs/python/api/measure_interactions.html).
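To make the flavor of interaction detection concrete, here is a toy sketch of the idea (my own simplification, not the actual FAST algorithm or the measure_interactions API): score each feature pair by how much jointly modeling the residual over a 2D grid of bins beats the best single-feature fit.

```python
import numpy as np
from itertools import combinations

def rank_pairs(X, resid, n_bins=8):
    """Toy pair ranking: interaction strength = extra sum-of-squared-error
    reduction from fitting cell means on the joint 2D grid, beyond the
    better of the two 1D bin fits."""
    n, d = X.shape
    binned = np.column_stack([
        np.digitize(X[:, j],
                    np.quantile(X[:, j], np.linspace(0, 1, n_bins + 1)[1:-1]))
        for j in range(d)
    ])

    def sse_of_bin_means(idx):
        out = resid.copy()
        for key in np.unique(idx):
            m = idx == key
            out[m] -= resid[m].mean()   # fit a constant per cell
        return np.sum(out ** 2)

    scores = {}
    for i, j in combinations(range(d), 2):
        joint = binned[:, i] * n_bins + binned[:, j]   # flatten 2D cell index
        scores[(i, j)] = (min(sse_of_bin_means(binned[:, i]),
                              sse_of_bin_means(binned[:, j]))
                          - sse_of_bin_means(joint))
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

A pure product interaction like x0*x1 is invisible to either 1D fit (its conditional mean given one feature is zero) but jumps out on the joint grid, so it lands at the top of the ranking.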

I'm actually in the process of doing some new benchmarking now. The v0.5.1 release of InterpretML in Feb 2024 included significant predictive power improvements, so we needed an updated benchmark. Results are preliminary, but EBMs were better than XGBoost 53% of the time across regression, binary and multiclass classification. This is across 105 datasets from OpenML's CC18 benchmark and their AutoML regression benchmark. If you want LightGBM specific numbers, you can modify our benchmark notebook here.

Why EBMs do so well against unrestricted XGBoost (and probably LightGBM) is a very interesting question. I don't have definitive answers, but I do have a couple of theories:

1) For LightGBM, the default num_leaves is 31, which has benefits and drawbacks. EBMs, in contrast, have a default max_leaves of 3, which means EBMs have about 10 times more data in each leaf. This should allow EBMs to boost longer before reaching the point where they would overfit and trigger early stopping. You can set this parameter lower in LightGBM, but setting it too low prevents the algorithm from exploring the interaction space. EBMs do not have that issue, since interaction detection is handled separately and not conflated with tree building.
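The leaf-size arithmetic is easy to check; with a hypothetical 100,000-row dataset:

```python
n = 100_000                 # hypothetical training set size
lgbm_leaf = n // 31         # LightGBM default num_leaves=31
ebm_leaf = n // 3           # EBM default max_leaves=3
print(lgbm_leaf, ebm_leaf)  # 3225 vs 33333: roughly 10x more data per leaf
```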

2) For interaction detection, and when boosting pairs, EBMs consider the cuts jointly when looking for the best cuts. This is impractical for normal tree-based algorithms, since for each potential cut the algorithm would need to examine all other potential cuts in all other features. EBMs can do this because, when boosting pairs, we already know which two features are involved, which lets the algorithm precompute the gradients and hessians for the discretized cells within the tensor of possible cuts. Computing the gain for all combinations of cuts is then fast, since we don't need to re-examine the original dataset. This has a subtle benefit: a pure interaction effect is not visible when splitting on a single feature, so LightGBM can't see it until after it has already split on one of the features involved. By considering pair splits jointly, EBMs get to see the true pair effect without requiring extra exploration.
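A sketch of the joint cut search over a precomputed gradient/hessian tensor (my own simplified version with one cut per dimension and the standard G²/(H+λ) gain, not the exact EBM internals): once per-cell sums exist, every cut combination can be scored from cumulative sums without touching the raw data.

```python
import numpy as np

def best_joint_cuts(G, H, reg=1.0):
    """Given per-cell gradient sums G and hessian sums H for one feature
    pair (shape [bins_i, bins_j]), score every (cut_i, cut_j) jointly.
    Splitting at (ci, cj) makes 4 quadrants; gain uses G^2/(H+reg)."""
    cg, ch = G.cumsum(0).cumsum(1), H.cumsum(0).cumsum(1)
    tot_g, tot_h = cg[-1, -1], ch[-1, -1]
    best = (-np.inf, None)
    bi, bj = G.shape
    for ci in range(bi - 1):
        for cj in range(bj - 1):
            a_g, a_h = cg[ci, cj], ch[ci, cj]               # low/low quadrant
            b_g, b_h = cg[ci, -1] - a_g, ch[ci, -1] - a_h   # low/high
            c_g, c_h = cg[-1, cj] - a_g, ch[-1, cj] - a_h   # high/low
            d_g = tot_g - a_g - b_g - c_g                   # high/high
            d_h = tot_h - a_h - b_h - c_h
            gain = sum(g * g / (h + reg) for g, h in
                       ((a_g, a_h), (b_g, b_h), (c_g, c_h), (d_g, d_h)))
            gain -= tot_g * tot_g / (tot_h + reg)
            best = max(best, (gain, (ci, cj)))
    return best
```

An XOR-shaped gradient pattern shows the benefit: no single-axis split has any gain, but the joint search finds the center cut immediately.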

3) EBMs boost far more, and with a lower learning rate, than LightGBM or XGBoost. EBMs can afford this because traditional tree-based boosting algorithms need to examine all features when considering the gain for each potential split, whereas the round robin and semi-greedy algorithms that EBMs use do not waste that work: whenever the EBM algorithm examines a feature, it uses the result to make a small update. As an example, say you had a problem with 1,000 features. LightGBM would build 100 trees (since num_iterations=100), and each tree would have 31 leaves (since num_leaves=31), or fewer. To build this, LightGBM needs to examine the data 1,000 * 100 * (31 - 1) = 3,000,000 times. LightGBM has some optimizations to reduce the amount of data it needs to re-examine, but those optimizations are counterproductive for a mostly cyclic algorithm like EBMs. In contrast, when building just the mains of an EBM, the algorithm will generally make about 2,500 boosting steps on each feature before hitting early stopping, which means it examines the feature data 2,500 * 1,000 = 2,500,000 times, but every examination takes a boosting step, leaving the EBM with 2.5 million tiny trees. That's a lot more than the 100 trees in LightGBM, which in turn probably allows more nuanced details to be extracted.
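The back-of-the-envelope numbers above, spelled out:

```python
features, trees, leaves = 1_000, 100, 31
# each internal node (leaves - 1 per tree) must scan every feature's data
lgbm_examinations = features * trees * (leaves - 1)   # 3,000,000
# EBM mains: ~2,500 cyclic boosting steps per feature, one examination each,
# and every examination produces one tiny tree
ebm_steps = 2_500
ebm_examinations = ebm_steps * features               # 2,500,000
print(lgbm_examinations, ebm_examinations)
```

So for similar total data-scanning work, the EBM ends up with roughly 25,000 times as many (much smaller) trees.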

There's probably more I could add, but I'll stop here. In general, I think the prevailing wisdom that interactions are very important is exaggerated, and whether EBMs or a more traditional tree-based GBM is better probably comes down to whether the dataset is dominated by higher-order interaction effects or whether the considerations above matter more. In most other respects EBMs are very similar to other GBMs, which is why I think the numbers are often close.

zhangyingmath commented 2 months ago

Thanks very much for the very detailed answer, @paulbkoch! And thanks for mentioning the interaction detection utility (it seems it depends only on the dataset, not on a specific model; I think you mean I could use its output to help other models limit the number of interaction pairs, if so desired). That seems very useful. I also like your benchmark notebook. Now I know where to look if I want to explore further.

From your explanation, it does seem that EBM and the existing GBM algorithms are very different, and that there are reasons to believe that they may have different performances. That is very good to know.

paulbkoch commented 2 months ago

@zhangyingmath -- measure_interactions can be run directly on the dataset, and when you do, we attempt to measure the pure interaction effect by purifying the tensors before calculating interaction strength. However, it is probably better to first fit a mains model and then pass that model into measure_interactions via the init_score parameter (the mains model can be passed directly to this parameter). From the pairs identified, you can then fit pairs with either LightGBM or an EBM, although EBMs should have an advantage when boosting on pairs due to the joint nature of the gain calculations described above. One other option is to throw away the mains model at this point and build a model restricted to all the mains plus the identified pairs. Training both mains and pairs together is something I'd like to try with EBMs, although I haven't gotten around to coding it yet.
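A toy numpy sketch of the staged workflow (fit mains first, then fit an identified pair on the mains residual). This illustrates the idea only and does not use the interpret API; bin counts, the backfitting loop, and the synthetic target are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 3))
y = np.sin(3 * X[:, 0]) + X[:, 2] ** 2 + X[:, 0] * X[:, 1]  # mains + one pair

n_bins = 16
edges = [np.quantile(X[:, j], np.linspace(0, 1, n_bins + 1)[1:-1])
         for j in range(3)]
bins = [np.digitize(X[:, j], edges[j]) for j in range(3)]

def bin_means(idx, target, n_cells):
    """Mean of `target` within each cell of `idx`."""
    t = np.zeros(n_cells)
    for k in range(n_cells):
        m = idx == k
        if m.any():
            t[k] = target[m].mean()
    return t

# stage 1: backfit mains only; the x0*x1 interaction is invisible to them
pred = np.zeros(len(y))
for _ in range(20):
    for j in range(3):
        upd = bin_means(bins[j], y - pred, n_bins)
        pred += upd[bins[j]]
mains_mse = np.mean((y - pred) ** 2)

# stage 2: fit the identified pair (0, 1) jointly on the mains residual
joint = bins[0] * n_bins + bins[1]
pair = bin_means(joint, y - pred, n_bins * n_bins)
full_mse = np.mean((y - pred - pair[joint]) ** 2)
print(mains_mse, full_mse)   # the pair stage removes most remaining error
```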

There's an example notebook that shows how to handle some of these more advanced interaction scenarios: https://interpret.ml/docs/python/examples/custom-interactions.html

You might also be interested in this paper, which discusses purification and isolating main effects from pairs and higher-order interactions. We also expose a purification utility that can be used independently if you are interested in this aspect: https://proceedings.mlr.press/v108/lengerich20a/lengerich20a.pdf