interpretml / interpret

Fit interpretable models. Explain blackbox machine learning.
https://interpret.ml/docs

Binning of data #471


piyushnegi97 commented 1 year ago

Thank you for this amazing work that you guys have done. I wanted to understand whether it is possible to explicitly define the bins for a feature instead of leaving the binning to the library. Is there a provision for user-defined bins? If there is no such provision, could the code be changed to support them? Can you please guide me on this?

paulbkoch commented 1 year ago

Hi @piyushnegi97 -- Thank you for the kind words.

You can explicitly define bins through the feature_types parameter. https://interpret.ml/docs/ebm.html#explainableboostingclassifier

And there is an example of how to do it here: https://interpret.ml/docs/ebm-internals-multiclass.html
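Here is a minimal sketch of the first approach. The data and cut points below are made up for illustration; per the feature_types documentation linked above, passing a list of floats for a feature specifies its cut points explicitly.

```python
# Minimal sketch: a list of floats as a feature's type defines the bin cut
# points explicitly instead of letting the automatic binning choose them.
# The data and cut points here are illustrative only.
import numpy as np
from interpret.glassbox import ExplainableBoostingClassifier

rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(1000, 2))
y = (X[:, 0] + X[:, 1] > 100).astype(int)

ebm = ExplainableBoostingClassifier(
    feature_types=[
        [25.0, 50.0, 75.0],  # explicit cut points for feature 0
        "continuous",        # feature 1: let the library choose the bins
    ]
)
ebm.fit(X, y)
print(ebm.bins_[0])  # the stored binning for feature 0 should reflect the cuts above
```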

piyushnegi97 commented 1 year ago

Hi @paulbkoch ,

Thanks for your reply. I have a few other queries regarding EBMs. Can you help me understand them better?

1) Is hyperparameter tuning required for EBMs, and will it help improve performance? I read a section in the FAQ which mentions that hyperparameter tuning is not recommended but can be done. Are there any use cases, comparisons, or research involving hyperparameter tuning that would help me understand the effect of the hyperparameters better?

2) SHAP is one of the go-to techniques for explainability. Are there any comparative studies or research on the explanations produced by EBMs vs. SHAP?

3) Does EBM handle categorical data by itself, and if so, how?

piyushnegi97 commented 1 year ago

Hi @paulbkoch, I have another query and would really appreciate it if you could help me with this as well:

  1. My understanding of EBM is that it forms vectors of (xi, f(xi)) pairs and then gets rid of the trees. This vector is eventually used as a shape function (a step function) for any new predictions. Does EBM store an equation for f(xi) in any form, or is this shape the only thing it has (a fixed logit value throughout each range of xi)?

  2. Also, f(xi) here is basically built from gradient boosted trees, which are themselves a blackbox, so how does f(xi) make the entire thing glassbox?

paulbkoch commented 1 year ago

Hi @piyushnegi97 -- Here are some answers to your questions:

  1. As you mentioned, hyperparameter tuning is not really required. It probably helps a bit. I'm not aware of any papers that examine this though, and we haven't really done extensive experiments, so if you're interested please investigate.

  2. There is a correspondence between SHAP and EBMs in that the SHAP values of an EBM without interactions are exactly equal to the EBM local explanations, which are also equal to the score values of the individual feature contributions of an EBM. KernelSHAP computations may differ slightly due to approximations, but the SHAP package also includes an additive explainer for EBMs (https://github.com/shap/shap/blob/master/shap/explainers/_additive.py) which simply returns the EBM local explanations. It does not currently support pairwise interactions, but if it did then the results would be theoretically identical to SHAP interaction values (https://shap.readthedocs.io/en/latest/example_notebooks/tabular_examples/tree_based_models/Basic%20SHAP%20Interaction%20Value%20Example%20in%20XGBoost.html). So, basic SHAP values are an exact explanation of an EBM without interactions, and interaction SHAP values are an exact explanation of an EBM with pairwise interactions. If you compute the basic SHAP values for an EBM with pairwise interactions, though, SHAP will summarize the pairwise effects onto the individual features. How does this differ from other tree-based models? The default XGBoost model is built with max_depth=6 and n_estimators=100, which means there are 2^5 * 100 = 3,200 potential 6-way interactions inside the default XGBoost model. The model would also have an even larger number of 5-way, 4-way, 3-way, and pairwise interactions that come from the purification distillates of the 6-way interactions. In theory all of these interactions could have their own SHAP values calculated for them, but such a complex explanation would not be understandable by humans, so for these models the best you can really do is summarize these effects onto individual features. In practice this works fairly well; however, such information compression comes at the cost of exactness for the typical gradient boosted tree-based model.

  3. Yes, EBMs handle categorical data. We prefer that you pass in categorical data as strings, which will then be shown in the graphs. Internally, categorical data is handled as in any other tree-based method: categories are placed on either side of each tree split.

  4. We throw away the trees during the boosting process. We don't even keep them between boosting steps. There are only four attributes that are actually required in order to make predictions from an EBM: "bins_", "term_scores_", "term_features_", and "intercept_". term_scores_ holds the f(xi) values that you're asking about.

  5. The glassbox nature of EBMs comes from the fact that EBMs are a type of GAM (generalized additive model). If I gave you a printout of every feature's global explanation graph, you as a human could calculate the predictions by simply looking up the score value on each feature graph and then summing those values (a small sketch of this follows below). For regression that sum would be your prediction, and for classification it would be in logits. The fact that EBMs are constructed from gradient boosted trees is just a description of how they are generated internally and is not related to their glassbox nature. If you're interested in the nitty-gritty details of how that process works, I'd direct you to the original KDD papers from 2012 and 2013 linked on our readme.
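A small sketch of points 4 and 5, assuming a recent interpret release where eval_terms() is available (the data here is synthetic): summing the per-term scores and adding the intercept reproduces the model's own predictions.

```python
# Sketch: the "look up each graph and sum" view of an EBM.
import numpy as np
from interpret.glassbox import ExplainableBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X[:, 0] ** 2 + 2 * X[:, 1] + rng.normal(scale=0.1, size=500)

ebm = ExplainableBoostingRegressor(interactions=0)  # mains only
ebm.fit(X, y)

contribs = ebm.eval_terms(X[:5])                # one column of scores per term
manual = contribs.sum(axis=1) + ebm.intercept_  # look up each graph and sum
print(np.allclose(manual, ebm.predict(X[:5])))  # matches the model's predictions
```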

piyushnegi97 commented 1 year ago

Hi @paulbkoch. Thank you for helping me understand the above. I have a couple more queries and would really appreciate it if you could help me with them as well.

  1. Just a thought: can the shape functions that we get from an EBM be used effectively as a replacement for PDPs? Both tell us how the score varies with a change in feature value, so the shape function essentially tells us the same thing as a PDP.
  2. Is there any difference in the PDP process of EBMs vs. open-source PDP? I believe the same method as described in Molnar's book has been followed?
  3. Is there any benchmarking/analysis you might have done comparing various gradient boosting models (XGBoost, LightGBM, CatBoost) vs. EBM performance?

Really grateful for your help.

paulbkoch commented 1 year ago

Hi @piyushnegi97 -- PDP doesn't quite work because there are almost always correlations between features, which would lead to counting some of the effect twice if you were to use PDPs directly. The EBM construction method forces the model to choose which features to value more when correlation between features is present. This aspect isn't perfect and affects all GAMs; however, EBMs seem to do a fairly good job in practice of putting the most weight on the features that are better at predicting the target, and will generally split the effect when two features are roughly equally good.
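A rough illustration of that effect-splitting behaviour, assuming a recent interpret release with term_names_ and term_importances() (synthetic data; the exact split will vary run to run):

```python
# Two nearly identical (highly correlated) features: the fitted EBM tends to
# share the effect between them rather than counting it twice.
import numpy as np
from interpret.glassbox import ExplainableBoostingRegressor

rng = np.random.default_rng(0)
x = rng.normal(size=5000)
X = np.column_stack([x, x + rng.normal(scale=0.01, size=5000)])  # correlated pair
y = 3.0 * x + rng.normal(scale=0.1, size=5000)

ebm = ExplainableBoostingRegressor(interactions=0)
ebm.fit(X, y)
print(dict(zip(ebm.term_names_, ebm.term_importances())))  # importance is typically shared
```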

We have some benchmarks on our readme, and many of the papers linked in our readme have independent evaluations.

piyushnegi97 commented 1 year ago

@paulbkoch So essentially EBM improves on PDP by choosing which feature to value more between two correlated features and splitting the effect instead of counting the effect twice as done in PDP?

Also, regarding my first point above: if we have a blackbox model where we use PDP plots to see the feature-score relationship, then if we switch to an EBM, can we use the shape function plots in a similar way as a replacement for PDP? (i.e., the shape functions of an EBM are analogous to blackbox PDPs)

For the second point above, I wanted to understand whether the PDP implementation in InterpretML (for blackbox models) follows the same process as openly available PDP packages, or the process described in Molnar's book. Or is there any difference in how InterpretML calculates PDPs for blackbox models?

Will go through the papers mentioned in the readme section. Thanks.

paulbkoch commented 1 year ago

Yes, yes, and I'm not sure (I didn't write this part), but the InterpretML implementation seems to be based on the Friedman paper: https://interpret.ml/docs/pdp.html#friedman2001greedy-pdp

@Harsha-Nori or @nopdive would know more about the PDP implementation.
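For reference, a hedged sketch of viewing both side by side. The PartialDependence constructor below follows the pdp.html docs linked above; older interpret releases took a predict_fn instead of the model object, so adjust for your version. The data is synthetic.

```python
# Compare an EBM's own shape functions with a Friedman-style PDP computed on
# the same fitted model (treated as a blackbox).
import numpy as np
from interpret import show
from interpret.blackbox import PartialDependence
from interpret.glassbox import ExplainableBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.sin(X[:, 0]) + X[:, 1] + rng.normal(scale=0.1, size=1000)

ebm = ExplainableBoostingRegressor(interactions=0)
ebm.fit(X, y)

show(ebm.explain_global())       # the EBM's own shape functions
pdp = PartialDependence(ebm, X)  # PDP computed over the same model
show(pdp.explain_global())
```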

piyushnegi97 commented 1 year ago

Also, I wanted to understand the following parameters. How do they affect the scores, and what do they mean?

  1. Outer bags - I understand these are used to generate the error bounds that can be seen on the shape function graphs
  2. Inner bags - ?
  3. Greediness - ?
  4. Smoothing rounds - ?

paulbkoch commented 1 year ago

1) We bag the dataset and generate models on each of those bags. This is the outer bagging. The error bounds on the graphs are the standard deviations of the outer bagged models, and the final model is the mean.

2) Each of the outer bagged subsets can themselves be bagged. This is the inner bagging. In round robin boosting we visit each feature in order, and we generate a tree for each inner bag. The averaged update is applied to the feature before moving to the next feature. To be honest, I'm not sure that inner bagging provides much benefit, and it might even be harmful if you have sufficient outer bagging. If you are going to try inner bagging, also try keeping it off.

3) If you set greediness to 0.0 (the current default), you get the original round robin cyclic boosting algorithm. Setting greediness to 1.0 will greedily boost on the features with the highest gain. Setting greediness to 0.5 will do one full round robin round where all features are boosted on, then it will do a greedy round where it will boost on the features with highest gain, and then it will repeat. Setting greediness to 0.75 will do 3 greedy rounds per round robin round. If you want to try this parameter, something in the neighborhood of 0.75 seems to work well.

4) Smoothing rounds are a set of highly regularized precursor boosting rounds that are done before commencing the regular boosting rounds. The main difference is that during the smoothing rounds we randomly choose where to put the tree splits. This works surprisingly well by itself (see the differential privacy EBM paper). The idea here is to establish the basic shape function for each feature before moving to more aggressive gain-based boosting. If you set smoothing_rounds to the same value as max_rounds, you should get extremely smooth graphs that lose just a bit of performance. I recommend trying something in the range of 0-500, which allows the basic shapes to be formed but follows with some sharpening. On some datasets a bit of smoothing seems to improve the model, but this effect is quite variable, so smoothing is one place where hyperparameter tuning might help (see the configuration sketch below).
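An illustrative configuration touching the four parameters above (not a recommendation; the parameter names match the interpret release discussed in this thread and may differ in newer versions, e.g. the greediness knob has since been reworked):

```python
# Illustrative settings only; values are examples, not tuned defaults.
from interpret.glassbox import ExplainableBoostingClassifier

ebm = ExplainableBoostingClassifier(
    outer_bags=8,          # bag the dataset 8 times; graphs show mean +/- stddev across bags
    inner_bags=0,          # extra bagging inside each outer bag; 0 keeps it off
    greediness=0.75,       # roughly 3 greedy rounds per cyclic round-robin round
    smoothing_rounds=200,  # regularized, random-split precursor rounds before regular boosting
    max_rounds=5000,       # upper bound on boosting rounds
)
# then ebm.fit(X, y) as usual
```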

piyushnegi97 commented 11 months ago

Hey, thanks for your assistance. Your answers have been super helpful. Is there a way we can store the trees that are generated? I wanted to understand how the EBM learns from the data by deep diving into the trees.

paulbkoch commented 11 months ago

The trees have an ephemeral lifespan and exist entirely in C++. Here's the priority queue loop where the trees are generated (this section is just for the mains):

https://github.com/interpretml/interpret/blob/262d698d6346e20227971ebba8126b1bf26211d4/shared/libebm/PartitionOneDimensionalBoosting.cpp#L620-L690

Immediately below this loop we collapse the trees into a flattened representation:

https://github.com/interpretml/interpret/blob/262d698d6346e20227971ebba8126b1bf26211d4/shared/libebm/PartitionOneDimensionalBoosting.cpp#L708

In the python layer you can't get the original trees, but you can retrieve the flattened representation, which preserves most of the same information except the order in which the tree cuts were chosen. The booster.get_term_update_splits function will retrieve where the tree cuts were made, and booster.get_term_update will retrieve the additive score update at each boosting step. These functions are not called when constructing typical EBMs since they are not needed; in the code linked below they are called when building differentially private EBMs, but you can modify the code slightly to call them and write the information out to a log if you want.

https://github.com/interpretml/interpret/blob/262d698d6346e20227971ebba8126b1bf26211d4/python/interpret-core/interpret/glassbox/_ebm/_boost.py#L95-L97