interpretml / interpret

Fit interpretable models. Explain blackbox machine learning.
https://interpret.ml/docs
MIT License

Computing p-values from EBM scores with z-test #309

Open abdjiber opened 2 years ago

abdjiber commented 2 years ago

Hello,

Thank you for making this great package! I'd like to know if we can compute the p-value of a given score/weight (a logit from an EBM global explanation) using a two-tailed z-test, as is done in logistic regression:

                                         z-score = score / STD
                                         p-value = 2 * Φ(-|z-score|)

where STD is the standard deviation of the score, Φ is the cumulative distribution function of the standard normal distribution, and |·| denotes the absolute value.

If so, are the expectations/means of the scores for each feature taken to be 0, because E[f_j] = 0 in the following equation? [image: EBM additive model equation]

Because the z-score is defined by:

                                          z-score = (score - m) / STD

where m corresponds to the mean score.
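The two formulas above can be sketched in a few lines of Python (a minimal illustration; `two_tailed_p` is a hypothetical helper name, and the standard normal CDF is computed from the error function so only the standard library is needed):

```python
import math

def two_tailed_p(score, std, mean=0.0):
    """Two-tailed z-test p-value for a single score.

    Assumes the score is approximately normal with the given mean
    and standard deviation (mean defaults to 0).
    """
    z = (score - mean) / std
    # Standard normal CDF via the error function:
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    phi = 0.5 * (1.0 + math.erf(-abs(z) / math.sqrt(2.0)))
    return 2.0 * phi

# Example: a score of 0.8 with a standard deviation of 0.3
p = two_tailed_p(0.8, 0.3)
```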

Thank you!

interpret-ml commented 2 years ago

Hi @abdjiber,

Thanks for the detailed question! Unfortunately, things are a bit more complex in the EBM setting when compared to linear models.

There are two different ways to look at significance testing in our setting: significance of an entire feature (e.g. "Age"), or significance of a region within a feature (e.g. 75 <= "Age" < 80). For each bin within a feature, we do store the sample mean and sample standard deviation derived from bagging. Given that (and an asymptotic normality assumption), we can use the method you described to produce a z-score (and corresponding p-value) per bin. This may be useful if you're interested in, say, assessing the significance of each score when looking at a prediction on a single sample.

However, things are unfortunately messier when we try to analyze an entire feature at a time. Repeating this procedure for every bin within a feature (which, by default, can be up to 256 bins) leads to a large multiple hypothesis testing problem, and there's also significant correlation between the means and standard deviations in nearby bins (because they are often learned from very similar segments of data).

We're not entirely sure of how to construct appropriate test statistics and p-values for entire features in our setting. A natural null hypothesis for a feature level significance test might be something like "every bin in the feature has mean zero". While there is some literature on significance testing for features in classic spline GAMs (e.g. Marra & Wood, 2011), it unfortunately doesn't translate well to our setting.

We'd love to collaborate with anyone who has ideas in this space!

-InterpretML Team

abdjiber commented 2 years ago

Hi InterpretML Team,

Thank you for the detailed response! In my use case, the significance per bin is sufficient. Most of my features are binary, and I have a few that are continuous, including age. For the latter, I'm mostly interested in the significance of certain bins (low and high values).

I've used the scores and standard deviations produced by the EBM global explanations data and developed a tool that dynamically computes the p-values based on the z-scores.

[image: EBM scores annotated with p-values]

It would be helpful to have the significance of scores from EBM as in the figure above.

As a personal thought, I think EBM explanations with a p-value attached to each score offer a more flexible way of understanding/interpreting a model. Compared to logistic regression and other models, where there is a single equation and one p-value per feature, we can have as many equations and p-values as there are classes and bins.

Thank you again!

zeydabadi commented 2 years ago

@abdjiber Would it be possible for you to share the tool you mentioned here?

... "I've used the scores and standard deviations produced by the EBM global explanations data and developed a tool that dynamically computes the p-values based on the z-scores."