interpretml / interpret

Fit interpretable models. Explain blackbox machine learning.

R: intercept may not be handled correctly #417

Open · tammandres opened this issue 1 year ago

tammandres commented 1 year ago

Hi all,

I have enjoyed using the Python package, but I also needed the R package. I noticed that the predicted probabilities from an EBM model created with R are too high when the dataset is imbalanced. I wonder if an intercept term may be missing from the model and/or from the predict_proba function? It looks like ebm_predict_proba simply sums the contributions from each feature function and applies the sigmoid, without adding an intercept before the sigmoid.
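To make my suspicion concrete, here is a minimal Python sketch of what I think the R prediction is doing. The names term_scores and intercept are illustrative only, not the actual fields of the R model object:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba_sketch(term_scores, intercept=0.0):
    # term_scores: array of shape (n_samples, n_features) holding the
    # contribution looked up from each feature's graph for each sample.
    logit = intercept + term_scores.sum(axis=1)
    return sigmoid(logit)

# My suspicion: the R code effectively runs this with intercept fixed at 0.0.
# A centered model of a rare positive class needs a negative intercept, so
# dropping it would push every predicted probability upward.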

Example: a dummy imbalanced dataset with 10% positive cases. The average predicted probability is 0.39, much higher than the 0.1 base rate.

df <- data.frame(x = seq(1, 100), y = c(rep(0, 90), rep(1, 10)))  # 10% positives
clf <- ebm_classify(df['x'], df$y)
prob <- ebm_predict_proba(clf, df['x'])
print(mean(prob))   # mean predicted probability
print(mean(df$y))   # observed positive rate

> print(mean(prob))
[1] 0.3949367
> print(mean(df$y))
[1] 0.1

Many thanks, Andres

paulbkoch commented 1 year ago

Hi @tammandres -- I'll have to look into the imbalanced issue, but I'm pretty sure it isn't being caused by a missing intercept. EBMs, like all GAMs, allow weight to shift between features. To make the models identifiable, GAMs typically center each graph so that its average contribution is zero, and move that weight into the intercept. This operation does not change the model's predictions. You can see this operation here in the Python code:

https://github.com/interpretml/interpret/blob/e1abdd12ddb3f255b3d75293b326dae546bcf668/python/interpret-core/interpret/glassbox/ebm/utils.py#L281

The R package does not currently center the graphs, but this should not lead to a change in the predictions.
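For concreteness, the centering amounts to something like the following numpy sketch. The names are illustrative, and it glosses over details such as per-bin weights that the linked implementation handles:

import numpy as np

def center_terms(term_scores, intercept):
    # term_scores: list of 1-D arrays, one per feature, holding each bin's
    # additive contribution; intercept: scalar.
    centered = []
    for scores in term_scores:
        shift = scores.mean()    # average contribution of this graph
        intercept += shift       # move that weight into the intercept
        centered.append(scores - shift)
    # For any sample, intercept + sum of its bin scores is unchanged,
    # so predictions are identical before and after centering.
    return centered, intercept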

tammandres commented 1 year ago

Hi @paulbkoch, many thanks for the response and the explanation! For comparison, here is the same example in Python, which gives the expected result: the average predicted probability on an imbalanced dataset is roughly the proportion of positive cases:

import numpy as np
import pandas as pd
import interpret
from interpret.glassbox import ExplainableBoostingClassifier

# Same dummy data as in R: x = 1..100, y = 0 for the first 90 rows, 1 for the last 10
df = pd.DataFrame(zip(np.arange(1, 101), np.append(np.zeros(90), np.ones(10))),
                  columns=['x', 'y'])
clf = ExplainableBoostingClassifier()
clf.fit(df[['x']], df.y)
prob = clf.predict_proba(df[['x']])[:, 1]  # probability of the positive class

print('Mean predicted prob: {}'.format(prob.mean()))
print('Mean observed prob: {}'.format(df.y.mean()))
print('Package version: {}'.format(interpret.__version__))

Mean predicted prob: 0.10025197868529358
Mean observed prob: 0.1
Package version: 0.3.2