haifengl / smile

Statistical Machine Intelligence & Learning Engine
https://haifengl.github.io

TreeSHAP Values are inconsistent #670

Closed ntrost-targ closed 8 months ago

ntrost-targ commented 3 years ago

Describe the bug When calculating TreeSHAP values for random forest classification, they don't add up. I would expect that the prediction from .vote() minus the respective SHAP values gives me the base value, which is constant and should be the same for different observations. Note that this is the behaviour we observe in Lundberg's Python module. Also, it would be really handy if there were a function that just calculates the base value (expected_value in Python) for me.

Expected behavior When calculating TreeSHAP values, I expect them, together with the base value, to add up to the predicted probability.

Actual behavior The calculated base values vary even for observations from the same class.

Code snippet

val iris = read.arff("../data/weka/iris.arff")

val formula: Formula = "class" ~
val x = formula.x(iris).toArray
val y = formula.y(iris).toIntArray

val model = smile.classification.randomForest(formula, iris)

// class posteriors (from tree votes) for two samples of class 1
val arr50 = new Array[Double](3)
val arr52 = new Array[Double](3)

model.vote(iris(50), arr50)
model.vote(iris(52), arr52)

// flat SHAP arrays, one value per (feature, class) pair
val shap_50 = model.shap(iris(50))
val shap_52 = model.shap(iris(52))

// subtract the class-1 SHAP values; (i + 2) % 3 == 0 selects class index 1
arr50(1) - shap_50.indices.filter(i => (i + 2) % 3 == 0).map(shap_50).sum
// res15: Double = 0.41123849878987584
arr52(1) - shap_52.indices.filter(i => (i + 2) % 3 == 0).map(shap_52).sum
// res16: Double = 0.4571260444068466
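For reference, the index filter above assumes that smile returns the SHAP values as one flat array in feature-major, class-minor order, so that (i + 2) % 3 == 0 picks out the entries for class index 1 (this layout is inferred from the snippet, not from the smile docs). A minimal NumPy illustration of the same selection:

```python
import numpy as np

# Dummy flat SHAP array for 4 features x 3 classes, assumed layout
# [f0_c0, f0_c1, f0_c2, f1_c0, ...] (an assumption inferred from the snippet).
shap_flat = np.arange(12, dtype=float)

# (i + 2) % 3 == 0 is equivalent to i % 3 == 1, i.e. class index 1.
mask = np.array([(i + 2) % 3 == 0 for i in range(len(shap_flat))])
class1 = shap_flat[mask]

# Equivalent, and arguably clearer: reshape to (features, classes) and slice.
class1_alt = shap_flat.reshape(4, 3)[:, 1]

print(class1)                              # [ 1.  4.  7. 10.]
print(np.array_equal(class1, class1_alt))  # True
```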

Input data Iris data set


haifengl commented 3 years ago

What if you use model.predict() instead of model.vote()?

ntrost-targ commented 3 years ago

When I tried it, model.predict() gave me only the most probable class as output, not the class probabilities.

ntrost-targ commented 3 years ago

I also tried model.score(), but that threw an error.

haifengl commented 3 years ago

predict is overloaded. Try predict(x, prob) where prob is an array for output.

ntrost-targ commented 3 years ago

I tried; the problem persists, albeit with smaller variance. I now get 0.3267 vs. 0.3289, which is a more realistic value given the balanced 3-class data set. With Lundberg's package that variance is much smaller.
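Part of the earlier gap may simply come from what the two methods return. Assuming vote() reports the fraction of trees that voted for each class while predict(x, prob) averages the per-tree posteriors (an assumption about smile's internals, not confirmed by the docs), the two can differ substantially for the same sample, and SHAP additivity can only hold against the quantity the SHAP values actually decompose. A toy illustration:

```python
import numpy as np

# Hypothetical per-tree class posteriors for one sample in a 3-tree,
# 3-class forest (made-up numbers, purely for illustration).
tree_probs = np.array([
    [0.10, 0.70, 0.20],
    [0.00, 0.90, 0.10],
    [0.40, 0.50, 0.10],
])

# vote()-style output: fraction of trees whose majority class is each class.
votes = np.bincount(tree_probs.argmax(axis=1), minlength=3) / len(tree_probs)

# predict(x, prob)-style output: mean of the per-tree posteriors.
mean_prob = tree_probs.mean(axis=0)

print(votes)      # [0. 1. 0.]
print(mean_prob)  # [0.16666667 0.7        0.13333333]
```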

haifengl commented 3 years ago

I don't understand this part:

I would expect that the prediction from .vote() minus the respective SHAP Values gives me the base value which is constant

Why? Do you have a link or a paper that supports this?

mroettig commented 3 years ago

Hi,

I am a colleague of ntrost-targ. Our assumption comes from the local accuracy / additivity property of explanations (see https://ema.drwhy.ai/breakDown.html#BDMethodGen and the Titanic example there, with base value = 0.2353095 and posterior probabilities as f(x)). It says that the posterior probability for any sample x is the sum of a common base probability (the base SHAP value, i.e. the mean class posterior over the full dataset) and of local attribution effects (the local SHAP values coming from model.shap(iris(50))) for that sample.

That is, f(x) = phi_0 + \sum_{i=1}^{M} phi_i(x) = p(x) = class posterior.

(see also https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7326367/#S10title , Property (1) ).

We just wanted to assert equality with the Python (Lundberg) implementation, so we did the reverse computation p(x) - \sum_{i=1}^{M} phi_i(x) = f(x) - \sum_{i=1}^{M} phi_i(x) = phi_0, which should give us a roughly constant phi_0 for all samples (modulo numerical issues). The Python implementation always gives phi_0, diverging only from the fourth decimal place onward; the SMILE values start to diverge from the third decimal place. The range for phi_0 in one setting was [0.50026, 0.50597]. Might be nitpicking here ;) We were just wondering.
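The property can be checked mechanically: for any explainer that satisfies local accuracy exactly, the reverse computation recovers the same phi_0 for every sample, up to floating-point rounding. A synthetic sketch (made-up attributions, not real SHAP values):

```python
import numpy as np

rng = np.random.default_rng(0)

phi0 = 0.33                            # hypothetical base value (mean class posterior)
phi = rng.normal(0.0, 0.1, (150, 4))   # hypothetical per-feature attributions

# If local accuracy holds, f(x) = phi0 + sum_i phi_i(x) for every sample:
f = phi0 + phi.sum(axis=1)

# Reverse computation: f(x) - sum_i phi_i(x) should be a constant phi0.
recovered = f - phi.sum(axis=1)
print(recovered.std())   # ~0 (only float rounding)
print(recovered.mean())  # ~0.33
```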

Cheers, Marc

haifengl commented 3 years ago

Hi Marc, thanks for the explanation. Although not 100% sure, I think that this small difference comes from the smoothing of the posterior probability. Depending on the leaf node size, this smoothing may have a slightly different impact on the posterior probability calculation.

If you choose two samples hitting the same leaf node, I guess that this difference will be smaller. It is hard to know whether two samples arrive at the same leaf node. As a workaround, I suggest computing the difference on all the samples of one class. I guess that you will find several clusters of values with tiny differences.
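To illustrate why leaf size could matter: if the leaf posterior is smoothed (e.g. Laplace-style, adding a pseudo-count per class; whether and how smile smooths is an assumption here, not taken from its source), small leaves are pulled towards the uniform distribution more strongly than large ones, so two samples of the same class can get slightly different posteriors depending on which leaves they hit:

```python
import numpy as np

def leaf_posterior(counts, alpha=1.0):
    """Laplace-smoothed class posterior for one leaf.

    counts: training-sample class counts in the leaf.
    alpha:  hypothetical pseudo-count; smile's actual smoothing
            (if any) may differ.
    """
    counts = np.asarray(counts, dtype=float)
    return (counts + alpha) / (counts.sum() + alpha * len(counts))

# Two leaves that are both pure for class 1, but of different size:
small = leaf_posterior([0, 3, 0])    # 3 samples  -> P(class 1) = 4/6  ~ 0.667
large = leaf_posterior([0, 30, 0])   # 30 samples -> P(class 1) = 31/33 ~ 0.939
print(small[1], large[1])
```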

ntrost-targ commented 3 years ago

Hi Haifeng,

I tried the same calculation with gradient boosted trees (smile.classification.gbm(formula, iris)) but arrive at 0.59 vs. 0.69 (which is also an odd value in absolute terms; I'm expecting roughly 0.33). Also, the variance over the whole iris dataset in Python is negligible, on the order of 1e-16; see the script below. Local explainability with SHAP is very important for us, and we would be thankful if you could take a deeper look into the issue. I would expect any numerical issues to have far smaller variance than what I see here.

import shap
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
import numpy as np

iris = datasets.load_iris()
clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)  # fit before building the explainer
explainer = shap.TreeExplainer(clf)

probs = clf.predict_proba(iris.data).transpose()[0]
shap_values = explainer.shap_values(iris.data)

b_0 = np.array([probs[i]-shap_values[0][i].sum() for i in range(150)])

b_0.std()
# 1.6922557229846184e-16

b_0.mean()
# 0.32906666666666645

explainer.expected_value
# array([0.32906667, 0.33373333, 0.3372    ])

Notice how the backward calculation matches explainer.expected_value for the first class.

Bests, Nikolaus :)

mroettig commented 3 years ago

Hi Haifeng,

I just came across the Commercial License Usage clause in the SMILE license, which applies when using SMILE in a commercial setting (i.e. incorporating SMILE in commercial products). But I could not find any further details on the website regarding the modalities and costs of the commercial license.

Could you give us details on that topic and on when commercial licensing is required? And could we request a deeper look into our SHAP issue on your side if we become commercial subscribers?

Thanks a lot in advance + Cheers, Marc

haifengl commented 3 years ago

@mroettig please contact me by email.