shap / shap

A game theoretic approach to explain the output of any machine learning model.
https://shap.readthedocs.io
MIT License
22.46k stars 3.25k forks

BUG: TreeSHAP Interventional explanations segmentation fault #3486

Open amir-rahnama opened 7 months ago

amir-rahnama commented 7 months ago

Issue Description

When explaining RandomForest models (and sometimes scikit-learn GradientBoosting models) with the TreeSHAP explainer, the process crashes with a segmentation fault.

I assumed this was fixed in "BUG: Interventional TreeSHAP failing for large depth tree-based models", but it still occurs in SHAP 0.44.1 and even 0.43.0.

Colab link: https://colab.research.google.com/drive/1bAH-4WclIJQPvPwoY3cZ-UZu8MAccaTO?usp=sharing

Minimal Reproducible Example

import numpy as np
import shap
from sklearn.datasets import fetch_openml
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

random_state = 10
np.random.seed(random_state)

dset_id = 293

ames = fetch_openml(data_id = dset_id, as_frame='auto')
ames.target[ames.target == '-1'] = 0
ames.target[ames.target == '1'] = 1

target = ames.target.astype(int)
ames.data = ames.data.toarray()

if np.sum(np.isnan(ames.data)):
    feat_col_means = np.nanmean(ames.data, axis=0)
    ames.data = np.where(np.isnan(ames.data), feat_col_means, ames.data)

X_train, X_test, y_train, y_test = train_test_split(ames.data, target, test_size=0.33, random_state=10)

model_name = 'rf'

if model_name == 'rf':
    model = RandomForestClassifier(random_state=random_state)
else: 
    model = GradientBoostingClassifier(random_state=random_state)

model.fit(X_train, y_train)

shap_explainer = shap.TreeExplainer(model, X_train, 
                                    feature_perturbation="interventional",
                                    model_output='probability')

def tree_shap_exp(instances, x_train, model_obj, x_test):
    shap_values = shap_explainer.shap_values(instances, check_additivity=False)
    shap_values = np.array(shap_values)

    return shap_values

res = tree_shap_exp(X_test[:10], X_train, model, X_test)

Traceback

Segmentation fault (core dumped)
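
A bare "Segmentation fault" message carries no Python-level context. One way to get more detail, sketched here as a suggestion and not something the reporter ran, is to execute the reproduction in a child interpreter with the standard-library `faulthandler` enabled, so the crash does not take down the notebook kernel and the Python stack at the time of the fault is written to stderr. The helper names `run_isolated` and `describe` are hypothetical.

```python
import subprocess
import sys


def run_isolated(code: str, timeout: float = 600.0) -> subprocess.CompletedProcess:
    """Run `code` in a fresh interpreter with faulthandler enabled.

    Hypothetical helper: a crash in the child leaves the parent process alive,
    and faulthandler dumps the Python traceback to the child's stderr.
    """
    return subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )


def describe(result: subprocess.CompletedProcess) -> str:
    """Summarize how the child process ended."""
    # On POSIX, a negative return code means the child was killed by a signal
    # with that number. On Windows a crash shows up as a nonzero exit code instead.
    if result.returncode < 0:
        return f"killed by signal {-result.returncode}"
    return f"exited with code {result.returncode}"


if __name__ == "__main__":
    # Stand-in payload; replace the string with the reproduction script's source.
    result = run_isolated("print('ok')")
    print(describe(result))
    print(result.stderr)
```

If the reproduction is passed in as the payload, `result.stderr` should contain whatever traceback `faulthandler` managed to emit before the process died.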

Expected Behavior

I expect TreeSHAP to work, since neither the dataset nor the tree model is too large to fit in memory.

amir-rahnama commented 7 months ago

Here are the crash logs on Google Colab:

[Screenshot 2024-02-06 at 11 59 27]

amir-rahnama commented 7 months ago

UPDATE: If you set max_depth=30 when training the RF, the issue goes away.
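
The observation above suggests tree depth may be involved, although the thread does not confirm the root cause. A minimal sketch for checking how deep the fitted trees actually are, and for applying the `max_depth=30` cap, might look like the following. It uses synthetic stand-in data from `make_classification` instead of the OpenML dataset, and `max_tree_depth` is a hypothetical helper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def max_tree_depth(forest) -> int:
    """Depth of the deepest tree in a fitted random forest (hypothetical helper)."""
    return max(tree.get_depth() for tree in forest.estimators_)


# Synthetic stand-in data (assumption); the issue itself uses OpenML dataset 293.
X, y = make_classification(n_samples=2000, n_features=20, random_state=10)

# Default settings: trees grow until the leaves are pure, so depth is unbounded.
unbounded = RandomForestClassifier(n_estimators=20, random_state=10).fit(X, y)

# Workaround from the comment above: cap the depth at 30.
capped = RandomForestClassifier(n_estimators=20, max_depth=30, random_state=10).fit(X, y)

print("max depth, default settings:", max_tree_depth(unbounded))
print("max depth, max_depth=30:", max_tree_depth(capped))
```

Running the same check on the model from the reproduction would show whether its trees exceed the depth at which the workaround kicks in.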

Possible other issues related to this:

CloseChoice commented 6 months ago

Thanks for your report. I remember that we had the problem previously. Seems like a deep bug in our C code

stergioa commented 3 months ago

> Thanks for your report. I remember that we had the problem previously. Seems like a deep bug in our C code

Any fix?

CloseChoice commented 3 months ago

@stergioa thanks for the gentle push. AFAIK nobody is working on this. I just ran the example and it doesn't crash on my Windows PC. We would need a way to reproduce the segfault before we can start working on this.
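
Since the crash shows up on Colab but not on the Windows machine, comparing environments is a natural next step. The sketch below, offered as a suggestion, collects the platform and package versions using only the standard library, so it runs even where a package is missing. The function name `collect_versions` and the default package list are assumptions.

```python
import platform
import sys
from importlib import metadata


def collect_versions(packages=("shap", "scikit-learn", "numpy", "scipy")):
    """Return the Python version, platform string, and installed package versions."""
    info = {"python": sys.version.split()[0], "platform": platform.platform()}
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Report missing packages instead of raising.
            info[name] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```

Posting this output from both a crashing and a non-crashing machine would make differences in OS, Python, or library versions easy to spot.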