nredell / shapFlex

An R package for computing asymmetric Shapley values to assess causality in any trained machine learning model

Result from symmetric shap does NOT match with the SHAP package #12

Open gtmdotme opened 4 years ago

gtmdotme commented 4 years ago

I dumped the adult_dataset that you mention in the ReadMe into a csv, ran a RandomForestClassifier with almost the same settings, and calculated SHAP values with the SHAP package in Python (the one provided by the authors of the SHAP paper). I then compared these results with the symmetric counterpart in your library.

  1. The two sets of global Shapley values do not match, even approximately.
  2. I also don't understand why, when calculating the global Shapley value, you take the mean of the per-instance Shapley values, while the SHAP paper suggests taking the mean of their absolute values.

Pseudo Code:

import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv('adult_dataset.csv')
# encode categorical variables and get features and labels in X, y
X, y = preprocess(df)
model = RandomForestClassifier(max_depth=6, random_state=0, n_estimators=300)
model.fit(X, y)

shap.initjs()
explainer = shap.TreeExplainer(model, data=X)
shap_values = explainer.shap_values(X)  # list of arrays, one per class

# Global Shapley values (positive class): mean of absolute per-instance values
gsv = np.mean(np.abs(shap_values[1]), axis=0)

As a side note, TreeExplainer finds exact SHAP values, but your results don't even match those from KernelExplainer.
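For reference, a rough sketch of how such a KernelExplainer check could look, continuing from the snippet above (the background and subsample sizes here are arbitrary choices for illustration, not the settings I actually used):

# Model-agnostic check: KernelExplainer approximates Shapley values from
# predict_proba alone, so it does not depend on TreeExplainer's internals.
background = shap.sample(X, 100)  # arbitrary background size
kernel_explainer = shap.KernelExplainer(model.predict_proba, background)
kernel_shap_values = kernel_explainer.shap_values(X[:200])  # subsample for speed

# Global importance for the positive class, mean of absolute values
gsv_kernel = np.mean(np.abs(kernel_shap_values[1]), axis=0)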

Thanks in advance.

nredell commented 4 years ago

I think I know what's going on here. I'll give it a go soon.

gtmdotme commented 4 years ago

Reading the Asymmetric Shapley Values paper again, I realised that the two papers define global Shapley values differently.

  1. The asymmetric Shapley value paper takes the expectation (equivalently, the mean) of the individual Shapley values.
  2. The symmetric SHAP paper takes the mean of the absolute Shapley values.

But under either definition, the Shapley values still differ a lot between the two symmetric implementations (the one your repo implements and the one from the SHAP package).
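For concreteness, a minimal numpy sketch of the two aggregations, assuming shap_values_class1 is the (n_instances, n_features) array of per-instance Shapley values for the positive class (e.g. shap_values[1] from the TreeExplainer code above):

import numpy as np

# Asymmetric Shapley value paper: plain mean of per-instance values,
# so positive and negative contributions can cancel.
global_mean = shap_values_class1.mean(axis=0)

# SHAP paper / package: mean of absolute values, a magnitude-only
# measure of global feature importance.
global_mean_abs = np.abs(shap_values_class1).mean(axis=0)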

nredell commented 4 years ago

Alright. Disclaimer: I've been putting my open-source dev time into other projects lately. This package is still experimental, but I plan to revisit it in a dedicated way in a couple of weeks.

The first check I did is below. If you run the code from this vignette and then run the code below, you'll get the plot shown after it. I've connected the same explained instances with a black line to get a better sense of the separation. The agreement is fairly strong, and I'm trusting that the good folks at catboost have a solid implementation of TreeSHAP. The comparison is in log-odds space, however (continued below the image).

# Keep only the numeric features for an apples-to-apples scatterplot.
data_plot <- data_all[!data_all$feature_name %in% names(cat_features), ]

# Stack the shapFlex and catboost Shapley values into long format.
data_plot <- tidyr::pivot_longer(data_plot, cols = c("shap_effect", "shap_effect_catboost"), 
                                 names_to = "algorithm", values_to = "shap_effect")

data_plot$feature_value <- as.numeric(as.character(data_plot$feature_value))

# Plot Shapley value vs. feature value; 'group = index' links the two
# algorithms' estimates for the same explained instance.
p <- ggplot(data_plot, aes(feature_value, shap_effect, color = algorithm, group = index))
p <- p + geom_point(alpha = .25)
p <- p + geom_line(color = "black")
p <- p + facet_wrap(~ feature_name, scales = "free")
p <- p + theme_bw() + xlab("Feature values") + ylab("Shapley values") + 
  theme(axis.title = element_text(face = "bold"), legend.position = "bottom") + labs(color = NULL)
p

[Plot: algorithm_comparison — shapFlex vs. catboost TreeSHAP Shapley values by feature, with the same explained instances connected in black]

shap is an awesome and much more fully featured package (I'm going to focus shapFlex more on causality when I get back to it). It does several things behind the scenes; namely, there is a special constraint algorithm that converts the log-odds space to the probability space while keeping the additivity property of Shapley values, so the feature-level Shapley values still sum to the difference between the predicted probability and the baseline. I don't have this correction anywhere, which is more of a problem for classification than for regression. This is likely the main source of the discrepancy.
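If it helps the comparison, here is a rough sketch of how that correction can be requested on the Python side, continuing from the snippet in the original post; I believe TreeExplainer exposes it through model_output, though the exact arguments may need double-checking:

# Sketch: ask TreeExplainer for probability-space Shapley values directly,
# so they live on the same scale as shapFlex's predict_function output.
# model_output="probability" needs a background dataset and the
# interventional feature perturbation.
explainer_prob = shap.TreeExplainer(
    model,
    data=X,
    feature_perturbation="interventional",
    model_output="probability",
)
shap_values_prob = explainer_prob.shap_values(X)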

A less important difference is that Shapley value calculations are model dependent, and the Random Forest implementations here differ a fair bit in the details. Still, I would expect the Shapley values to be highly correlated. I'll produce the same plot as above, but in probability space, to see where things may be off.
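As a quick way to quantify that expectation, a hypothetical per-feature correlation check (the array names below are placeholders for the two sets of Shapley values exported to numpy, not objects from the vignette):

import numpy as np

def per_feature_correlation(shap_a, shap_b):
    # shap_a, shap_b: (n_instances, n_features) Shapley values from the
    # two implementations, columns in the same feature order.
    return np.array([
        np.corrcoef(shap_a[:, j], shap_b[:, j])[0, 1]
        for j in range(shap_a.shape[1])
    ])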