gtmdotme opened this issue 4 years ago
I think I know what's going on here. I'll give it a go soon.
Reading the Asymmetric Shapley Values paper again, I realised that the definition of global Shapley values differs between the two.
But even accounting for both definitions, the Shapley values vary a lot between the two implementations of symmetric SHAP (the one your repo implements and the one from the SHAP package).
Alright. Disclaimer: I've been putting my open-source dev time into other projects lately. This package is still experimental, but I plan on revisiting it in a very dedicated way in a couple of weeks.
The first check I did is below. If you run the code from this vignette and then the code below, you'll get the following plot. I've connected the same explained instances with a black line to give a better sense of the separation. The agreement is fairly strong, and I trust that the good folks at catboost have a solid TreeSHAP implementation. The comparison is in log-odds space, however... (continued below the image).
```r
library(tidyr)
library(ggplot2)

# Keep numeric features only and reshape so both algorithms' Shapley
# values are in one column for plotting.
data_plot <- data_all[!data_all$feature_name %in% names(cat_features), ]
data_plot <- tidyr::pivot_longer(data_plot,
                                 cols = c("shap_effect", "shap_effect_catboost"),
                                 names_to = "algorithm", values_to = "shap_effect")
data_plot$feature_value <- as.numeric(as.character(data_plot$feature_value))

# Connect the two Shapley values for each explained instance with a black line.
p <- ggplot(data_plot, aes(feature_value, shap_effect, color = algorithm, group = index))
p <- p + geom_point(alpha = .25)
p <- p + geom_line(color = "black")
p <- p + facet_wrap(~ feature_name, scales = "free")
p <- p + theme_bw() + xlab("Feature values") + ylab("Shapley values") +
  theme(axis.title = element_text(face = "bold"), legend.position = "bottom") +
  labs(color = NULL)
p
```
`shap` is an awesome and much more fully featured package (I'm going to focus more on causality in shapFlex when I get back to it). It does several things behind the scenes; namely, a special constraint algorithm converts the Shapley values from log-odds space to probability space while keeping the additivity property, so the feature-level Shapley values still sum to the model's predicted probability (less the baseline). I don't have this correction anywhere, which is more of a problem for classification than for regression. This is likely the main source of the difference.
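To make the log-odds vs. probability issue concrete, here is a minimal numpy sketch with made-up numbers. It shows one simple way to rescale log-odds Shapley values so they stay additive in probability space; this illustrates the idea only, not the exact algorithm the `shap` package uses internally:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

base_log_odds = -0.4                      # hypothetical baseline in log-odds
phi = np.array([0.9, -0.3, 0.5])          # hypothetical log-odds Shapley values

# Additivity holds in log-odds space: baseline + sum(phi) is the prediction.
p = sigmoid(base_log_odds + phi.sum())    # predicted probability
p_base = sigmoid(base_log_odds)           # baseline probability

# Applying sigmoid feature-by-feature would break additivity, because
# sigmoid is nonlinear. One naive fix: distribute the total probability
# change (p - p_base) proportionally to each log-odds contribution.
phi_prob = phi * (p - p_base) / phi.sum()

# Additivity is restored in probability space.
assert np.isclose(p_base + phi_prob.sum(), p)
```

The proportional rescaling is just one possible constraint; the point is that some explicit correction is needed to move classification Shapley values out of log-odds space without losing additivity.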
A less important difference is that Shapley value calculations are model-dependent, and the Random Forest implementations here differ a fair bit in the details. Still, I would expect the Shapley values to be highly correlated. I'll produce the same plot as above, but in probability space, to see where things may be off.
I dumped the `adult_dataset` that you mention in the README into a CSV, ran a RandomForestClassifier with almost the same settings, and calculated SHAP values with the SHAP package in Python (from the author of the SHAP paper). I then compared these results with the symmetric counterpart of your library. Pseudocode:
As a side note, TreeExplainer finds exact SHAP values, but your results don't even match KernelExplainer's.
Thanks in advance.