shap / shap

A game theoretic approach to explain the output of any machine learning model.
https://shap.readthedocs.io
MIT License

Metrics to measure Shap accuracy #1423

Open dsvrsec opened 3 years ago

dsvrsec commented 3 years ago

Are there any metrics available to measure SHAP explainability for a prediction?

HughChen commented 3 years ago

You can check out Figure 3 of the TreeExplainer paper for some metrics (by accuracy I assume you mean how well the attributions capture model behavior).

Note that many of these metrics are "ablation" metrics that aim to remove/add features based on the SHAP values (or feature attributions). For instance, if a feature has a very high positive attribution, that means it positively impacts the model's output. If we remove that feature for a specific sample, then the model's output should probably decrease.
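To make the idea concrete, here is a minimal sketch of such an ablation check, assuming a fitted model `model` with a `predict` method, a 1-D explicand `x`, its attributions `shap_values`, and a background matrix `X_background` (all hypothetical names):

```python
import numpy as np

def ablation_drop(model, x, shap_values, X_background, k=3):
    """Mean-impute the k features with the largest positive attributions and
    report how much the model output drops (a large drop suggests the
    attributions track model behaviour)."""
    X_background = np.asarray(X_background, dtype=float)
    baseline = model.predict(x.reshape(1, -1))[0]
    top_positive = np.argsort(-shap_values)[:k]   # most positive attributions first
    x_ablated = x.astype(float).copy()
    x_ablated[top_positive] = X_background.mean(axis=0)[top_positive]
    ablated = model.predict(x_ablated.reshape(1, -1))[0]
    return baseline - ablated
```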

As a final note, (interventional) Shapley values should in general do very well on these metrics, because ablation tests are close to the definition of Shapley values (which are the average marginal contribution of a feature). One additional subtlety is that many approaches are approximate solutions (the Sampling/IME explainer and the Kernel explainer, whereas the interventional Tree explainer is exact), so it can be tough to disentangle whether your metric is measuring how good your approximation is or how good the feature attribution itself is.
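For completeness, the "average marginal contribution" definition can be written out directly. This brute-force sketch assumes a hypothetical set function `v(S)` that returns the model's value for a coalition `S` of feature indices (for example, an interventional expectation like the one discussed later in this thread):

```python
from itertools import permutations

def exact_shapley(v, n_features):
    """Exact Shapley values by averaging each feature's marginal contribution
    over every ordering of the features. Cost is O(n_features!), so this is
    only feasible for a handful of features."""
    phi = [0.0] * n_features
    orderings = list(permutations(range(n_features)))
    for order in orderings:
        S = set()
        for i in order:
            phi[i] += v(S | {i}) - v(S)   # marginal contribution of i given the features before it
            S.add(i)
    return [p / len(orderings) for p in phi]
```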

barnardp commented 3 years ago

Hi Chen,

Thanks for the advice on the TreeExplainer paper! I just have a few questions. The paper mentions that Python implementations of the benchmark metrics are available at https://github.com/suinleelab/treeexplainer-study, but I can't seem to find them there.

I've since implemented some of the metrics myself based on my understanding of their descriptions (including the arXiv version of the paper). One thing I've noticed, however, is that when I use the Kernel explainer on more than, say, 13 features with a background of 200+ samples and nsamples=2^13, I still only seem to score 0.6 on the local accuracy metric. Is there a reason why this is happening? I thought the Kernel explainer would be exact when I enumerate all 2^M subsets.

Also, I'm not entirely sure how I would check the consistency of the explainer. Are there any examples available on how I could check for this? I don't mind the computational burden, as I'm in an awkward situation where I'm required to rigorously evaluate my work despite the theoretical certainties of Shapley values.

Any help would be much appreciated!

Kindest regards, Pieter

HughChen commented 3 years ago

Hi Pieter,

The benchmarks are implemented here!

I believe that Kernel Explainer should satisfy local accuracy - have you tried any of the other explainers? Two possible subtleties are (1) being careful about which output you explain for binary classification (probability or log-odds space), and (2) the l1_reg parameter in Kernel Explainer regularizes the number of important features; you can set it to l1_reg="num_features(13)" so that all 13 features are kept.
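For reference, a rough sketch of that check, assuming a fitted binary classifier `model`, a background summary `X_background` (e.g. from `shap.sample` or `shap.kmeans`), and a 1-D explicand `x` with 13 features (hypothetical names):

```python
import numpy as np
import shap

def f(X):
    return model.predict_proba(X)[:, 1]          # explain in probability space explicitly

explainer = shap.KernelExplainer(f, X_background)

M = x.shape[0]                                   # e.g. 13 features
phi = explainer.shap_values(
    x,
    nsamples=2 ** M,                             # enumerate all coalitions (see the note below about the background size)
    l1_reg=f"num_features({M})",                 # keep all M features rather than regularizing some away
)

# Local accuracy: base value plus attributions should reproduce the model output.
reconstruction = np.ravel(explainer.expected_value)[0] + np.ravel(phi).sum()
print(np.isclose(reconstruction, f(x.reshape(1, -1))[0], atol=1e-6))
```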

If you are willing to incur exponential cost, you could also try SamplingExplainer, which Monte Carlo samples from the exponential number of subsets, although it may sample the same subsets multiple times, so it may not be exact (in which case you may want to just implement the exact calculation yourself). A final subtlety is that, given the number of background samples you use, you may actually need (# background samples)*2^13 samples to get exact solutions.
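If you do go the SamplingExplainer route, a possible call (reusing the hypothetical `f`, `X_background`, `x`, and `M` from the sketch above, and scaling `nsamples` by the background size as suggested) might look like:

```python
import shap

sampling_explainer = shap.SamplingExplainer(f, X_background)
phi_sampling = sampling_explainer.shap_values(
    x,
    nsamples=len(X_background) * 2 ** M,   # evaluation budget scaled by the background size, per the note above
)
```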

Checking consistency is awkward, as you said, because it should be guaranteed, but you can probably do it by defining a set function v(S) that calculates the interventional conditional expectation of the model output (which is what the interventional Kernel/Sampling/Tree explainers compute) given an explicand and a background distribution.
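As a rough sketch of that set function (hypothetical names: a prediction function `f`, a 1-D explicand `x`, and a background matrix `X_background`), which could then be plugged into an exact enumeration like the one sketched earlier in the thread:

```python
import numpy as np

def make_v(f, x, X_background):
    """Return v(S): the expected model output when the features in S are fixed
    to the explicand's values and the remaining features follow the background
    distribution (the interventional conditional expectation)."""
    X_background = np.asarray(X_background, dtype=float)

    def v(S):
        X = X_background.copy()
        idx = list(S)
        X[:, idx] = x[idx]                 # intervene: pin the features in S to the explicand's values
        return f(X).mean()

    return v
```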

barnardp commented 3 years ago

Hi @HughChen ,

Thanks very much for your help, my results have been much better since trying out your suggestions - somehow I wasn't aware that we have to multiply nsamples by the background size as well!

Also, thanks for pointing out where to find the benchmarking code. Unfortunately I'm finding it hard to follow and decipher how the code actually works. Do you think it would perhaps be possible to create a notebook down the line that gives a brief overview of how to use the various functions in our own code?

I think these benchmarking methods can be extremely helpful not only for comparing the various explainers (Kernel, Sampling, Linear, Tree, etc.) but also for helping us determine the best 'hyperparameters' of these explainers, such as comparing different methods for creating a background dataset. One thing I'm currently working on in my own version of these benchmarks is calculating the theoretical optimal curves that are possible when the Shapley values are 100% correct in identifying which features cause the model output to increase or decrease the most, as this would allow one to verify that their method is robust, independently of how it performs relative to other methods.

Kindest regards, Pieter

haochuan-li commented 2 years ago

any updates on the metrics?

barnardp commented 2 years ago

Hi Spidy,

I actually implemented my own version of these metrics for a paper that I was working on at the time (for a regression-based problem). In addition to implementing the mask metrics proposed in the Tree SHAP paper, I also implemented my own version that provides an explicit ground truth, which can be used to see how well explanations compare to the optimal case.

I tried my best to make the code legible and easy to follow at the time when I was writing it; however, I didn't intend to make it public. I've just created a repo where the files can be found (https://github.com/barnardp/Explainable-Resource-Reservation). Hopefully they can provide you with some answers, or at least a good starting point for whatever you're working on. NB: calculating the ground truth for the mask metrics is computationally expensive; if I recall correctly it's close to the order of N!*N^2 for most of them, so I suggest running the code on one explanation first to see how long it takes!

Best regards, Pieter

github-actions[bot] commented 1 week ago

This issue has been inactive for two years, so it's been automatically marked as 'stale'.

We value your input! If this issue is still relevant, please leave a comment below. This will remove the 'stale' label and keep it open.

If there's no activity in the next 90 days the issue will be closed.