iancovert / fastshap

An amortized approach for calculating local Shapley value explanations
MIT License

Variability in FastSHAP values #1

Closed ghanasyam-rallapalli closed 3 years ago

ghanasyam-rallapalli commented 3 years ago

Hello @iancovert, I have been reading all of your contributions on Shapley values and model explainability, and I'm grateful for your work in the field.

Now I have tried FastSHAP on a different binary classification dataset and found that the FastSHAP values vary quite a bit and differ a lot from the converged Shapley values, often pointing in the opposite direction; see one of the plots included below.

Any thoughts or suggestions on how I could try to improve their consistency? The notebook code is here.

iancovert commented 3 years ago

Hi Ghanasyam, thanks very much for checking out the packages! I'm going to give a pretty detailed answer here in case others encounter a similar issue.

So yes, in the example you've shown it's clear that FastSHAP is not giving good estimates of the ground truth Shapley values. It's actually pretty interesting that none of the four methods are that close together, and it may be helpful to briefly explain why before getting to FastSHAP.

First, we have the ground truth Shapley values that are calculated by running KernelSHAP to convergence; this is what we want everything else to be close to. Second, we have KernelSHAP run with only 128 model evaluations, which is too few iterations to achieve accurate estimates. Next, we have KernelSHAP from the SHAP package, which could be diverging from our ground truth for two reasons: 1) estimation error because of an insufficient number of iterations (KernelSHAP uses a default number of iterations), and 2) because the SHAP package is using a different feature removal strategy (in your notebook, the k-means approach).
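To make those two sources of divergence concrete, here's a minimal sketch (not from the notebook; `model`, `X_train`, and `x_test` are placeholder names for a fitted classifier and the data arrays) of how both knobs appear when using the SHAP package's KernelSHAP:

```python
# Sketch only: assumes `model` is a fitted sklearn-style classifier and
# `X_train` / `x_test` are numpy arrays like those in the notebook.
import shap

# Knob 1: the feature-removal strategy. Summarizing the background data with
# k-means (the SHAP package's speed recommendation) defines a different removal
# distribution than using the full background or a surrogate model.
background = shap.kmeans(X_train, 10)
explainer = shap.KernelExplainer(model.predict_proba, background)

# Knob 2: the number of sampled coalitions. Too few gives high-variance
# estimates; more samples reduce the variance, but only converge to the values
# defined by the removal strategy chosen above.
phi_rough = explainer.shap_values(x_test, nsamples=128)
phi_converged = explainer.shap_values(x_test, nsamples=100_000)
```

So even a fully converged run can disagree with the surrogate-based ground truth, because the first knob means it is converging to a different target.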

Finally, there's FastSHAP. FastSHAP is trained using the same feature removal approach as our ground truth (a surrogate model trained to accommodate missing features), so it should provide accurate estimates - but it doesn't. The easiest way to understand this is that the FastSHAP explainer, a predictive deep learning model, is not getting good performance on its test set. (In this case, it's the test set because you're explaining a test example.)
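To be clear about what that means, here's a conceptual sketch (illustrative only, not the fastshap package's actual API; layer sizes and names are made up) of the explainer: an ordinary network that maps an input directly to one attribution per feature and class, so everything we know about generalization in supervised learning applies to it.

```python
# Illustrative sketch of an amortized explainer; this is not the fastshap
# package's real interface.
import torch
import torch.nn as nn

num_features, num_classes = 30, 2  # e.g. the breast cancer dataset

explainer = nn.Sequential(
    nn.Linear(num_features, 128),
    nn.ReLU(),
    nn.Linear(128, 128),
    nn.ReLU(),
    nn.Linear(128, num_features * num_classes),
)

x = torch.randn(1, num_features)                            # a test example
phi = explainer(x).reshape(-1, num_features, num_classes)   # predicted Shapley values
```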

In general, poor performance from the explainer can happen for a couple of reasons: the network may not be expressive enough or trained long enough to learn the explanation task, or it may be overfitting when the training data is limited.

In your case, because the Wisconsin Breast Cancer dataset is rather small (569 examples, only a subset of which are used for training), I think the overfitting case is possible. But it's hard to be sure, so it's worth experimenting with the explainer's training setup.
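One way to make that diagnosis less anecdotal (a rough, self-contained sketch with illustrative names, not part of the package) is to compare FastSHAP's outputs against the converged KernelSHAP values on several held-out examples, tracking both the error and how often the signs agree, since flipped directions were the symptom here:

```python
# Rough sketch: `estimated` holds FastSHAP outputs and `ground_truth` holds
# converged KernelSHAP values, both shaped (num_examples, num_features).
import numpy as np

def compare_explanations(estimated, ground_truth):
    estimated, ground_truth = np.asarray(estimated), np.asarray(ground_truth)
    mse = np.mean((estimated - ground_truth) ** 2)
    sign_agreement = np.mean(np.sign(estimated) == np.sign(ground_truth))
    return mse, sign_agreement

# Hypothetical usage:
# mse, agree = compare_explanations(fastshap_phi, kernelshap_phi)
```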

Hopefully that all makes sense, let me know how it goes.

ghanasyam-rallapalli commented 3 years ago

Thank you @iancovert for your quick and detailed reply to my question. You are right, this dataset has rather few instances to train on; I should have thought about this early on.

I wanted to test FastSHAP on a dataset with a slightly larger number of features and see how well it compares to other approaches, exactly for the reasons you mentioned. For KernelSHAP from the SHAP package, I had only sampled 5000 coalitions out of the roughly 1 billion combinations of those 30 features, and I used shap.kmeans, as per the SHAP package's recommendation, so that I could increase the sampling. I went up to 1 million samples to see whether the SHAP KernelSHAP values would match your "optimised" KernelSHAP regression estimates, but still found the direction of the Shapley values to be opposite. That could be attributed to the dataset.

So I will try to find a slightly better dataset with a higher number of features and get back to you.

And thanks a lot for providing those really useful tips and pointers to consider.

iancovert commented 3 years ago

Hi there - I think if you're continuing to see differences between the SHAP and shapley-regression versions of KernelSHAP, it's almost certainly because of how the explanations are holding out features. If one version uses the surrogate model but the other uses k-means to determine a small number of background samples, the results from the two methods will not agree even when computed with a million samples.
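For intuition, here's a rough sketch (placeholder names throughout, not either package's code) of the two value functions involved. Both answer "what does the model predict when only the features in the coalition are kept?", but they remove the remaining features differently, so the Shapley values they define are genuinely different quantities:

```python
# Placeholder names only; this contrasts the two feature-removal strategies
# rather than reproducing either implementation.
import numpy as np

def value_marginal(model, x, keep_mask, background):
    """SHAP-package-style removal: splice the kept features of x into each
    background sample (e.g. k-means centroids) and average the predictions."""
    samples = np.array(background, dtype=float, copy=True)
    samples[:, keep_mask] = x[keep_mask]
    return model.predict_proba(samples).mean(axis=0)

def value_surrogate(surrogate, x, keep_mask):
    """Surrogate-style removal: a model trained to take (input, mask) and
    approximate the original model's prediction with the masked features
    treated as missing."""
    return surrogate(x[None, :], keep_mask[None, :].astype(float))
```

Since these two functions generally return different numbers for the same coalition, the Shapley values built on top of them will differ even with unlimited sampling.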

But anyway, I also think that a larger dataset is worth trying. I'm going to close this issue for now but we can reopen later if necessary.