iancovert / fastshap

An amortized approach for calculating local Shapley value explanations
MIT License

Variability in FastSHAP values #1

Closed ghanasyam-rallapalli closed 3 years ago

ghanasyam-rallapalli commented 3 years ago

Hello @iancovert, I have been reading all of your contributions on Shapley values and model explainability, and I'm grateful for your work in the field.

Now I have tried FastSHAP on a different binary classification dataset and found that the FastSHAP values vary quite a bit and differ a lot from the converged Shapley values, often pointing in the opposite direction; see one of the plots included below.

Any thoughts or suggestions on how I could try to improve their consistency? The notebook code is here.

iancovert commented 3 years ago

Hi Ghanasyam, thanks very much for checking out the packages! I'm going to give a pretty detailed answer here in case others encounter a similar issue.

So yes, in the example you've shown it's clear that FastSHAP is not giving good estimates of the ground truth Shapley values. It's actually pretty interesting that none of the four methods are that close together, and it may be helpful to briefly explain why before getting to FastSHAP.

First, we have the ground truth Shapley values that are calculated by running KernelSHAP to convergence; this is what we want everything else to be close to. Second, we have KernelSHAP run with only 128 model evaluations, which is too few iterations to achieve accurate estimates. Next, we have KernelSHAP from the SHAP package, which could be diverging from our ground truth for two reasons: 1) estimation error because of an insufficient number of iterations (KernelSHAP uses a default number of iterations), and 2) because the SHAP package is using a different feature removal strategy (in your notebook, the k-means approach).
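To make those two sources of divergence concrete, here's a minimal sketch (not from the notebook; `model`, `X_train`, and `x_test` are placeholder names for a fitted classifier and the data arrays) of how both knobs appear when using the SHAP package's KernelSHAP:

```python
# Sketch only: assumes `model` is a fitted sklearn-style classifier and
# `X_train` / `x_test` are numpy arrays like those in the notebook.
import shap

# Knob 1: the feature-removal strategy. Summarizing the background data with
# k-means (the SHAP package's speed recommendation) defines a different removal
# distribution than using the full background or a surrogate model.
background = shap.kmeans(X_train, 10)
explainer = shap.KernelExplainer(model.predict_proba, background)

# Knob 2: the number of sampled coalitions. Too few gives high-variance
# estimates; more samples reduce the variance, but only converge to the values
# defined by the removal strategy chosen above.
phi_rough = explainer.shap_values(x_test, nsamples=128)
phi_converged = explainer.shap_values(x_test, nsamples=100_000)
```

So even a fully converged run can disagree with the surrogate-based ground truth, because the first knob means it is converging to a different target.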

Finally, there's FastSHAP. FastSHAP is trained using the same feature removal approach as our ground truth (a surrogate model trained to accommodate missing features), so it should provide accurate estimates - but it doesn't. The easiest way to understand this is that the FastSHAP explainer, a predictive deep learning model, is not getting good performance on its test set. (In this case, it's the test set because you're explaining a test example.)
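To be clear about what that means, here's a conceptual sketch (illustrative only, not the fastshap package's actual API; layer sizes and names are made up) of the explainer: an ordinary network that maps an input directly to one attribution per feature and class, so everything we know about generalization in supervised learning applies to it.

```python
# Illustrative sketch of an amortized explainer; this is not the fastshap
# package's real interface.
import torch
import torch.nn as nn

num_features, num_classes = 30, 2  # e.g. the breast cancer dataset

explainer = nn.Sequential(
    nn.Linear(num_features, 128),
    nn.ReLU(),
    nn.Linear(128, 128),
    nn.ReLU(),
    nn.Linear(128, num_features * num_classes),
)

x = torch.randn(1, num_features)                            # a test example
phi = explainer(x).reshape(-1, num_features, num_classes)   # predicted Shapley values
```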

In general, poor performance from the explainer can happen for a couple of reasons: the network may not be expressive enough or trained long enough to learn the explanation task, or it may be overfitting when the training data is limited.

In your case, because the Wisconsin Breast Cancer dataset is rather small (569 examples, only a subset of which are used for training), I think the overfitting case is possible. But it's hard to be sure, so it's worth experimenting with the explainer's training setup.
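One way to make that diagnosis less anecdotal (a rough, self-contained sketch with illustrative names, not part of the package) is to compare FastSHAP's outputs against the converged KernelSHAP values on several held-out examples, tracking both the error and how often the signs agree, since flipped directions were the symptom here:

```python
# Rough sketch: `estimated` holds FastSHAP outputs and `ground_truth` holds
# converged KernelSHAP values, both shaped (num_examples, num_features).
import numpy as np

def compare_explanations(estimated, ground_truth):
    estimated, ground_truth = np.asarray(estimated), np.asarray(ground_truth)
    mse = np.mean((estimated - ground_truth) ** 2)
    sign_agreement = np.mean(np.sign(estimated) == np.sign(ground_truth))
    return mse, sign_agreement

# Hypothetical usage:
# mse, agree = compare_explanations(fastshap_phi, kernelshap_phi)
```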

Hopefully that all makes sense, let me know how it goes.

ghanasyam-rallapalli commented 3 years ago

Thank you @iancovert for your quick and detailed reply to my question. You are right, this dataset has rather few instances to train on; I should have thought about this early on.

I wanted to test FastSHAP on a dataset with a slightly larger number of features and see how well it compares to other approaches, exactly for the reasons you mentioned. For KernelSHAP from the SHAP package, I had only sampled 5000 coalitions out of the roughly 1 billion combinations of those 30 features, and I used shap.kmeans, as per the SHAP package's recommendation, so that I could increase the sampling. I went up to 1 million samples to see whether the SHAP KernelSHAP values would match your "optimised" KernelSHAP regression estimates, but still found the direction of the Shapley values to be opposite. That could be attributed to the dataset.

So I will try to find a slightly better dataset with a higher number of features and get back to you.

And thanks a lot for providing those really useful tips and pointers to consider.

iancovert commented 3 years ago

Hi there - I think if you're continuing to see differences between the SHAP and shapley-regression versions of KernelSHAP, it's almost certainly because of how the explanations are holding out features. If one version uses the surrogate model but the other uses k-means to determine a small number of background samples, the results from the two methods will not agree even when computed with a million samples.
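For intuition, here's a rough sketch (placeholder names throughout, not either package's code) of the two value functions involved. Both answer "what does the model predict when only the features in the coalition are kept?", but they remove the remaining features differently, so the Shapley values they define are genuinely different quantities:

```python
# Placeholder names only; this contrasts the two feature-removal strategies
# rather than reproducing either implementation.
import numpy as np

def value_marginal(model, x, keep_mask, background):
    """SHAP-package-style removal: splice the kept features of x into each
    background sample (e.g. k-means centroids) and average the predictions."""
    samples = np.array(background, dtype=float, copy=True)
    samples[:, keep_mask] = x[keep_mask]
    return model.predict_proba(samples).mean(axis=0)

def value_surrogate(surrogate, x, keep_mask):
    """Surrogate-style removal: a model trained to take (input, mask) and
    approximate the original model's prediction with the masked features
    treated as missing."""
    return surrogate(x[None, :], keep_mask[None, :].astype(float))
```

Since these two functions generally return different numbers for the same coalition, the Shapley values built on top of them will differ even with unlimited sampling.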

But anyway, I also think that a larger dataset is worth trying. I'm going to close this issue for now but we can reopen later if necessary.