I am considering implementing this issue. However, I see a couple of problems. A pipeline might apply different steps before running the model: extracting features, rescaling the data, over- or undersampling, and many others. If you want to apply SHAP to analyse a pipeline, you first have to apply all transformers of the pipeline to the input data, and then apply the estimator. So there are the following side effects:
1. Data rescaling and simple transformations: SHAP explains the transformed features, so the interpretation is no longer on the original scale of the data.
2. Feature extraction: steps such as PCA change the names and number of features, so the model's inputs no longer correspond to the user's features.
3. Over- or undersampling: the distribution of the sample changes, which affects the SHAP values.
Given that we can only partially support pipelines for the SHAP features, and that even then the results might mislead the user, I think the user should apply all the transformations beforehand and pass the transformed data to e.g. ShapRFECV, the model interpreter, or the SHAP resemblance model.
This has certain drawbacks when the user wants to use methods that require passing X1 and X2, or X_train and X_test, because the user then has to apply the transformers consistently to these sets. We might want to add a docs page explaining how to handle model pipelines for the different use cases, and throw an informative error if a model pipeline is passed to shap.Explainer.
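For illustration, here is a minimal sketch of that workflow on a toy dataset, assuming the current probatus API (ShapRFECV with fit_compute); the key point is that the transformer is fitted on the training set only and the SHAP feature receives just the final estimator:

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from probatus.feature_elimination import ShapRFECV

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(10)])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the transformer on the training set only, then reuse it on the test set.
scaler = MinMaxScaler().fit(X_train)
X_train_t = pd.DataFrame(scaler.transform(X_train), columns=X.columns, index=X_train.index)
X_test_t = pd.DataFrame(scaler.transform(X_test), columns=X.columns, index=X_test.index)

# Pass only the final estimator, never the full pipeline.
shap_elimination = ShapRFECV(RandomForestClassifier(), step=0.2, cv=5, scoring="roc_auc")
report = shap_elimination.fit_compute(X_train_t, y_train)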
@timvink @sbjelogr what do you think?
You have a good point, @Matgrb.
Indeed, SHAP values are well defined for a model; the transformers might mess up the interpretation quite a bit.
For the first point (Data rescaling and simple transformations), as long as the transformations are linear (say MinMaxScaler), it is easy to translate the interpretations (even in a summary plot).
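As a hedged sketch of what "translating" means in practice (toy data; only shap's TreeExplainer and summary_plot are assumed): fit on the scaled features but display the original values in the plot, since the rows align one-to-one:

import pandas as pd
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import MinMaxScaler

X, y = make_regression(n_samples=300, n_features=5, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(5)])
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), columns=X.columns)

model = RandomForestRegressor(random_state=0).fit(X_scaled, y)
shap_values = shap.TreeExplainer(model).shap_values(X_scaled)
# The model saw scaled inputs, but the plot can show the original values,
# because MinMaxScaler keeps the rows and the feature order intact.
shap.summary_plot(shap_values, features=X)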
However, this category might also contain other transformers (imputers, FunctionTransformers, or any categorical encoder that encodes target information, e.g. WoE encoders) that are not so easy to interpret.
In addition, depending on how you use imblearn transformers on a training and validation set, you might get very misleading SHAP interpretations: recall that SHAP always calculates the marginal contribution of a feature relative to the average target distribution of your sample, and imblearn transformers will most likely change the average distribution of the sample, hence affecting the SHAP interpretations.
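To make that concrete, here is a hedged sketch on toy data (assuming TreeExplainer accepts a background sample via its data argument): the SHAP base value is the average model output over the background sample, so the same model gets a different base value once the background is resampled:

import numpy as np
import shap
from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X, y)

model = GradientBoostingClassifier(random_state=0).fit(X_res, y_res)
rng = np.random.default_rng(0)
bg_orig = X[rng.choice(len(X), 100, replace=False)]         # original distribution
bg_res = X_res[rng.choice(len(X_res), 100, replace=False)]  # oversampled distribution
# Same model, two backgrounds -> two different SHAP base values.
print(shap.TreeExplainer(model, data=bg_orig).expected_value)
print(shap.TreeExplainer(model, data=bg_res).expected_value)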
I would suggest raising a warning by default in the first case (which could be turned off by a verbosity flag, for example), something like:
import warnings

def func_name(*args, verbosity=1, **kwargs):
    if verbosity > 0:  # warn by default; verbosity=0 silences it
        warnings.warn("Shap interpretation might not be well defined with a pipeline. "
                      "The recommended approach is to first apply the transformers, "
                      "and then use the function with the last step of the pipeline.")
In addition, all three points you mention are definitely an issue when you need to use shap for interpretation purposes. However, in the context of feature selection with ShapRFECV, you could probably have looser requirements on which transformers you reject by raising an error.
I think for now it will be safer to simply not support model pipelines and ask users to apply the transformers beforehand.
Indeed, ShapRFECV could have looser requirements; let's see whether users of the package open more issues about it and how they want to use it in their example code.
For now I have added an error with a message explaining that we don't support pipelines and that the transformers need to be applied beforehand.
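For reference, such a guard could look roughly like this (an illustrative sketch, not the exact probatus code; imblearn pipelines subclass sklearn's Pipeline, so a single isinstance check covers both):

from sklearn.pipeline import Pipeline

def _check_model_is_not_pipeline(model):
    # Illustrative guard: reject pipelines with an informative message.
    if isinstance(model, Pipeline):
        raise TypeError(
            "Model pipelines are not supported. Apply the transformers to the "
            "data beforehand and pass only the final estimator."
        )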
Problem Description
Currently, when passing a model pipeline to any SHAP feature, we pass it directly to the SHAP explainer, which causes an error.
Desired Outcome
All SHAP features, e.g. ShapRFECV, should support using sklearn and imblearn pipelines.
Solution Outline
Before passing the model to shap.Explainer in probatus.utils.shap_helpers.shap_calc, check whether the model is a pipeline and, if so, extract the last step from it. The support for pipelines will be quite limited: it will only work if the previous steps of the pipeline do not perturb the names or number of features. For instance, if the user applies PCA to reduce the number of features, the model is no longer analysable with SHAP.
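A hedged sketch of that check (illustrative names, not the actual shap_calc implementation):

from sklearn.pipeline import Pipeline

def extract_final_estimator(model):
    # If the model is a (sklearn/imblearn) pipeline, return its final step,
    # assuming the earlier steps preserve the feature names and count.
    if isinstance(model, Pipeline):
        return model.steps[-1][1]  # steps are (name, estimator) pairs
    return model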