Shap features add support to model pipelines

Matgrb commented 3 years ago

Problem Description: Currently, when a model pipeline is passed to any SHAP feature, we hand it directly to the SHAP explainer, which causes an error.

Desired Outcome: All SHAP features, e.g. ShapRFECV, should support sklearn and imblearn pipelines.

Solution Outline: Before passing the model to shap.Explainer in probatus.utils.shap_helpers.shap_calc, check whether the model is a pipeline and, if so, extract the last step from it. Support for pipelines will be quite limited: it will only work if the earlier steps of the pipeline do not change the feature names or the number of features. For instance, if the user applies PCA to reduce the number of features, the model can no longer be analysed with SHAP.
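
The check described in the outline could look roughly like the sketch below. It assumes scikit-learn's `Pipeline` class; the helper name `extract_final_estimator` is hypothetical and not part of probatus. As far as I know, imblearn's `Pipeline` subclasses the scikit-learn one, so the same `isinstance` check should cover it, but that is not verified here.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def extract_final_estimator(model):
    """Return the last step of a Pipeline, or the model itself otherwise.

    Hypothetical helper for illustration; not part of probatus.
    """
    if isinstance(model, Pipeline):
        # Pipeline.steps is a list of (name, estimator) tuples.
        return model.steps[-1][1]
    return model


pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
final_model = extract_final_estimator(pipe)
plain_model = extract_final_estimator(LogisticRegression())
```

Note that this only extracts the estimator; the data passed to SHAP would still have to go through the earlier pipeline steps.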

Matgrb commented 3 years ago

I am considering implementing this issue. However, I see a couple of problems. A pipeline might apply various steps before running the model: extracting features, rescaling the data, over- or undersampling, and many others. To analyse a pipeline with SHAP, you first have to apply all of the pipeline's transformers to the input data and then explain the final estimator. This has the following side effects:

Data rescaling and simple transformations - the feature values differ from the initial values. This can make the summary plot of the model, or the analysis of specific samples, significantly harder for the user to read, since we apply the transformations right before passing the model to SHAP.
Feature extraction, selection, PCA, missing indicators etc. - Error, because we have new features and no names for them. I also think we should not try to find names for these features, because this would be very hard to implement inside ShapRFECV.
Some features dropped in the pipeline - Error

Given that we can only partially support pipelines for SHAP features, and that even then the results might mislead the user, I think the user should apply all the transformations before passing the data to e.g. ShapRFECV, Model interpret or ShapResemblance model.

This has certain drawbacks when the user wants to use methods that require passing X1 and X2, or X_train and X_test, because the user then has to apply the transformers correctly to each of these sets. We might want to add a docs page explaining how to handle model pipelines for different use cases, and throw an informative error if a model pipeline is passed to shap.Explainer.
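
For the X_train / X_test case, a minimal sketch of applying the transformers by hand might look like this. It assumes a scikit-learn pipeline whose transformers keep the number and order of features; the data, column names and pipeline steps are made up for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data with named features.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X = pd.DataFrame(X, columns=["f1", "f2", "f3", "f4"])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
pipe.fit(X_train, y_train)

# pipe[:-1] is a sub-pipeline holding only the fitted transformers;
# pipe[-1] is the fitted final estimator.
transformers = pipe[:-1]
model = pipe[-1]

# Transform both sets with the transformers fitted on the training data only,
# and restore the column names that the array output drops.
X_train_t = pd.DataFrame(
    transformers.transform(X_train), columns=X_train.columns, index=X_train.index
)
X_test_t = pd.DataFrame(
    transformers.transform(X_test), columns=X_test.columns, index=X_test.index
)
```

`model`, `X_train_t` and `X_test_t` are then what would be passed on to the SHAP-based feature instead of `pipe` and the raw data.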

@timvink @sbjelogr what do you think?

sbjelogr commented 3 years ago

You have a good point, @Matgrb.

Indeed, SHAP values are well defined for a model. The transformers might mess up the interpretation quite a bit.

In the first point (data rescaling and simple transformations), as long as the transformations are linear (say MinMaxScaler), it's easy to translate the interpretations (even with a summary plot). However, this category also includes other transformers (such as imputers, FunctionTransformers, or any categorical encoder that encodes target information, e.g. WoeEncoders) that are not so easy to interpret.
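
As a small illustration of the linear case, the sketch below maps scaled feature values back to their original units with `MinMaxScaler.inverse_transform`. It only covers the feature values shown to the user (for example, next to a plot); the SHAP values themselves would still be computed on the scaled data. The numbers are made up.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two features on very different scales (illustrative values).
X = np.array([[10.0, 200.0], [20.0, 400.0], [40.0, 1000.0]])

scaler = MinMaxScaler().fit(X)
X_scaled = scaler.transform(X)  # what the model and the explainer would see

# For display purposes, map the scaled values back to the original units.
X_original_units = scaler.inverse_transform(X_scaled)
```

No such simple inverse exists for target-based encoders or imputers, which is why those are harder to interpret.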

In addition, depending on how you use imblearn transformers on a training and validation set, you might get very misleading SHAP interpretations. Recall that SHAP always calculates the marginal contribution of a feature relative to the average target distribution on your sample; imblearn transformers will probably change the average distribution of the sample, and hence affect the SHAP interpretations.
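
To make the resampling point concrete, here is a NumPy-only sketch of how naive random oversampling shifts the average of the target. It is a stand-in for what an over-sampler does, not imblearn's actual implementation, and the class sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Imbalanced binary target: 10 positives, 90 negatives.
y = np.array([1] * 10 + [0] * 90)

# Naive random oversampling: draw 80 extra positives with replacement
# so that both classes end up with 90 samples.
minority_idx = np.flatnonzero(y == 1)
extra_idx = rng.choice(minority_idx, size=80, replace=True)
y_resampled = np.concatenate([y, y[extra_idx]])

# The average target, which the explanations are measured against, has moved.
base_rate_original = y.mean()
base_rate_resampled = y_resampled.mean()
```

Explanations computed against the resampled sample are therefore relative to a different baseline than explanations computed against the original data.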

I would suggest raising a warning by default in the first case (which could be turned off with a verbosity flag, for example), something like:

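import warnings

def func_name(*args, verbosity=1):
    # *args stands in for the function's other arguments.
    # Warn by default; pass verbosity=0 to turn the warning off.
    if verbosity > 0:
        warnings.warn("Shap interpretation might not be well defined with a pipeline. The recommended approach is to first apply the transformers, and then use the function with the last step of the pipeline.")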

In addition, all three points you mention are definitely an issue when you need to use SHAP for interpretation purposes.

However, in the context of feature selection with ShapRFECV, you can probably have looser requirements on which transformers you reject by raising an error.

Matgrb commented 3 years ago

I think for now it will be safer to just not support model pipelines and ask users to apply the transformers beforehand.

Indeed, ShapRFECV could have looser requirements. Let's see whether users of the package open more issues about it, and how they want to use it in their example code.

For now, I added an error with a message explaining that we don't support pipelines and that users need to apply the transformers beforehand.
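
A check of that kind might look roughly like the following sketch. It is not the actual probatus code; the helper name `assert_not_pipeline`, the exception type and the message wording are assumptions for illustration.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def assert_not_pipeline(model):
    """Raise TypeError if a pipeline is passed, otherwise return the model.

    Hypothetical helper; not the actual probatus implementation.
    """
    if isinstance(model, Pipeline):
        raise TypeError(
            "Model pipelines are not supported. Apply the pipeline's "
            "transformers to the data first, then pass the final estimator."
        )
    return model


pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])

try:
    assert_not_pipeline(pipe)
    raised = False
except TypeError:
    raised = True
```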

ing-bank / probatus

Shap features add support to model pipelines #128