predict-idlab / powershap

A power-full Shapley feature selection method.

Is it possible to pass a sklearn Pipeline as model to `powershap`? #33


eduardokapp commented 1 year ago

Is there currently a way to pass a sklearn.pipeline.Pipeline object as the model parameter? I can't seem to do it, and I think that being able to would be better for the internal cross-validation.

For example, imagine you have a model defined as a pipeline whose first one or two steps are preprocessing operations that should not be fit on the validation or test set, e.g. filling missing values with the mean.

Right now, I'm preprocessing my data up front and then passing just the pipeline's final step to the powershap object.
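Concretely, my current workaround looks roughly like this (a minimal sketch, assuming some feature matrix `X` and labels `y`):

from catboost import CatBoostClassifier
from sklearn.impute import SimpleImputer

from powershap import PowerShap

# Preprocessing is fit on the full dataset up front, so its statistics
# (here: the column means) leak into powershap's internal validation folds
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)

# Only the final estimator is handed to powershap
selector = PowerShap(model=CatBoostClassifier(verbose=0))
selector.fit(X_imputed, y)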

jvdd commented 1 year ago

Hey @eduardokapp,

That is an excellent question! With the exact same train of thought (feature selection should be included in the pipeline for cross-validation), I've implemented powershap to be scikit-learn compatible.

You should be aware, however, that powershap performs a transformation (i.e., selecting features) and thus cannot be the final step in your scikit-learn pipeline => your final step should be an estimator (some sort of model).

Dummy code of how this would look like :arrow_down:

from catboost import CatBoostClassifier
from sklearn.pipeline import Pipeline

from powershap import PowerShap

pipe = Pipeline(
    [
        # ... (some more preprocessing / transformation steps)
        ("feature selection", PowerShap()),
        ("model", CatBoostClassifier()),
    ]
)
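Once assembled, the pipeline behaves like any other scikit-learn estimator; a quick usage sketch (with hypothetical `X_train` / `X_test` data):

# powershap selects features during fit; the selected subset is then
# forwarded to the CatBoost model automatically
pipe.fit(X_train, y_train)
predictions = pipe.predict(X_test)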

Hope this helps! If not, feel free to provide a minimal reproducible code :)

Cheers, Jeroen

eduardokapp commented 1 year ago

I'm not sure I understand what you're saying. I agree that powershap is essentially a sklearn transformer and, yes, it should be inside the pipeline!

However, what I don't really get is this: powershap has a `cv` parameter for passing a cross-validator and, as I understand it, powershap fits a model many times during its processing. Wouldn't it then be necessary for powershap's `model` parameter to accept a pipeline, and not just a bare model?
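For instance, the usage I have in mind would look something like this (hypothetical, since this is exactly what doesn't seem to work today):

from catboost import CatBoostClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

from powershap import PowerShap

# Hypothetical / not currently supported: the whole pipeline, preprocessing
# included, would be refit on every internal powershap fold
inner_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("model", CatBoostClassifier(verbose=0)),
])
selector = PowerShap(model=inner_pipe)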

Hope I clarified my question! Thank you for your quick response.

jvdd commented 1 year ago

Oh, I see! My apologies for misinterpreting your question - looking back at the title I acknowledge you formulated it quite clearly :upside_down_face:

> However, what I don't really get is this: powershap has a `cv` parameter for passing a cross-validator and, as I understand it, powershap fits a model many times during its processing. Wouldn't it then be necessary for powershap's `model` parameter to accept a pipeline, and not just a bare model?

Indeed, this would make sense! I see two options:

1. Preprocess the data up front and pass only the final estimator as powershap's `model` (preprocessing is fit once, outside the internal cross-validation).
2. Accept a full pipeline as `model`, so that the preprocessing steps are refit on the training portion of every internal fold.

We currently comply with the 1st option. To some extent, supporting the 2nd option as well would further limit data leakage. However, I am not 100% confident that this makes sense from an algorithmic standpoint, as we would then possibly not be comparing apples to apples - the data will change (slightly) over the folds when performing the internal cross-validation. I do suspect this effect - if measurable at all - to be very minimal :thinking:
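To make the leakage difference concrete, here is a quick synthetic sketch that uses plain scikit-learn cross-validation as a stand-in for powershap's internal CV:

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
X[rng.random(X.shape) < 0.1] = np.nan  # sprinkle in missing values
y = rng.integers(0, 2, size=200)

# 1st option (current): the imputer is fit on ALL rows, so the column means
# of the validation folds leak into the training data
X_global = SimpleImputer(strategy="mean").fit_transform(X)
scores_option1 = cross_val_score(LogisticRegression(), X_global, y, cv=5)

# 2nd option (proposed): the imputer is refit on each training fold only
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("model", LogisticRegression()),
])
scores_option2 = cross_val_score(pipe, X, y, cv=5)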

Interested in hearing your opinion about this @eduardokapp! Also @JarneVerhaeghe can you weigh in on this as well?

JarneVerhaeghe commented 1 year ago

Putting a scikit-learn pipeline in powershap is, from an algorithmic standpoint, a plausible option. Because we refit the preprocessors every powershap iteration, every feature within an iteration can be compared to the others. Furthermore, the label should be comparable across iterations, which in turn enables comparing the Shapley values, because they will be of the same magnitude. The main concern where this could go wrong is the case where the resulting distributions after preprocessing are completely different across iterations. However, for operations such as normalization or min-max scaling, the label, the features, and the Shapley values will all have the same magnitudes across iterations, and the algorithm will therefore still perform adequately.
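As a quick illustration (a synthetic sketch, not powershap code): refitting a scaler on different random subsets, as would happen every powershap iteration, leaves the transformed distribution essentially unchanged:

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=(1000, 1))

for seed in (1, 2):
    # each "iteration" refits the preprocessor on a different subset
    idx = np.random.default_rng(seed).choice(len(X), size=700, replace=False)
    scaler = StandardScaler().fit(X[idx])
    Xt = scaler.transform(X)
    print(f"iteration {seed}: mean={Xt.mean():+.3f}, std={Xt.std():.3f}")
# both iterations yield ~N(0, 1): the magnitudes stay comparable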

I hope this answers your question a bit, @eduardokapp?

eduardokapp commented 1 year ago

Thank you so much for taking the time to answer my question. So, given that this makes sense, what should be done (code-wise) to make it happen? I'd be happy to implement this feature.

eduardokapp commented 1 year ago

Hey @jvdd, @JarneVerhaeghe! I've been thinking about this issue, and maybe it could be solved by creating a new explainer subclass that reuses the ones you already defined but also applies the pipeline's transformation steps.

Excuse me for the over-simplified ideas here, but something along the lines of:

from typing import Callable

import numpy as np
import shap

# ShapExplainer and the supported model classes (CatBoost*, LGBM*, XGB*, ...)
# are assumed to be importable from powershap's existing explainer module

class PipelineExplainer(ShapExplainer):
    @staticmethod
    def supports_model(model) -> bool:
        from sklearn.pipeline import Pipeline

        # Check if model is a Pipeline
        if not isinstance(model, Pipeline):
            return False

        # Get the final step (estimator) of the pipeline
        estimator = model.steps[-1][1]

        # Check if the final step is an instance of one of the supported models
        supported_models = [
            CatBoostRegressor, CatBoostClassifier,
            LGBMClassifier, LGBMRegressor,
            XGBClassifier, XGBRegressor,
            ForestRegressor, ForestClassifier, BaseGradientBoosting,
            LinearClassifierMixin, LinearModel, BaseSGD,
            tf.keras.Model,
        ]
        return isinstance(estimator, tuple(supported_models))

    def _fit_get_shap(self, X_train, Y_train, X_val, Y_val, random_seed, **kwargs) -> np.ndarray:
        # Get the final estimator from the pipeline
        estimator = self.model.steps[-1][1]

        # Fit the whole pipeline (preprocessing steps + estimator) on the training fold
        self.model.fit(X_train, Y_train, **kwargs)

        # Push the validation data through all the preceding (fitted) steps
        transformed_X_val = X_val
        for name, step in self.model.steps[:-1]:
            transformed_X_val = step.transform(transformed_X_val)

        # Calculate the shap values using the final estimator
        # (maybe this could instead inherit / reuse the behavior of the existing
        # explainer classes, which already know how to explain each model type)
        explainer = shap.Explainer(estimator)
        shap_values = explainer.shap_values(transformed_X_val)

        return shap_values

    def _validate_data(self, validate_data: Callable, X, y, **kwargs):
        # Schematic: let each preprocessing step validate/convert the data
        # before delegating to the base explainer's validation
        for name, step in self.model.steps[:-1]:
            X = step._validate_data(validate_data, X, **kwargs)
        return super()._validate_data(validate_data, X, y, **kwargs)