micahmelling / auto-shap

MIT License

auto-shap

The auto-shap library is your best friend when calculating SHAP values!

SHAP is a state-of-the-art technique for explaining model predictions. Model explanation can be valuable in many regards. For one, understanding how a model devised a prediction can engender trust. Conversely, it could inform us if our model is using features in a nonsensical or unrealistic way, potentially helping us to catch leakage issues. Likewise, feature importance scores can be useful for external presentations. For further details on SHAP values and their underlying mathematical properties, see the hyperlink at the beginning of this paragraph.

The Python SHAP library is often the go-to source for computing SHAP values. It's handy and can explain virtually any model we would like. However, we must be aware of the following when using the library.

Likewise, the native SHAP library does not take advantage of multiprocessing. The auto-shap library runs SHAP calculations in parallel to speed them up when possible. When we are using a Tree or Linear Explainer, we can calculate our SHAP values in parallel without issue: the results are the same as when we run the calculations on a single core. Such situations are heavily tested in tests/tests.py in the GitHub repo.

However, the situation is slightly different when we are using the KernelExplainer. The KernelExplainer is not deterministic, even when we are not using parallel processing. In fact, especially on small datasets, if we re-run the KernelExplainer back-to-back on the same data with the same model, we won't get exactly the same feature-level attributions, though the total attribution will stay the same (which is tested in tests/tests.py). The foregoing points can be substantiated by looking at the SHAP documentation. This article discusses the deterministic nature of certain SHAP calculations.

In auto-shap, we still employ multiprocessing when using the KernelExplainer: our results would not be perfectly deterministic even on a single core, and multiprocessing gives us a nice speed improvement. To turn off multiprocessing in this case, set n_jobs=1 when calling generate_shap_values(). See more details on this function call below.
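To make the "feature-level attributions vary, total attribution stays the same" behavior concrete, here is a minimal sketch in plain Python. It does not use the KernelExplainer's algorithm; it swaps in a simpler permutation-sampling estimate of Shapley values on a toy function. The names `model`, `marginal_contributions`, and the other helpers are hypothetical and are not part of auto-shap or SHAP.

```python
import itertools
import random


def model(features):
    """Toy model with an interaction term, standing in for a fitted estimator."""
    a, b, c = features
    return 2.0 * a + b * c


def marginal_contributions(x, baseline, order):
    """Walk one feature ordering, switching features from baseline to x."""
    current = list(baseline)
    previous_output = model(current)
    contributions = [0.0] * len(x)
    for index in order:
        current[index] = x[index]
        new_output = model(current)
        contributions[index] = new_output - previous_output
        previous_output = new_output
    return contributions


def average_over(orderings, x, baseline):
    """Average the marginal contributions over the given orderings."""
    totals = [0.0] * len(x)
    for order in orderings:
        for i, value in enumerate(marginal_contributions(x, baseline, order)):
            totals[i] += value
    return [total / len(orderings) for total in totals]


def sampled_attributions(x, baseline, n_orderings, seed):
    """Estimate attributions from a handful of randomly shuffled orderings."""
    rng = random.Random(seed)
    orderings = []
    for _ in range(n_orderings):
        order = list(range(len(x)))
        rng.shuffle(order)
        orderings.append(order)
    return average_over(orderings, x, baseline)


def exact_attributions(x, baseline):
    """Average over every possible ordering (only feasible for few features)."""
    orderings = list(itertools.permutations(range(len(x))))
    return average_over(orderings, x, baseline)


x = (1.0, 2.0, 3.0)
baseline = (0.0, 0.0, 0.0)
run_one = sampled_attributions(x, baseline, n_orderings=5, seed=1)
run_two = sampled_attributions(x, baseline, n_orderings=5, seed=2)
# Per-feature values depend on which orderings were drawn, but each run
# sums to model(x) - model(baseline) = 8.0.
print(run_one, sum(run_one))
print(run_two, sum(run_two))
```

Every single ordering distributes the full difference between the prediction and the baseline across the features, so any average of orderings keeps the same total. Only the split between the interacting features depends on the random draw.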

Additionally, there is a pickle error when using multiprocessing with a scikit-learn Voting or Stacking model with SHAP. Therefore, no multiprocessing is used in such cases.

At a high level, the library will automatically detect the type of model that has been trained (regressor vs. classifier, boosting model vs. other model, etc.) and apply the correct handling. If your model is not accurately identified by the library, it's easy to specify how it should be handled.

Installation

The easiest way to install the library is with pip.

$ pip3 install auto-shap

Quick Example

Once installed, SHAP values can be calculated as follows.

$ python3
>>> from auto_shap.auto_shap import generate_shap_values
>>> from sklearn.datasets import load_breast_cancer
>>> from sklearn.ensemble import ExtraTreesClassifier
>>> x, y = load_breast_cancer(return_X_y=True, as_frame=True)
>>> model = ExtraTreesClassifier()
>>> model.fit(x, y)
>>> shap_values_df, shap_expected_value, global_shap_df = generate_shap_values(model, x)

There you have it!

What's more, you can change to a completely new model without changing any of the auto-shap code.

$ python3
>>> from auto_shap.auto_shap import generate_shap_values
>>> from sklearn.datasets import load_diabetes
>>> from sklearn.ensemble import GradientBoostingRegressor
>>> x, y = load_diabetes(return_X_y=True, as_frame=True)
>>> model = GradientBoostingRegressor()
>>> model.fit(x, y)
>>> shap_values_df, shap_expected_value, global_shap_df = generate_shap_values(model, x)

auto-shap detected this was a boosted regressor and handled such a case appropriately.

Saving Output

The library also provides a helper function for saving output and plots to a local directory.

$ python3
>>> from auto_shap.auto_shap import produce_shap_values_and_summary_plots
>>> from sklearn.datasets import load_diabetes
>>> from sklearn.ensemble import GradientBoostingRegressor
>>> x, y = load_diabetes(return_X_y=True, as_frame=True)
>>> model = GradientBoostingRegressor()
>>> model.fit(x, y)
>>> produce_shap_values_and_summary_plots(model=model, x_df=x, save_path='shap_output')

The above code will save three files into a "files" subdirectory in the specified save_path directory.

Likewise, two plots will be saved into a "plots" subdirectory.

Multiprocessing Support

By default, the maximum number of cores available is used to calculate SHAP values in parallel. To manually set the number of cores to use, you can do the following.

>>> generate_shap_values(model, x_df, n_jobs=4)

For small datasets, multiprocessing may not add much in terms of performance and could even slow down computation times due to the overhead of spinning up a multiprocessing pool. To turn off multiprocessing, set n_jobs=1.
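The general pattern behind this kind of parallelism can be sketched as: split the rows into chunks, explain each chunk in its own process, and concatenate the results in order. The following is an illustration under stated assumptions, not auto-shap's actual implementation. `split_rows`, `explain_rows`, and `explain_in_parallel` are hypothetical names, and `explain_rows` is a deterministic stand-in for a real per-chunk SHAP computation.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def split_rows(rows, n_chunks):
    """Split rows into at most n_chunks contiguous, near-equal chunks."""
    n_chunks = max(1, min(n_chunks, len(rows)))
    size, remainder = divmod(len(rows), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < remainder else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks


def explain_rows(rows):
    """Hypothetical stand-in for computing SHAP values on one chunk of rows."""
    return [[2.0 * a, -1.0 * b] for a, b in rows]


def explain_in_parallel(rows, n_jobs=None):
    """Explain rows chunk by chunk, using a process pool when n_jobs > 1."""
    # Assumption: fall back to every available core when n_jobs is not given.
    n_jobs = n_jobs or os.cpu_count() or 1
    chunks = split_rows(rows, n_jobs)
    if n_jobs == 1:
        # No pool is created, so there is no start-up overhead.
        results = [explain_rows(chunk) for chunk in chunks]
    else:
        with ProcessPoolExecutor(max_workers=n_jobs) as pool:
            # pool.map preserves chunk order, so row order is preserved too.
            results = list(pool.map(explain_rows, chunks))
    return [row for chunk_result in results for row in chunk_result]
```

Because each row is explained independently and the chunks are stitched back together in order, a deterministic explainer gives the same output for any value of n_jobs. When calling such a function with n_jobs greater than 1 from a script, place the call under an `if __name__ == "__main__":` guard so that worker processes can start cleanly on every platform.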

Overriding Auto-Detection

Using generate_shap_values() or produce_shap_values_and_summary_plots() will leverage auto-detection of certain model characteristics. Those characteristics are as follows, which are all controlled with Booleans:

Though auto-shap will natively handle most common models, it is not yet tuned to handle every possible type of model. Therefore, in some cases, you may have to manually set one or more of the above Booleans in the function calls. At present and at minimum, auto-shap will work with the following models.

For models whose qualities auto-shap cannot detect, it will fall back to using the Kernel Explainer, which is model-agnostic.
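One way to picture how such Boolean flags could map to an explainer is a simple dispatch function. This is a hypothetical sketch: the flag names are borrowed from the function signatures in this README, but `choose_explainer` is not part of auto-shap, and the precedence order shown here is an assumption.

```python
def choose_explainer(use_kernel=False, voting_or_stacking_model=False,
                     calibrated_model=False, tree_model=False,
                     linear_model=False):
    """Return a label for the explainer a set of flags would select.

    Assumed precedence: an explicit Kernel request or a voting/stacking
    model wins, then calibrated models, then tree models, then linear
    models. Anything else falls back to the model-agnostic Kernel Explainer.
    """
    if use_kernel or voting_or_stacking_model:
        return "kernel"
    if calibrated_model:
        return "calibrated"
    if tree_model:
        return "tree"
    if linear_model:
        return "linear"
    return "kernel"


print(choose_explainer(tree_model=True))
print(choose_explainer(tree_model=True, use_kernel=True))
print(choose_explainer())
```

Setting a flag by hand corresponds to overriding one of these inputs when auto-detection gets a model wrong.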

CalibratedClassifierCV

The auto-shap library provides support for scikit-learn's CalibratedClassifierCV. This implementation will extract the SHAP values for every base estimator in the calibration ensemble. The SHAP values will then be averaged. For details on the CalibratedClassifierCV, please go to the documentation. Since we are extracting only the SHAP values for the base estimator, we will miss some detail since we are not using the full calibrator pair. Therefore, while these SHAP values will still be instructive, they will not be perfectly precise. For more precision, we would need to use the KernelExplainer. The main benefit of averaging the results of the base estimators is computational as the KernelExplainer can be quite slow.
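The averaging step described above can be sketched in a few lines of plain Python. The matrices below are made-up numbers standing in for the per-estimator SHAP values (rows are observations, columns are features), and `average_shap_matrices` is a hypothetical helper rather than an auto-shap function.

```python
def average_shap_matrices(matrices):
    """Average equally shaped SHAP matrices element by element."""
    n_models = len(matrices)
    n_rows = len(matrices[0])
    n_features = len(matrices[0][0])
    return [
        [sum(m[r][f] for m in matrices) / n_models for f in range(n_features)]
        for r in range(n_rows)
    ]


# Illustrative values only: one matrix per base estimator in the ensemble.
estimator_one = [[1.0, 2.0], [3.0, 4.0]]
estimator_two = [[3.0, 4.0], [5.0, 6.0]]
averaged = average_shap_matrices([estimator_one, estimator_two])
print(averaged)  # [[2.0, 3.0], [4.0, 5.0]]
```

The calibrators themselves play no part in this average, which is why the result is instructive but not perfectly precise.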

To use Kernel SHAP, one can do the following. More or less, this will ignore the auto-detected model qualities.

>>> generate_shap_values(model, x_df, use_kernel=True)

Since the Kernel Explainer can be computationally expensive, x_df can be subsampled with either the sample_size or the k parameter. The former takes random samples, and the latter takes k-means summarized samples.
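The difference between the two subsampling strategies can be illustrated with a small, dependency-free sketch. The parameter names sample_size and k come from the text above; the helper functions are hypothetical, and how auto-shap performs these steps internally is not shown here (the SHAP library itself provides `shap.sample` and `shap.kmeans` for the same purposes).

```python
import random


def random_subsample(rows, sample_size, seed=0):
    """Pick sample_size rows uniformly at random, without replacement."""
    return random.Random(seed).sample(rows, sample_size)


def kmeans_summary(rows, k, n_iterations=10, seed=0):
    """Summarize rows as k centroids using a few rounds of Lloyd's algorithm."""
    rng = random.Random(seed)
    centroids = [list(row) for row in rng.sample(rows, k)]
    for _ in range(n_iterations):
        clusters = [[] for _ in range(k)]
        for row in rows:
            # Assign each row to its nearest centroid (squared distance).
            distances = [
                sum((a - b) ** 2 for a, b in zip(row, centroid))
                for centroid in centroids
            ]
            clusters[distances.index(min(distances))].append(row)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


rows = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
picked = random_subsample(rows, sample_size=2)
summary = kmeans_summary(rows, k=2)
print(picked)
print(sorted(summary))
```

Random sampling returns actual rows from the data, whereas the k-means summary returns synthetic points (cluster centers) that may not match any real observation.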

Voting and Stacking Models

If auto-shap detects a voting or stacking model, it will automatically use the Kernel Explainer. Kernel SHAP is computationally expensive, so you may want to pass a sample of x_df or use the previously discussed sample_size and k arguments.

Additionally, there is a pickle error when using multiprocessing with a scikit-learn Voting or Stacking model with SHAP. Therefore, no multiprocessing is used in such cases, which is more motivation for subsetting x_df.

Other Potentially Useful Functionality

The generate_shap_values function relies on a few underlying functions that can be accessed directly and have the corresponding arguments and datatypes.

get_shap_expected_value(explainer: callable, boosting_model: bool, linear_model) -> float

generate_shap_global_values(shap_values: np.array, x_df: pd.DataFrame) -> pd.DataFrame

produce_shap_output_with_agnostic_explainer(model: callable, x_df: pd.DataFrame, boosting_model: bool,
                                            regression_model: bool, linear_model: bool,
                                            return_df: bool = True, n_jobs: int = None,
                                            sample_size: int = None, k: int = None) -> tuple

produce_shap_output_with_tree_explainer(model: callable, x_df: pd.DataFrame, boosting_model: bool,
                                        regression_model: bool, linear_model: bool,
                                        return_df: bool = True, n_jobs: int = None) -> tuple

produce_shap_output_with_linear_explainer(model: callable, x_df: pd.DataFrame, regression_model: bool,
                                          linear_model: bool, return_df: bool = True, n_jobs: int = None) -> tuple

produce_shap_output_for_calibrated_classifier(model: callable, x_df: pd.DataFrame, boosting_model: bool,
                                              linear_model: bool, n_jobs: int = None) -> tuple

produce_raw_shap_values(model: callable, x_df: pd.DataFrame, use_agnostic: bool, linear_model: bool,
                        tree_model: bool, calibrated_model: bool, boosting_model: bool, regression_model: bool,
                        voting_or_stacking_model: bool = False, n_jobs: int = None, sample_size: int = None,
                        k: int = None) -> tuple

generate_shap_summary_plot(shap_values: np.array, x_df: pd.DataFrame, plot_type: str, save_path: str,
                           file_prefix: str = None)

The End

Enjoy explaining your models with auto-shap! Feel free to report any issues.