iterative / studio-support

❓ DVC Studio Issues, Questions, and Discussions
https://studio.iterative.ai

Add support for shap interactive plots #82

Open Jose-Verdu-Diaz opened 1 year ago

Jose-Verdu-Diaz commented 1 year ago

Shap is a powerful library for model interpretation that creates interactive plots. These plots can be saved as HTML and embedded elsewhere, as long as the required JavaScript is loaded. Exporting the plots as static image files is possible, but some functionality is lost.

I'm unsure of how plots are loaded into Studio, but it seems possible to modify the way dvc_plots/index.html is created to include the plot's HTML and the required shap JavaScript.

The HTML of the plot can be obtained as follows:

import shap

# model: trained model to explain (a tree-based model, e.g. from scikit-learn)
# X: training data used to compute the explanations

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap_plot = shap.force_plot(explainer.expected_value, shap_values, X)

shap_plot_html = shap_plot.html()  # HTML code of the plot

The JavaScript can be obtained with the following command:

shap.getjs()
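
For reference, here is a minimal sketch of how the two pieces could be stitched together into a standalone HTML page (shap_plot is the object from the snippet above; the file name and page template are just placeholders, not anything Studio uses today). shap also provides shap.save_html, which writes a similar self-contained file:

import shap

# shap.getjs() returns a <script> tag containing the bundled shap JavaScript,
# and shap_plot.html() returns the markup for the plot itself.
page = "<html><head>{js}</head><body>{plot}</body></html>".format(
    js=shap.getjs(), plot=shap_plot.html()
)

with open("shap_force_plot.html", "w", encoding="utf-8") as f:
    f.write(page)

# Alternatively, let shap write the standalone file directly:
# shap.save_html("shap_force_plot.html", shap_plot)
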
tapadipti commented 1 year ago

@Jose-Verdu-Diaz Thank you for creating the issue. Can you please share if and how you have used or plan to use Shap in the context of DVC projects? Any details that can help determine the priority for this would be helpful.

Jose-Verdu-Diaz commented 1 year ago

@tapadipti Shap implements state-of-the-art techniques for AI explainability. The package is relatively new (around 5 years old), but it has gained huge popularity in that time and has great potential to become the go-to package for model interpretability. During the development stage of a model, understanding the model's predictions is critical for designing the data processing pipeline (feature selection, oversampling/undersampling, data imputation, ...) and for model optimization. Furthermore, model explainability helps in monitoring deployed models, as changes in the data post-deployment can lead to bugs that would otherwise be hard or impossible to detect. This nice paper explains this in detail, with precise real-life examples.

Currently, DVC allows generating interactive plots from scikit-learn to monitor model training and model performance. Comparing model performance between versions is a critical feature that is already implemented. Implementing interactive shap plots would expand this, making it possible to compare how different versions of the pipeline affect the way a model works. As an example: modifying a "data imputation" stage can introduce a bias in the model that increases the performance metrics. If the inner functioning of the model is not understood, this could go unnoticed all the way to deployment.

I'm completely ignorant of how DVC creates the interactive plots shown in Studio, as I'm a new user of the Iterative suite. But my intuition is that it should be possible to embed the HTML of the shap plots and their JavaScript in the "plots" tab of Studio.

About the priority of this, I'm biased towards giving it a "mid" priority because I would start using this right away, but I understand that shap (18.9k stars) is not as popular as sklearn (53.6k stars) and that model explainability is a field that is still maturing, so this might not be a highly demanded feature right now.

tapadipti commented 1 year ago

@Jose-Verdu-Diaz Thanks a lot 🙏 for the detailed explanation and indications about priority. This is helpful 👍