qiime2 / q2-sample-classifier

QIIME 2 plugin for machine learning prediction of sample data.
BSD 3-Clause "New" or "Revised" License

Support for SHAP #219

Open mortonjt opened 1 year ago

mortonjt commented 1 year ago

Addition Description SHAP is one of the state-of-the-art methods for computing feature importance using concepts from game theory. Briefly, for each prediction, SHAP estimates how much each feature contributed to that prediction by computing leave-one-feature-out estimates across all possible subsets of features (making it optimal while remaining scalable). Shapley values can be positive or negative, indicating whether a feature contributed "positively" or "negatively" to a prediction. See the original paper for details, as well as the follow-up solution for tree-ensemble methods.
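The subset-enumeration idea above can be sketched with a brute-force exact Shapley computation (a toy illustration only, not how the shap package computes values at scale; the `predict` callable and feature values here are made up):

```python
from itertools import combinations
from math import factorial

def shapley_values(x, predict):
    """Exact Shapley values for one sample by enumerating all feature subsets.

    x: the sample's feature values; predict(subset) returns the model
    output when only the features in `subset` are considered "present".
    """
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for s in combinations(others, k):
                # Weight each subset by how often it appears across
                # all orderings of the n features.
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                # Marginal contribution of feature i given subset s.
                phi[i] += weight * (predict(set(s) | {i}) - predict(set(s)))
    return phi

# Toy additive "model": the prediction is the sum of present feature
# values, so each feature's Shapley value is exactly its own value
# (and values are signed, unlike strictly positive importances).
x = [2.0, -1.0, 3.0]
predict = lambda subset: sum(x[j] for j in subset)
print(shapley_values(x, predict))  # → [2.0, -1.0, 3.0]
```

This also shows why the exact computation is exponential in the number of features, and why the tree-specific solution in the second reference matters for tree ensembles.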

Current Behavior Feature importance is estimated by leave-one-feature-out estimation on the full table only (i.e., for 1000 features, feature importance is based on 1000 iterations, each leaving out one feature). Feature importances are strictly positive, so directionality cannot be inferred. It is also suboptimal.

Proposed Behavior It would be useful to have a separate method that computes Shapley values for Gradient Boosting or Random Forest classifiers. The syntax is simple, requiring two lines of additional code after fitting the model (see here). I have verified that this code is functional.

Questions

  1. Would having an optional dependency on the Shap package be acceptable? If there is a separate command, it is easier to keep it self-contained without adding Shap as a required dependency for the entire QIIME 2 suite.

Comments

  1. There are many options for visualization, in terms of both overall feature contributions and interactions between features. While the force plot is a reasonable default visualization, I think the output Shapley values should be the minimum output, since there are so many use cases for interpreting them.

References

  1. https://proceedings.neurips.cc/paper/2017/hash/8a20a8621978632d76c43dfd28b67767-Abstract.html
  2. https://www.nature.com/articles/s42256-019-0138-9
nbokulich commented 1 year ago

Hi @mortonjt ,

Thanks for opening this feature request! Adding a SHAP wrapper has been on the unwritten issue list for some time now 😁

Would you be interested in working on this method?

Would having an optional dependency on the Shap package be acceptable? If there is a separate command, it is easier to keep it self-contained without adding Shap as a required dependency for the entire QIIME 2 suite.

Technically this would be possible, but maybe not desirable, as it complicates installation. How large is the SHAP package? I think we should make SHAP a dependency if the license is compatible, and as long as it does not introduce conflicts. CC: @ebolyen @misialq for any thoughts on this.

There are many options for visualization, in terms of both overall feature contributions and interactions between features. While the force plot is a reasonable default visualization, I think the output Shapley values should be the minimum output, since there are so many use cases for interpreting them.

Yes! I agree: output the SHAP values, and these can be passed to various other plots... this also gives more flexibility in case other relevant visualization options are added in other Q2 plugins.
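Outputting the values as a tabular artifact would make this easy to pass around. A minimal sketch, assuming a samples-by-features SHAP matrix is already in hand (the random matrix and the sample/feature IDs here are placeholders):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Stand-in for per-sample SHAP values (samples x features); in practice
# this matrix would come from the SHAP explainer.
shap_matrix = rng.normal(size=(6, 4))
samples = [f"S{i}" for i in range(6)]      # hypothetical sample IDs
features = [f"ASV{j}" for j in range(4)]   # hypothetical feature IDs

shap_df = pd.DataFrame(shap_matrix, index=samples, columns=features)
shap_df.to_csv("shap_values.tsv", sep="\t")  # tabular output other tools can read

# Ranking features by mean |SHAP| recovers an importance ordering,
# while the signed per-sample values retain directionality.
ranking = shap_df.abs().mean().sort_values(ascending=False)
```

From such a table, any downstream visualizer (in this or another plugin) can build force plots, summary plots, or its own displays.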

cc: @adamovanja

lizgehret commented 6 months ago

Closing this issue since the related PRs have been closed.

nbokulich commented 2 months ago

Re-opening as this has been requested again on the QIIME 2 forum.

Forum x-ref