ing-bank / probatus

Validation (like Recursive Feature Elimination for SHAP) of (multiclass) classifiers & regressors and data used to develop them.
https://ing-bank.github.io/probatus
MIT License

Penalize features that have high variance in shap values when computing `calculate_shap_importance` #216

Closed: markdregan closed this issue 1 year ago

markdregan commented 1 year ago

**Problem Description**

Two features can have the same `mean_abs_shap` value as computed by `calculate_shap_importance`, yet the underlying SHAP values being averaged can be very different: one feature's values can be coherent with small variance, while the other's can have very high variance. When building an ML model, I would argue we should prefer features that are consistently important. My proposal is a small adjustment to the `mean_abs_shap` calculation that accounts for the variance of the underlying SHAP values.

**Desired Outcome**

Update the calculation of `mean_abs_shap` in `calculate_shap_importance` to account for the std of the underlying SHAP values, for example: `shap_abs_mean = np.mean(np.abs(shap_values), axis=0) - np.std(np.abs(shap_values), axis=0) / 2.0`. With this adjustment, features with high variance in their SHAP values would be penalised slightly in the feature importance ranking produced by `calculate_shap_importance`.

**Solution Outline**

A one-line adjustment in `shap_helpers.py`: `shap_abs_mean = np.mean(np.abs(shap_values), axis=0) - np.std(np.abs(shap_values), axis=0) / 2.0`
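As a rough illustration of the effect (a minimal sketch with simulated SHAP values, not probatus code; the arrays and seed are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated SHAP matrix: 1000 rows, 2 features.
# Feature 0 is coherent (small spread), feature 1 is noisy (large spread).
shap_values = np.column_stack([
    rng.normal(loc=0.5, scale=0.05, size=1000),
    rng.normal(loc=0.5, scale=0.60, size=1000),
])

# Current ranking signal: plain mean of absolute SHAP values.
shap_abs_mean = np.mean(np.abs(shap_values), axis=0)

# Proposed adjustment: subtract half the std of the absolute SHAP values,
# so features whose importance fluctuates a lot are penalised.
shap_abs_mean_adj = (
    np.mean(np.abs(shap_values), axis=0)
    - np.std(np.abs(shap_values), axis=0) / 2.0
)

print(shap_abs_mean)      # the noisy feature scores at least as high here
print(shap_abs_mean_adj)  # after the penalty, the coherent feature ranks first
```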

Question: should a parameter be added to `fit` to control turning this on/off?
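For illustration, such a switch might look like this (a sketch only; the `penalize_variance` parameter name and the simplified signature are hypothetical, not the actual probatus API):

```python
import numpy as np
import pandas as pd

def calculate_shap_importance(shap_values, columns, penalize_variance=False):
    """Simplified sketch of the helper with an opt-in variance penalty.

    penalize_variance=False keeps the current behaviour (plain mean of
    absolute SHAP values); True applies the proposed std/2 penalty.
    """
    shap_abs_mean = np.mean(np.abs(shap_values), axis=0)
    if penalize_variance:
        shap_abs_mean -= np.std(np.abs(shap_values), axis=0) / 2.0
    return pd.DataFrame(
        {"mean_abs_shap_value": shap_abs_mean}, index=columns
    ).sort_values("mean_abs_shap_value", ascending=False)
```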

Happy to submit a PR for this proposal. Let me know if it would be of interest.

ReinierKoops commented 1 year ago

I like the idea! However, I'd like to keep the default behaviour the same, so could you implement it such that a parameter controls this option?

ReinierKoops commented 1 year ago

Big thanks for your contribution!