interpretml / interpret-community

Interpret-Community extends the Interpret repository with additional interpretability techniques and utility functions for handling real-world datasets and workflows.
https://interpret-community.readthedocs.io/en/latest/index.html
MIT License

Question. How good is my surrogate model? #502

Open SamiurRahman1 opened 2 years ago

SamiurRahman1 commented 2 years ago

Hi, I have seen that there is a function to calculate the R^2 score of the surrogate model. I was wondering, are there any other simple metrics implemented to measure how good the surrogate model is?

Thanks

imatiach-msft commented 2 years ago

hi @SamiurRahman1, the score can be computed via get_surrogate_model_replication_measure, which was just made public as part of resolving this issue: https://github.com/interpretml/interpret-community/issues/452 and PR: https://github.com/interpretml/interpret-community/pull/495. We currently don't have other metrics, but I think it may be possible to add more. Note that this is only a measure of how well the surrogate model fits the teacher model; it doesn't tell you how accurate the explanations themselves are, and in this case they are just approximations.

Can you talk a bit more about your use case? If you require the model to be interpretable and the explanations can't be approximations, then you may want to consider using a glassbox model.

You may also want to consider using permutation feature importance instead, which permutes columns one at a time on a trained model (there is another variant that retrains the model, which is not implemented in this repository) and assigns each feature an importance based on how much a chosen metric changes for the permuted column. Note that this method can assign misleading importances when features are highly correlated. It is also slower than the mimic explainer and isn't really feasible if you have high-dimensional data, including sparse data. Hope that info helps.
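As a rough, self-contained sketch of how that score can be computed with a MimicExplainer (the toy teacher model and dataset below are just illustrative, not something from this thread):

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from interpret_community.mimic.mimic_explainer import MimicExplainer
    from interpret_community.mimic.models.lightgbm_model import LGBMExplainableModel

    # A small stand-in teacher model (any scikit-learn style model works)
    data = load_breast_cancer()
    x_train, y_train = data.data, data.target
    teacher = RandomForestClassifier(n_estimators=50).fit(x_train, y_train)

    # Fit a LightGBM surrogate ("student") that mimics the teacher's predictions
    explainer = MimicExplainer(teacher, x_train, LGBMExplainableModel,
                               features=list(data.feature_names))

    # Accuracy (classification) or R^2 (regression) of the surrogate vs. the teacher
    print(explainer.get_surrogate_model_replication_measure(x_train))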

imatiach-msft commented 2 years ago

An amazing free book on interpretability has a great chapter on global surrogate models: https://christophm.github.io/interpretable-ml-book/global.html. I think the sections on advantages and disadvantages summarize this method very well. Note that it doesn't mention any metrics other than R-squared, but I think we could add a lot of other metrics.

imatiach-msft commented 2 years ago

Note that we currently use the accuracy metric for classification and R^2 for regression:

    def get_surrogate_model_replication_measure(self, training_data):
        """Return the metric which tells how well the surrogate model replicates the teacher model.
        For classification scenarios, this function will return accuracy. For regression scenarios,
        this function will return r2_score.
        :param training_data: The data for getting the replication metric.
        :type training_data: numpy.ndarray or pandas.DataFrame or scipy.sparse.csr_matrix
        :return: Metric that tells how well the surrogate model replicates the behavior of teacher model.
        :rtype: float
        """

But I think a lot of other metrics could be added. I think it might even be interesting to run the surrogate model through error analysis, where the "true" labels are actually the predicted labels from the teacher model, to see where the surrogate model is making errors. You can find the ErrorAnalysisDashboard here: https://github.com/microsoft/responsible-ai-toolbox, with a lot of notebook examples here: https://github.com/microsoft/responsible-ai-toolbox/tree/main/notebooks/individual-dashboards/erroranalysis-dashboard
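A rough sketch of that idea using plain scikit-learn and pandas instead of the dashboard (teacher_model, surrogate_model, x_test and feature_names are placeholders):

    import pandas as pd
    from sklearn.metrics import classification_report

    # Use the teacher model's predictions as the "true" labels for the surrogate
    teacher_labels = teacher_model.predict(x_test)
    surrogate_labels = surrogate_model.predict(x_test)

    # Per-class view of where the surrogate fails to replicate the teacher
    print(classification_report(teacher_labels, surrogate_labels))

    # Slice the disagreements by feature to look for systematic error regions
    df = pd.DataFrame(x_test, columns=feature_names)
    df["disagrees_with_teacher"] = teacher_labels != surrogate_labels
    print(df.groupby("disagrees_with_teacher").mean())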

SamiurRahman1 commented 2 years ago

Thanks for your explanations. I might have formulated my question wrong: yes, I would like to understand or measure how well my surrogate model fits or represents my teacher model. I have read several research papers about different metrics like stability, robustness and efficiency, but I consider them more advanced metrics. Hence I was looking for other lightweight metrics like R^2.

I have read the book that you mentioned and found it very informative and useful. I have also used the currently available get_surrogate_model_replication_measure function. Thanks for suggesting the ErrorAnalysisDashboard, I will look into it.

My use case: I am trying to experiment with whether the global interpretation differs when we use interpreters that depend on local interpreters (we get results by aggregating them) and when we use interpreters that don't depend on a local interpreter (permutation feature importance). If the two scenarios give us different lists of important features, I would like to use different metrics to measure which surrogate model fits the teacher model better.
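Roughly, the comparison I have in mind looks something like this hypothetical sketch (placeholder variable names; PFIExplainer needs the true labels, unlike the mimic explainer):

    from interpret_community.mimic.mimic_explainer import MimicExplainer
    from interpret_community.mimic.models.lightgbm_model import LGBMExplainableModel
    from interpret_community.permutation.permutation_importance import PFIExplainer

    # Global importances obtained by aggregating a surrogate's local explanations
    mimic_explainer = MimicExplainer(teacher_model, x_train, LGBMExplainableModel,
                                     features=feature_names)
    mimic_global = mimic_explainer.explain_global(x_test)

    # Global importances from permutation feature importance (no local explanations)
    pfi_explainer = PFIExplainer(teacher_model, features=feature_names)
    pfi_global = pfi_explainer.explain_global(x_test, true_labels=y_test)

    # Compare the two ranked feature lists
    print(mimic_global.get_feature_importance_dict())
    print(pfi_global.get_feature_importance_dict())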

imatiach-msft commented 2 years ago

"i have read several research papers about different metrics like stability, robustness and efficiency" interesting, can you point to the papers specifically, maybe some of these could be implemented in this repository? We could create issues to mark these as methods that should be implemented.

"i am trying to experiment whether the global interpretation differs when we use interpreters which are dependent on local interpreters and when we use interpreters which don't depend on local interpreter" That sounds like really interesting research! I'm very curious to hear what you find.

SamiurRahman1 commented 2 years ago

Here are a few example papers that talk about different evaluation methods for interpreters. The one I am most interested in is number 2.

  1. https://arxiv.org/abs/1906.02108
  2. https://arxiv.org/abs/2008.05895
  3. https://arxiv.org/abs/1910.02065

imatiach-msft commented 2 years ago

I have a hard time believing the second paper's result that LIME is better than SHAP. Perhaps it holds on those datasets, but with LIME you need to set the kernel width parameter, which is very tricky to figure out; if you get it wrong you can get very bad results. SHAP doesn't have that problem. Also, all of those datasets are too similar; it sounds like none of them have high-dimensional or sparse features. Their results would be much more interesting if they evaluated on a wider range of datasets that vary a lot more.
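For reference, the kernel width in question is the kernel_width argument of LIME's tabular explainer; a minimal hypothetical sketch (placeholder variable names; the lime package documents that leaving it unset falls back to a sqrt(num_features) * 0.75 heuristic):

    from lime.lime_tabular import LimeTabularExplainer

    # kernel_width controls how "local" the weighting of perturbed samples is;
    # a poor choice can make the resulting explanations unstable or misleading
    explainer = LimeTabularExplainer(x_train,
                                     feature_names=feature_names,
                                     class_names=class_names,
                                     kernel_width=None)  # None -> default heuristic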

SamiurRahman1 commented 2 years ago

I agree with your perspective. :) Also, these papers are not from very good journals, but my main focus was the metrics. I am not worried about their results, rather the metrics they propose for evaluating different interpreters. :)