neulab / ExplainaBoard

Interpretable Evaluation for AI Systems

prototype interpretation module - for bucket analysis #566

Closed pfliu-nlp closed 1 year ago

pfliu-nlp commented 1 year ago

Overview

This PR adds a prototype of an interpretation module, whose main purpose is to generate observations and suggestions for each BucketAnalysisResult.
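For illustration only, here is a minimal sketch of what the observation and suggestion objects could look like, assuming plain Python dataclasses with the `keywords` and `content` fields shown in the example output below; the actual class definitions in this PR may differ.

```python
# Illustrative sketch only -- field names taken from the example output below;
# everything else is an assumption about how the real classes might be defined.
from dataclasses import dataclass


@dataclass
class InterpretationObservation:
    keywords: str  # category of the observation, e.g. 'correlation_description'
    content: str   # human-readable description of what was observed


@dataclass
class InterpretationSuggestion:
    keywords: str  # category of the observation this suggestion refers to
    content: str   # actionable advice derived from the observation
```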

Details

For example, given the bucket analysis result (BucketAnalysisResult) of the span_length feature, this PR can generate the following:

Observations

```python
{'F1': [
    InterpretationObservation(
        keywords='performance_description',
        content='The largest performance gap between different buckets is 0.08127341977475944, and the best performance is 0.9279887482419127 worse performance is 0.8467153284671532'),
    InterpretationObservation(
        keywords='correlation_description',
        content='The correlation between the model performance and feature value of span_length is: -1.0'),
    InterpretationObservation(
        keywords='unreliable_buckets',
        content="The number of samples in these buckets['(4, 6)'] are relatively fewer (<= 100),which may result in a large variance."),
]}
```
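For context, the gap and correlation observations above could be computed roughly as follows. This is only a hedged sketch: `bucket_values`, `bucket_performances`, and the use of Spearman correlation are assumptions for illustration, not the PR's actual implementation.

```python
# Rough sketch of how a performance-gap and correlation observation could be
# derived from per-bucket results; variable names here are hypothetical.
from scipy import stats

bucket_values = [1.0, 3.0, 5.0]            # e.g. a representative span_length per bucket
bucket_performances = [0.93, 0.88, 0.85]   # e.g. F1 score per bucket

gap = max(bucket_performances) - min(bucket_performances)
corr, _ = stats.spearmanr(bucket_values, bucket_performances)

print(f"largest performance gap: {gap:.4f}")
print(f"correlation with feature value: {corr:.1f}")  # -1.0 for a monotonic decrease
```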

Suggestions

```python
{'F1': [
    InterpretationSuggestion(
        keywords='correlation_description',
        content='If the absolute value of correlation is greater than 0.9, it means that the performance of the system is highly affected by features. Consider improving the training samples under appropriate feature value of span_length to improve the model performance.'),
    InterpretationSuggestion(
        keywords='unreliable_buckets',
        content='If the performance on these unreliable are also low, please check whether the corresponding samples in the training set are fewer as well, and consider introducing more samples to further improve the performance.'),
]}
```
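The suggestions above read like the output of simple threshold rules (|correlation| > 0.9, bucket size <= 100). A hedged sketch of such rule-based checks, with hypothetical function and variable names rather than the PR's actual API, might look like this:

```python
# Sketch of rule-based suggestion generation using the thresholds mentioned in
# the example output; names and signatures are hypothetical, not the PR's API.
from typing import Dict, List


def make_suggestions(corr: float, bucket_sizes: Dict[str, int],
                     feature_name: str) -> List[str]:
    suggestions = []
    if abs(corr) > 0.9:
        suggestions.append(
            f"Performance is highly affected by {feature_name}; consider adding "
            f"training samples for the affected feature values.")
    small = [name for name, size in bucket_sizes.items() if size <= 100]
    if small:
        suggestions.append(
            f"Buckets {small} have few samples (<= 100), which may cause large "
            f"variance; consider adding more samples.")
    return suggestions


print(make_suggestions(-1.0, {"(1, 3)": 500, "(4, 6)": 40}, "span_length"))
```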

References

Blocked by

Important TODOs for the future

odashi commented 1 year ago

It looks like this is a new feature of the library, and it may prompt some discussion. It would be better to open an issue to track this feature rather than discussing it on each pull request.