This PR adds a prototype of an interpretation module, which mainly aims to generate observations and suggestions for each BucketAnalysisResult.
Details
For example, given the bucket analysis result (BucketAnalysisResult) of the span_length feature, this PR can generate:
Observations
{'F1':
[
InterpretationObservation(
keywords='performance_description',
content='The largest performance gap between different buckets is 0.08127341977475944, and the best performance is 0.9279887482419127 worse performance is 0.8467153284671532'),
InterpretationObservation(
keywords='correlation_description',
content='The correlation between the model performance and feature value of span_length is: -1.0'),
InterpretationObservation(
keywords='unreliable_buckets',
content="The number of samples in these buckets['(4, 6)'] are relatively fewer (<= 100),which may result in a large variance."
)]}
Suggestions
{'F1':
[
InterpretationSuggestion(
keywords='correlation_description',
content='If the absolute value of correlation is greater than 0.9, it means that the performance of the system is highly affected by features. Consider improving the training samples under appropriate feature value of span_length to improve the model performance.'),
InterpretationSuggestion(
keywords='unreliable_buckets',
content='If the performance on these unreliable are also low, please check whether the corresponding samples in the training set are fewer as well, and consider introducing more samples to further improve the performance.'
)]}
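For reference, the two container types used above could look roughly like the sketch below. This is only a minimal illustration based on the fields visible in the example output (`keywords` and `content`); it is not necessarily the actual definition in this PR.

```python
from dataclasses import dataclass


@dataclass
class InterpretationObservation:
    """A textual observation about one aspect of a BucketAnalysisResult.

    `keywords` names the aspect (e.g. 'correlation_description') and
    `content` holds the human-readable text.
    """

    keywords: str
    content: str


@dataclass
class InterpretationSuggestion:
    """A suggestion paired with an observation via the same `keywords` value."""

    keywords: str
    content: str


# The module returns one list per metric name (here 'F1'), mirroring the example above.
observations: dict[str, list[InterpretationObservation]] = {
    'F1': [
        InterpretationObservation(
            keywords='correlation_description',
            content='The correlation between the model performance and '
            'feature value of span_length is: -1.0',
        ),
    ],
}
```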
References
Blocked by
Discussion 1: I originally planned to put this code in explainaboard_web, but I think it is also suitable to place it here. One interesting application scenario is that these observations and suggestions could play a role similar to lint or mypy: every time system developers evaluate their system using the SDK or CLI, they would get textual observations and suggestions.
Discussion 2: Automatically generating observations and suggestions is not easy; it requires the framework to be flexible enough to incorporate manually written rules. This draft reflects my current thinking, but I'm open to further discussion to make it more general.
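One possible way to keep that flexibility is to treat each manual rule as a function from a bucket-level result to an optional interpretation and iterate over a registry. The sketch below is purely illustrative; `BucketResult`, `Interpretation`, `register_rule`, and the rule itself are hypothetical names, not the API in this PR.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class BucketResult:
    # Metric value and sample count per bucket, e.g. {'(4, 6)': 48}.
    performances: dict[str, float]
    sample_counts: dict[str, int]


@dataclass
class Interpretation:
    keywords: str
    observation: str
    suggestion: str


Rule = Callable[[BucketResult], Optional[Interpretation]]
RULES: list[Rule] = []


def register_rule(rule: Rule) -> Rule:
    """Add a manually written rule to the registry."""
    RULES.append(rule)
    return rule


@register_rule
def unreliable_buckets(result: BucketResult) -> Optional[Interpretation]:
    # Mirrors the 'unreliable_buckets' output shown above: flag small buckets.
    small = [name for name, n in result.sample_counts.items() if n <= 100]
    if not small:
        return None
    return Interpretation(
        keywords='unreliable_buckets',
        observation=f'The number of samples in these buckets {small} is '
        'relatively small (<= 100), which may result in a large variance.',
        suggestion='If the performance on these unreliable buckets is also low, '
        'check whether the corresponding training samples are scarce, '
        'and consider adding more samples.',
    )


def interpret(result: BucketResult) -> list[Interpretation]:
    """Run every registered rule and keep the ones that fire."""
    return [out for rule in RULES if (out := rule(result)) is not None]
```

With a design along these lines, adding a new piece of domain knowledge would only require writing and registering one more rule function.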
Important TODOs for the future
Interpretation for two or more systems
Reflect interpretation in the ExplainaBoard SDK and CLI
Reflect interpretation in ExplainaBoard web
Comprehensively incorporate best practices and our domain knowledge into the interpretation module
It looks like this is a new feature of the library, and it may prompt some discussion. It would be better to open an issue to track this feature rather than discussing it on each pull request.