neulab / ExplainaBoard

Interpretable Evaluation for AI Systems

prototype interpretation module - for bucket analysis #566

Closed pfliu-nlp closed 1 year ago

pfliu-nlp commented 1 year ago

Overview

This PR adds a prototype of an interpretation module, whose main purpose is to generate observations and suggestions for each BucketAnalysisResult.
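For illustration only, here is a minimal sketch of what the observation and suggestion objects could look like, assuming plain Python dataclasses with the `keywords` and `content` fields shown in the example output below; the actual class definitions in this PR may differ.

```python
# Illustrative sketch only -- field names taken from the example output below;
# everything else is an assumption about how the real classes might be defined.
from dataclasses import dataclass


@dataclass
class InterpretationObservation:
    keywords: str  # category of the observation, e.g. 'correlation_description'
    content: str   # human-readable description of what was observed


@dataclass
class InterpretationSuggestion:
    keywords: str  # category of the observation this suggestion refers to
    content: str   # actionable advice derived from the observation
```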

Details

For example, given the bucket analysis result (BucketAnalysisResult) of the span_length feature, this PR can generate the following:

Observations

```python
{'F1': [
    InterpretationObservation(
        keywords='performance_description',
        content='The largest performance gap between different buckets is 0.08127341977475944, and the best performance is 0.9279887482419127 worse performance is 0.8467153284671532'),
    InterpretationObservation(
        keywords='correlation_description',
        content='The correlation between the model performance and feature value of span_length is: -1.0'),
    InterpretationObservation(
        keywords='unreliable_buckets',
        content="The number of samples in these buckets['(4, 6)'] are relatively fewer (<= 100),which may result in a large variance."),
]}
```
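For context, the gap and correlation observations above could be computed roughly as follows. This is only a hedged sketch: `bucket_values`, `bucket_performances`, and the use of Spearman correlation are assumptions for illustration, not the PR's actual implementation.

```python
# Rough sketch of how a performance-gap and correlation observation could be
# derived from per-bucket results; variable names here are hypothetical.
from scipy import stats

bucket_values = [1.0, 3.0, 5.0]            # e.g. a representative span_length per bucket
bucket_performances = [0.93, 0.88, 0.85]   # e.g. F1 score per bucket

gap = max(bucket_performances) - min(bucket_performances)
corr, _ = stats.spearmanr(bucket_values, bucket_performances)

print(f"largest performance gap: {gap:.4f}")
print(f"correlation with feature value: {corr:.1f}")  # -1.0 for a monotonic decrease
```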

Suggestions

```python
{'F1': [
    InterpretationSuggestion(
        keywords='correlation_description',
        content='If the absolute value of correlation is greater than 0.9, it means that the performance of the system is highly affected by features. Consider improving the training samples under appropriate feature value of span_length to improve the model performance.'),
    InterpretationSuggestion(
        keywords='unreliable_buckets',
        content='If the performance on these unreliable are also low, please check whether the corresponding samples in the training set are fewer as well, and consider introducing more samples to further improve the performance.'),
]}
```
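The suggestions above read like the output of simple threshold rules (|correlation| > 0.9, bucket size <= 100). A hedged sketch of such rule-based checks, with hypothetical function and variable names rather than the PR's actual API, might look like this:

```python
# Sketch of rule-based suggestion generation using the thresholds mentioned in
# the example output; names and signatures are hypothetical, not the PR's API.
from typing import Dict, List


def make_suggestions(corr: float, bucket_sizes: Dict[str, int],
                     feature_name: str) -> List[str]:
    suggestions = []
    if abs(corr) > 0.9:
        suggestions.append(
            f"Performance is highly affected by {feature_name}; consider adding "
            f"training samples for the affected feature values.")
    small = [name for name, size in bucket_sizes.items() if size <= 100]
    if small:
        suggestions.append(
            f"Buckets {small} have few samples (<= 100), which may cause large "
            f"variance; consider adding more samples.")
    return suggestions


print(make_suggestions(-1.0, {"(1, 3)": 500, "(4, 6)": 40}, "span_length"))
```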

References

Blocked by

Important TODOs for the future

odashi commented 1 year ago

It looks like this is a new feature of the library, and it may prompt some discussion. It would be better to open an issue to track this feature rather than discussing it on each pull request.