Closed qjiang002 closed 1 year ago
In the new commit, `CalibrationAnalysis` and `CalibrationAnalysisResult` are created in `analyses.py`. By default, the [0, 1] interval is divided into 10 buckets. In the classification task's default setting, it automatically looks for the `confidence` feature and performs calibration analysis on it. It also supports custom calibration analysis, as shown in `./data/system_outputs/absa/absa-example-output-custom-calibration-analysis.json`.
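For readers unfamiliar with the two calibration errors, here is a minimal sketch of how ECE and MCE are commonly computed over equal-width confidence buckets. It follows the standard textbook definitions and is not the code in `analyses.py`; the function name `calibration_errors` and its parameters are assumptions made for illustration.

```python
from typing import List, Tuple


def calibration_errors(
    confidences: List[float],
    correct: List[bool],
    num_buckets: int = 10,
) -> Tuple[float, float]:
    """Return (ECE, MCE) using equal-width confidence buckets over [0, 1].

    Hypothetical helper for illustration; not part of ExplainaBoard.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one sample is required")

    # Per-bucket running totals: sample count, summed confidence, number correct.
    counts = [0] * num_buckets
    conf_sums = [0.0] * num_buckets
    hits = [0] * num_buckets

    for conf, is_correct in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError(f"confidence {conf} is outside [0, 1]")
        # A confidence of exactly 1.0 goes into the last bucket.
        idx = min(int(conf * num_buckets), num_buckets - 1)
        counts[idx] += 1
        conf_sums[idx] += conf
        hits[idx] += int(is_correct)

    total = len(confidences)
    ece = 0.0
    mce = 0.0
    for n, conf_sum, hit in zip(counts, conf_sums, hits):
        if n == 0:
            continue  # Empty buckets contribute nothing.
        # Gap between the bucket's accuracy and its mean confidence.
        gap = abs(hit / n - conf_sum / n)
        ece += (n / total) * gap  # ECE: sample-weighted average gap.
        mce = max(mce, gap)  # MCE: largest gap over non-empty buckets.
    return ece, mce
```

Calling `calibration_errors([0.95, 0.95, 0.55, 0.55], [True, True, True, False])` places two samples in each of two buckets, and each bucket's accuracy differs from its mean confidence by the same amount, so ECE and MCE coincide in that case.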
Thank you for all your comments!
In the new commit, `auxiliary_stats` is added to `Metric`'s `evaluate_from_stats`. It is used to calculate auxiliary metric results such as confidence. This is needed because confidence values are not part of the original metric stats, which are calculated in `get_overall_statistics`. A new `MetricStats` with confidence values is calculated in `CalibrationAnalysis`.
@qjiang002 This error can be solved by using `get_value_or_none` instead (and you need to use `assertIsNone`).
https://github.com/neulab/ExplainaBoard/actions/runs/3212963391/jobs/5252276938#step:5:234
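A rough sketch of the pattern being suggested, for anyone hitting the same failure: a strict accessor raises when the value is missing, while an `..._or_none` accessor returns `None`, which the test then checks with `assertIsNone`. The `ResultStore` class and the `get_value` method below are hypothetical stand-ins, not the actual ExplainaBoard classes or signatures.

```python
import unittest
from typing import Dict, Optional


class ResultStore:
    """Hypothetical stand-in for an object exposing two lookup styles."""

    def __init__(self, values: Dict[str, float]) -> None:
        self._values = values

    def get_value(self, name: str) -> float:
        # Strict accessor: a missing entry is an error.
        if name not in self._values:
            raise ValueError(f"no value named {name!r}")
        return self._values[name]

    def get_value_or_none(self, name: str) -> Optional[float]:
        # Lenient accessor: a missing entry is reported as None.
        return self._values.get(name)


class ResultStoreTest(unittest.TestCase):
    def test_missing_value(self) -> None:
        store = ResultStore({"accuracy": 0.9})
        # The strict accessor fails when the value is absent...
        with self.assertRaises(ValueError):
            store.get_value("confidence")
        # ...while the lenient one lets the test assert on None instead.
        self.assertIsNone(store.get_value_or_none("confidence"))
```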
Related issue: #417
Current settings of calibration analysis
Example input:
data/system_outputs/absa/absa-example-output-confidence.json
Example output:
data/reports/absa-confidence-report.json
Changes to existing classes
- Add a `skippable` property to the FileLoader, FeatureType and Analysis classes. This is to automatically skip loading the confidence feature or performing bucket analysis on the confidence feature.
- Add `AuxiliaryMetricResult` in Accuracy's `Performance.auxiliary_result`.
- Add `BucketAnalysisResult.auxiliary_performances`, which has type `AuxiliaryAnalysisResult`, to hold `CalibrationAnalysisResult`.
- `CalibrationAnalysisResult` holds the calibration errors ECE and MCE.

Testing
Tested on the tasks text_classification, text_pair_classification, and aspect_based_sentiment_analysis. Using both a text output file and a JSON output file with only predicted_label, the report was generated successfully. Using a JSON output file with predicted_label and confidence, the report was generated successfully with calibration analysis (shown as confidence bucket analysis).