neulab / ExplainaBoard

Interpretable Evaluation for AI Systems

Implement calibration analysis #529

Closed. qjiang002 closed this 1 year ago.

qjiang002 commented 1 year ago

Related issue: #417

Current settings of calibration analysis

Example input: data/system_outputs/absa/absa-example-output-confidence.json

"examples": [
    {
      "predicted_label": "positive",
      "confidence": 0.22101897026820283
    }
    ...
]

Example output: data/reports/absa-confidence-report.json

{
        "name": "confidence",
        "level": "example",
        "bucket_performances": [
          {
            "n_samples": 9,
            "bucket_samples": [
              7,
              11,
              18,
              55,
              62,
              66,
              83,
              91,
              94
            ],
            "performances": {
              "Accuracy": {
                "value": 1.0,
                "confidence_score_low": null,
                "confidence_score_high": null,
                "auxiliary_result": {
                  "confidence": 0.05506609222867546
                }
              }
            },
            "bucket_interval": [
              0.0,
              0.1
            ],
            "bucket_name": null
          },
          ...
        ],
        "cls_name": "BucketAnalysisResult",
        "auxiliary_performances": {
          "expected_calibration_error": 0.4588287500885495,
          "maximum_calibration_error": 0.9449339077713246
        }
      }
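
For context, expected_calibration_error and maximum_calibration_error in auxiliary_performances are the standard ECE/MCE quantities computed over the confidence buckets. Below is a minimal sketch of how they can be derived from per-bucket accuracy and mean confidence, assuming fixed-width buckets; the function and variable names are illustrative, not the code in this PR.

```python
from typing import Sequence


def calibration_errors(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_buckets: int = 10,
) -> tuple[float, float]:
    """Return (expected, maximum) calibration error over fixed-width buckets."""
    n = len(confidences)
    ece, mce = 0.0, 0.0
    for i in range(n_buckets):
        lo, hi = i / n_buckets, (i + 1) / n_buckets
        # Each example falls into exactly one bucket; the last bucket includes 1.0.
        idx = [
            j for j, c in enumerate(confidences)
            if lo <= c < hi or (i == n_buckets - 1 and c == 1.0)
        ]
        if not idx:
            continue
        acc = sum(correct[j] for j in idx) / len(idx)
        avg_conf = sum(confidences[j] for j in idx) / len(idx)
        gap = abs(acc - avg_conf)
        ece += len(idx) / n * gap  # weight the gap by the bucket's share of examples
        mce = max(mce, gap)
    return ece, mce
```

ECE weights each bucket's |accuracy - mean confidence| gap by the fraction of examples in that bucket, while MCE takes the largest gap over all non-empty buckets.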

Changes to existing classes

Testing

Tested on these tasks: text_classification, text_pair_classification, and aspect_based_sentiment_analysis. Using both a text output file and a JSON output file with only predicted_label, the report is generated successfully. Using a JSON output file with both predicted_label and confidence, the report is generated successfully and includes the calibration analysis (shown as a confidence bucket analysis).

qjiang002 commented 1 year ago

In the new commit, CalibrationAnalysis and CalibrationAnalysisResult are created in analyses.py. The default setting divides the [0, 1] interval into 10 buckets. In the classification tasks' default settings, it automatically looks for the confidence feature and performs calibration analysis on it. It also supports custom calibration analysis, as shown in ./data/system_outputs/absa/absa-example-output-custom-calibration-analysis.json.
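
A minimal sketch of the fixed-width bucketing described above (illustrative only; the actual CalibrationAnalysis implementation may differ):

```python
def bucket_examples(
    confidences: list[float], n_buckets: int = 10
) -> dict[tuple[float, float], list[int]]:
    """Group example indices into fixed-width confidence buckets over [0, 1]."""
    buckets: dict[tuple[float, float], list[int]] = {
        (i / n_buckets, (i + 1) / n_buckets): [] for i in range(n_buckets)
    }
    for idx, conf in enumerate(confidences):
        # Clamp so that a confidence of exactly 1.0 lands in the last bucket.
        i = min(int(conf * n_buckets), n_buckets - 1)
        buckets[(i / n_buckets, (i + 1) / n_buckets)].append(idx)
    return buckets
```

The clamping step is what keeps a confidence of exactly 1.0 inside the final (0.9, 1.0] bucket rather than spilling into an eleventh bucket.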

qjiang002 commented 1 year ago

Thank you for all your comments! In the new commit, Metric.evaluate_from_stats takes an additional auxiliary_stats argument, which is used to calculate auxiliary metric results such as confidence. This is needed because the confidence values are not part of the original metric stats calculated in get_overall_statistics; a new MetricStats holding the confidence values is calculated in CalibrationAnalysis.
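
To illustrate the idea, here is a simplified sketch with made-up types, not the actual ExplainaBoard signatures: the main stats drive the metric value, while the auxiliary stats are aggregated separately into an auxiliary_result that sits alongside it (as in the confidence bucket analysis output above).

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Optional


@dataclass
class SimpleMetricResult:
    value: float
    auxiliary_result: dict[str, float] = field(default_factory=dict)


def evaluate_from_stats(
    stats: list[float],                             # e.g. per-example correctness (1.0 / 0.0)
    auxiliary_stats: Optional[list[float]] = None,  # e.g. per-example confidence
) -> SimpleMetricResult:
    """Aggregate the main metric value and, if given, an auxiliary confidence value."""
    result = SimpleMetricResult(value=mean(stats))
    if auxiliary_stats:
        # Confidence is not part of the stats from get_overall_statistics,
        # so it is passed in separately and aggregated on its own.
        result.auxiliary_result["confidence"] = mean(auxiliary_stats)
    return result
```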

odashi commented 1 year ago

@qjiang002 This error can be solved by using get_value_or_none instead (and you need to use assertIsNone)

https://github.com/neulab/ExplainaBoard/actions/runs/3212963391/jobs/5252276938#step:5:234
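
A rough sketch of what that change in the test could look like; apart from get_value_or_none and assertIsNone, the receiver and value names here are assumptions for illustration:

```python
# Sketch only: the strict accessor raises when the value is missing, so switch
# to the lenient accessor and assert the absence explicitly.
value = metric_result.get_value_or_none(ConfidenceInterval, "score_ci")
self.assertIsNone(value)
```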