neulab / ExplainaBoard

Interpretable Evaluation for AI Systems

Implement calibration analysis #529

Closed. qjiang002 closed this 1 year ago.

qjiang002 commented 1 year ago

Related issue: #417

Current settings of calibration analysis

Example input: data/system_outputs/absa/absa-example-output-confidence.json

"examples": [
    {
      "predicted_label": "positive",
      "confidence": 0.22101897026820283
    }
    ...
]

Example output: data/reports/absa-confidence-report.json

{
        "name": "confidence",
        "level": "example",
        "bucket_performances": [
          {
            "n_samples": 9,
            "bucket_samples": [
              7,
              11,
              18,
              55,
              62,
              66,
              83,
              91,
              94
            ],
            "performances": {
              "Accuracy": {
                "value": 1.0,
                "confidence_score_low": null,
                "confidence_score_high": null,
                "auxiliary_result": {
                  "confidence": 0.05506609222867546
                }
              }
            },
            "bucket_interval": [
              0.0,
              0.1
            ],
            "bucket_name": null
          },
          ...
        ],
        "cls_name": "BucketAnalysisResult",
        "auxiliary_performances": {
          "expected_calibration_error": 0.4588287500885495,
          "maximum_calibration_error": 0.9449339077713246
        }
      }
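
For context, expected_calibration_error and maximum_calibration_error in auxiliary_performances are the standard ECE/MCE quantities computed over the confidence buckets. Below is a minimal sketch of how they can be derived from per-bucket accuracy and mean confidence, assuming fixed-width buckets; the function and variable names are illustrative, not the code in this PR.

```python
from typing import Sequence


def calibration_errors(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_buckets: int = 10,
) -> tuple[float, float]:
    """Return (expected, maximum) calibration error over fixed-width buckets."""
    n = len(confidences)
    ece, mce = 0.0, 0.0
    for i in range(n_buckets):
        lo, hi = i / n_buckets, (i + 1) / n_buckets
        # Each example falls into exactly one bucket; the last bucket includes 1.0.
        idx = [
            j for j, c in enumerate(confidences)
            if lo <= c < hi or (i == n_buckets - 1 and c == 1.0)
        ]
        if not idx:
            continue
        acc = sum(correct[j] for j in idx) / len(idx)
        avg_conf = sum(confidences[j] for j in idx) / len(idx)
        gap = abs(acc - avg_conf)
        ece += len(idx) / n * gap  # weight the gap by the bucket's share of examples
        mce = max(mce, gap)
    return ece, mce
```

ECE weights each bucket's |accuracy - mean confidence| gap by the fraction of examples in that bucket, while MCE takes the largest gap over all non-empty buckets.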

Changes to existing classes

Testing

Tested on these tasks: text_classification, text_pair_classification, and aspect_based_sentiment_analysis. Using both a text output file and a JSON output file with only predicted_label, the report is generated successfully. Using a JSON output file with both predicted_label and confidence, the report is generated successfully and includes the calibration analysis (shown as a confidence bucket analysis).

qjiang002 commented 1 year ago

In the new commit, CalibrationAnalysis and CalibrationAnalysisResult are created in analyses.py. The default setting divides the [0, 1] interval into 10 buckets. In the classification tasks' default settings, it automatically looks for the confidence feature and performs calibration analysis on it. It also supports custom calibration analysis, as shown in ./data/system_outputs/absa/absa-example-output-custom-calibration-analysis.json.
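
A minimal sketch of the fixed-width bucketing described above (illustrative only; the actual CalibrationAnalysis implementation may differ):

```python
def bucket_examples(
    confidences: list[float], n_buckets: int = 10
) -> dict[tuple[float, float], list[int]]:
    """Group example indices into fixed-width confidence buckets over [0, 1]."""
    buckets: dict[tuple[float, float], list[int]] = {
        (i / n_buckets, (i + 1) / n_buckets): [] for i in range(n_buckets)
    }
    for idx, conf in enumerate(confidences):
        # Clamp so that a confidence of exactly 1.0 lands in the last bucket.
        i = min(int(conf * n_buckets), n_buckets - 1)
        buckets[(i / n_buckets, (i + 1) / n_buckets)].append(idx)
    return buckets
```

The clamping step is what keeps a confidence of exactly 1.0 inside the final (0.9, 1.0] bucket rather than spilling into an eleventh bucket.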

qjiang002 commented 1 year ago

Thank you for all your comments! In the new commit, Metric.evaluate_from_stats takes an additional auxiliary_stats argument, which is used to calculate auxiliary metric results such as confidence. This is needed because the confidence values are not part of the original metric stats calculated in get_overall_statistics; a new MetricStats holding the confidence values is calculated in CalibrationAnalysis.
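
To illustrate the idea, here is a simplified sketch with made-up types, not the actual ExplainaBoard signatures: the main stats drive the metric value, while the auxiliary stats are aggregated separately into an auxiliary_result that sits alongside it (as in the confidence bucket analysis output above).

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Optional


@dataclass
class SimpleMetricResult:
    value: float
    auxiliary_result: dict[str, float] = field(default_factory=dict)


def evaluate_from_stats(
    stats: list[float],                             # e.g. per-example correctness (1.0 / 0.0)
    auxiliary_stats: Optional[list[float]] = None,  # e.g. per-example confidence
) -> SimpleMetricResult:
    """Aggregate the main metric value and, if given, an auxiliary confidence value."""
    result = SimpleMetricResult(value=mean(stats))
    if auxiliary_stats:
        # Confidence is not part of the stats from get_overall_statistics,
        # so it is passed in separately and aggregated on its own.
        result.auxiliary_result["confidence"] = mean(auxiliary_stats)
    return result
```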

odashi commented 1 year ago

@qjiang002 This error can be solved by using get_value_or_none instead (and you need to use assertIsNone)

https://github.com/neulab/ExplainaBoard/actions/runs/3212963391/jobs/5252276938#step:5:234
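
A rough sketch of what that change in the test could look like; apart from get_value_or_none and assertIsNone, the receiver and value names here are assumptions for illustration:

```python
# Sketch only: the strict accessor raises when the value is missing, so switch
# to the lenient accessor and assert the absence explicitly.
value = metric_result.get_value_or_none(ConfidenceInterval, "score_ci")
self.assertIsNone(value)
```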