Metrics with different analysis levels have the same name

qjiang002 commented 1 year ago

Some tasks may use the same metric on different analysis levels. Although they metric functions at different levels are different, they share the same name, and this leads to the same metric names in the report's overall performance. Such tasks are NER and argument pair extraction (APE).

This may cause problem in sorting the systems using the metric score. When changing list[(name, thing)] to dict[name]=thing, analysis levels should be one level above metric name in issue #491 .

NER default metrics

defaults: dict[str, dict[str, MetricConfig]] = {
    "example": {
        "F1": SeqF1ScoreConfig(
            source_language=source_language,
            target_language=target_language,
            tag_schema="bio",
        )
    },
    "span": {
        "F1": F1ScoreConfig(
            source_language=source_language,
            target_language=target_language,
            ignore_classes=[cls._DEFAULT_TAG],
        )
    },
}

NER analysis report

  "results": {
    "overall": [
      {
        "F1": {
          "value": 0.9221652220060144,
          "confidence_score_low": null,
          "confidence_score_high": null,
          "auxiliary_result": null
        }
      },
      {
        "F1": {
          "value": 0.9221652220060145,
          "confidence_score_low": null,
          "confidence_score_high": null,
          "auxiliary_result": null
        }
      }
    ]
}

APE default metrics

defaults: dict[str, dict[str, MetricConfig]] = {
    'example': {
        "F1": APEF1ScoreConfig(
            source_language=source_language,
            target_language=target_language,
        )
    },
    'block': {
        "F1": F1ScoreConfig(
            source_language=source_language,
            target_language=target_language,
            ignore_classes=[cls._DEFAULT_TAG],
        )
    },
}

APE analysis report

  "results": {
    "overall": [
      {
        "F1": {
          "value": 0.25625192960790366,
          "confidence_score_low": null,
          "confidence_score_high": null,
          "auxiliary_result": null
        }
      },
      {
        "F1": {
          "value": 0.25625192960790366,
          "confidence_score_low": null,
          "confidence_score_high": null,
          "auxiliary_result": null
        }
      }
    ]
}