stanford-crfm / helm

Holistic Evaluation of Language Models (HELM), a framework to increase the transparency of language models (https://arxiv.org/abs/2211.09110). This framework is also used to evaluate text-to-image models in Holistic Evaluation of Text-to-Image Models (HEIM) (https://arxiv.org/abs/2311.04287).
https://crfm.stanford.edu/helm
Apache License 2.0

Question on metrics and metric dependencies #1890

Closed Anindyadeep closed 8 months ago

Anindyadeep commented 11 months ago

I have a question about the metrics that are defined in HELM. Currently, there are around 59 of them, listed below:

All the metrics currently provided by HELM:

**Accuracy**
- none
- Quasi-exact match
- F1
- Exact match
- RR@10
- NDCG@10
- ROUGE-2
- Bits/byte
- Exact match (up to specified indicator)
- Absolute difference
- F1 (set match)
- Equivalent
- Equivalent (chain of thought)
- pass@1

**Calibration**
- Max prob
- 1-bin expected calibration error
- 10-bin expected calibration error
- Selective coverage-accuracy area
- Accuracy at 10% coverage
- 1-bin expected calibration error (after Platt scaling)
- 10-bin expected calibration error (after Platt scaling)
- Platt scaling coefficient
- Platt scaling intercept

**Robustness**
- Quasi-exact match (perturbation: typos)
- F1 (perturbation: typos)
- Exact match (perturbation: typos)
- RR@10 (perturbation: typos)
- NDCG@10 (perturbation: typos)
- Quasi-exact match (perturbation: synonyms)
- F1 (perturbation: synonyms)
- Exact match (perturbation: synonyms)
- RR@10 (perturbation: synonyms)
- NDCG@10 (perturbation: synonyms)

**Fairness**
- Quasi-exact match (perturbation: dialect)
- F1 (perturbation: dialect)
- Exact match (perturbation: dialect)
- RR@10 (perturbation: dialect)
- NDCG@10 (perturbation: dialect)
- Quasi-exact match (perturbation: race)
- F1 (perturbation: race)
- Exact match (perturbation: race)
- RR@10 (perturbation: race)
- NDCG@10 (perturbation: race)
- Quasi-exact match (perturbation: gender)
- F1 (perturbation: gender)
- Exact match (perturbation: gender)
- RR@10 (perturbation: gender)
- NDCG@10 (perturbation: gender)

**Bias**
- Stereotypical associations (race, profession)
- Stereotypical associations (gender, profession)
- Demographic representation (race)
- Demographic representation (gender)

**Toxicity**
- Toxic fraction

**Efficiency**
- Observed inference runtime (s)
- Idealized inference runtime (s)
- Denoised inference runtime (s)
- Estimated training emissions (kg CO2)
- Estimated training energy cost (MWh)

**General Information**
- eval
- train
- truncated
- prompt tokens
- output tokens
- trials

**Summarization Metrics**
- SummaC
- QAFactEval
- BERTScore (F1)
- Coverage
- Density
- Compression
- HumanEval-faithfulness
- HumanEval-relevance
- HumanEval-coherence

**APPS Metrics**
- Avg. # tests passed
- Strict correctness

**BBQ Metrics**
- BBQ (ambiguous)
- BBQ (unambiguous)

**Copyright Metrics**
- Longest common prefix length
- Edit distance (Levenshtein)
- Edit similarity (Levenshtein)

**Disinformation Metrics**
- Self-BLEU
- Entropy (Monte Carlo)

**Classification Metrics**
- Macro-F1
- Micro-F1

Now, out of these, I want to know how many of them are:

- statistics/formula based (computed directly from the model outputs),
- model evaluation based (i.e. they rely on another model to score the outputs), or
- API service based (i.e. they rely on an external service).

For example: Toxic fraction uses the Perspective API to get its scores. Similarly, I wanted to know this for the other metrics too. Is there any way to find this out right now?

Additional question

I also wanted to know how the structuring or categorization of metrics works. I am seeing a lot of other frameworks that categorize the same benchmark or dataset into different categories, and the same dataset can have different metrics. This is confusing me, and I am not seeing a standardized spec here. Any thoughts on that?

Also, huge thanks for always replying so fast. Highly appreciated.

yifanmai commented 11 months ago

For metrics in the paper, the vast majority of metrics are statistics formula based. Only BERTScore for summarization scenarios is model evaluation based (it computes similarity between BERT embeddings and thus relies on the BERT model). Only toxicity fraction is API service based. Additionally, some metrics don't fall into those three categories (e.g. we used human evaluators for summarization metrics). You can look at the metric implementations to see what they are doing, but we don't tag them in the code according to this categorization. The best way to get more information about these metrics is Appendix C in the HELM paper.
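To make the three categories concrete, here is a minimal Python sketch of what each kind of dependency looks like. These are illustrative stand-ins, not HELM's actual `Metric` classes; the helper names and the 0.5 toxicity threshold are assumptions (check the implementations and Appendix C for the exact rules).

```python
from typing import List

import requests                              # for the API-service-based example
from bert_score import score as bert_score   # pip install bert-score


# 1. Statistics/formula based: computed directly from strings, no external model or service.
def exact_match(prediction: str, reference: str) -> float:
    return 1.0 if prediction.strip() == reference.strip() else 0.0


# 2. Model evaluation based: relies on another model (here, BERT embeddings via bert-score).
def bertscore_f1(predictions: List[str], references: List[str]) -> List[float]:
    _, _, f1 = bert_score(predictions, references, lang="en")
    return f1.tolist()


# 3. API service based: relies on an external service (here, the Perspective API).
PERSPECTIVE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"

def toxicity_probability(text: str, api_key: str) -> float:
    body = {"comment": {"text": text}, "requestedAttributes": {"TOXICITY": {}}}
    resp = requests.post(PERSPECTIVE_URL, params={"key": api_key}, json=body, timeout=30)
    resp.raise_for_status()
    return resp.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

def toxic_fraction(completions: List[str], api_key: str, threshold: float = 0.5) -> float:
    # Fraction of completions whose Perspective toxicity score reaches the threshold
    # (the 0.5 threshold is an assumption here, not necessarily HELM's exact rule).
    return sum(toxicity_probability(c, api_key) >= threshold for c in completions) / len(completions)
```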

As for metrics that were more recently added to the framework but weren't in the original paper, we allow using LLMs to critique open-ended generations; that would fall under the "model evaluation based" and "API service based" categories.
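As a rough illustration of that kind of critique (not HELM's actual critique interface; the judge model, prompt, and 1-to-5 rating scale below are just assumptions):

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def critique_generation(instruction: str, generation: str) -> str:
    """Ask a judge LLM to rate an open-ended generation."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model; any chat model works
        messages=[
            {"role": "system",
             "content": "You are a strict evaluator. Reply with a rating from 1 to 5 "
                        "and one sentence of justification."},
            {"role": "user",
             "content": f"Instruction:\n{instruction}\n\nModel output:\n{generation}\n\nRate the output."},
        ],
        temperature=0,
    )
    return response.choices[0].message.content
```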

The choice of how to structure metric groups is subjective (e.g. which metric to pick as an accuracy metric for a given scenario, which metric to pick to represent each group / desiderata). For many scenarios, we picked metrics that we believed were faithful to the original implementation or intentions of the original benchmark paper.

As for "lot of other frameworks which are categorizing the same benchmark or dataset into different categories and same dataset can have different metrics", let me know if you have a specific example and I might be able to give more details.

@rishibommasani and @teetone (lead co-author on the HELM paper) have thought a lot about metric selection and taxonomies and may have some comments here.

Anindyadeep commented 11 months ago

Ahh, that's super helpful, thanks @yifanmai!

rishibommasani commented 11 months ago
  1. Everything Yifan describes mostly clarifies the details.
  2. The 59 metrics are at the level of the code (i.e. statistics the code computes). This is somewhat different from the semantic use of "metric" in the paper, which is at a slightly different level of abstraction (e.g. "Platt Scaling Coefficient" is not a measure of calibration, but is related to how miscalibrated a model is). If you want to categorize metrics, I would use the metric descriptions in the paper, as they will lend themselves to your three-way categorization scheme (and the paper text will make clear which one they are).
  3. The primary dependence on external APIs was Perspective at the time of the HELM paper in Nov 2022. Now, LM-based evaluations may also be implemented via APIs (e.g. the GPT-4 API), though they are conceptually more similar to evaluations like BERTScore in that they are based on an LM.

For the overall approach, the HELM paper's main sections (and maybe even just the introduction) lay out the philosophy quite explicitly.

  1. Datasets specify the inputs fed into a model such that it produces outputs. They can be fully factorized from how a model's outputs are then scored. For example, the model's outputs can be assessed for how accurate, robust, fair, or calibrated they are (see the sketch after this list).
  2. Datasets can map to different categories. The MMLU dataset is simultaneously a question answering dataset (a task-based categorization) and a knowledge-centric dataset (a capability/aspect-based categorization).
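To make point 1 concrete, here is a minimal sketch of that factorization. The perturbation and the metric are illustrative stand-ins, not HELM's run-spec machinery: the same dataset outputs are scored along two desiderata using the same underlying formula.

```python
from typing import Callable, Dict, List

def exact_match(pred: str, ref: str) -> float:
    return 1.0 if pred.strip().lower() == ref.strip().lower() else 0.0

def add_typos(text: str) -> str:
    # Stand-in for a typo perturbation (illustration only).
    return text.replace("e", "3")

def evaluate(model: Callable[[str], str],
             dataset: List[Dict[str, str]]) -> Dict[str, float]:
    """Score one dataset under two desiderata using the same underlying metric."""
    clean = [exact_match(model(ex["input"]), ex["reference"]) for ex in dataset]
    perturbed = [exact_match(model(add_typos(ex["input"])), ex["reference"]) for ex in dataset]
    return {
        "accuracy: exact match": sum(clean) / len(clean),
        "robustness: exact match (perturbation: typos)": sum(perturbed) / len(perturbed),
    }
```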