Closed: Anindyadeep closed this issue 11 months ago.
For metrics in the paper, the vast majority are statistics formula based. Only BERTScore for the summarization scenarios is model evaluation based (it computes similarity between BERT embeddings and thus relies on the BERT model), and only toxic fraction is API service based. Additionally, some metrics don't fall into any of those three categories (e.g. we used human evaluators for some of the summarization metrics). You can look at the metric implementations to see what they are doing, but we don't tag them in the code according to this categorization. The best way to get more information about these metrics is Appendix C of the HELM paper.
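To make the three categories concrete, here is a minimal, hypothetical sketch (not HELM code) of what a metric in each category might look like. It assumes the `bert_score` pip package for the model-based example and a Perspective API key in a `PERSPECTIVE_API_KEY` environment variable for the API-based example; those names are illustrative.

```python
# Hypothetical sketch of the three metric categories; this is NOT HELM's implementation.
import os
import requests

# 1. Statistics-formula based: a pure function of the strings, no model or API needed.
def exact_match(prediction: str, reference: str) -> float:
    return 1.0 if prediction.strip() == reference.strip() else 0.0

# 2. Model-evaluation based: BERTScore compares contextual BERT embeddings,
#    so it has to load and run a BERT model (assumes the `bert_score` package).
def bertscore_f1(predictions: list[str], references: list[str]) -> list[float]:
    from bert_score import score  # pip install bert-score
    _, _, f1 = score(predictions, references, lang="en")
    return f1.tolist()

# 3. API-service based: toxicity is scored by Google's Perspective API over HTTP.
def perspective_toxicity(text: str) -> float:
    url = ("https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"
           f"?key={os.environ['PERSPECTIVE_API_KEY']}")
    body = {"comment": {"text": text}, "requestedAttributes": {"TOXICITY": {}}}
    response = requests.post(url, json=body, timeout=30)
    response.raise_for_status()
    return response.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
```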
As for metrics that were more recently added to the framework but weren't in the original paper, we allow using LLMs to critique open-ended generations; that would fall under the "model evaluation based" and "API service based" categories.
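For intuition, a rough sketch of what LLM-based critique of an open-ended generation could look like, using the OpenAI Python client as a stand-in judge; the model name, rubric, and scoring scale here are assumptions, not HELM's actual critique setup:

```python
# Rough sketch of LLM-as-critic scoring; not HELM's actual implementation.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def critique_score(instruction: str, generation: str) -> str:
    # Ask a judge model to rate the generation on a simple 1-5 helpfulness scale.
    prompt = (
        "Rate the following response to the instruction on a 1-5 scale for "
        "helpfulness, and reply with just the number.\n\n"
        f"Instruction: {instruction}\n\nResponse: {generation}"
    )
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical judge model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content.strip()
```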
The choice of how to structure metric groups is subjective (e.g. which metric to pick as the accuracy metric for a given scenario, or which metric to pick to represent each group / desideratum). For many scenarios, we picked metrics that we believed were faithful to the original implementation or intentions of the original benchmark paper.
As for "lot of other frameworks which are categorizing the same benchmark or dataset into different categories and same dataset can have different metrics", let me know if you have a specific example and I might be able to give more details.
@rishibommasani and @teetone (lead co-author on the HELM paper) have thought a lot about metric selection and taxonomies and may have some comments here.
Ahh, that's super helpful, thanks @yifanmai!
For the overall approach, the HELM paper's main sections (and maybe even just the introduction) lay out the philosophy quite explicitly.
I have a question about the metrics defined in HELM. Currently there are around 59 metrics, listed below:
All the metrics currently provided by HELM
**Accuracy**
- none
- Quasi-exact match
- F1
- Exact match
- RR@10
- NDCG@10
- ROUGE-2
- Bits/byte
- Exact match (up to specified indicator)
- Absolute difference
- F1 (set match)
- Equivalent
- Equivalent (chain of thought)
- pass@1

**Calibration**
- Max prob
- 1-bin expected calibration error
- 10-bin expected calibration error
- Selective coverage-accuracy area
- Accuracy at 10% coverage
- 1-bin expected calibration error (after Platt scaling)
- 10-bin expected calibration error (after Platt scaling)
- Platt scaling coefficient
- Platt scaling intercept

**Robustness**
- Quasi-exact match (perturbation: typos)
- F1 (perturbation: typos)
- Exact match (perturbation: typos)
- RR@10 (perturbation: typos)
- NDCG@10 (perturbation: typos)
- Quasi-exact match (perturbation: synonyms)
- F1 (perturbation: synonyms)
- Exact match (perturbation: synonyms)
- RR@10 (perturbation: synonyms)
- NDCG@10 (perturbation: synonyms)

**Fairness**
- Quasi-exact match (perturbation: dialect)
- F1 (perturbation: dialect)
- Exact match (perturbation: dialect)
- RR@10 (perturbation: dialect)
- NDCG@10 (perturbation: dialect)
- Quasi-exact match (perturbation: race)
- F1 (perturbation: race)
- Exact match (perturbation: race)
- RR@10 (perturbation: race)
- NDCG@10 (perturbation: race)
- Quasi-exact match (perturbation: gender)
- F1 (perturbation: gender)
- Exact match (perturbation: gender)
- RR@10 (perturbation: gender)
- NDCG@10 (perturbation: gender)

**Bias**
- Stereotypical associations (race, profession)
- Stereotypical associations (gender, profession)
- Demographic representation (race)
- Demographic representation (gender)

**Toxicity**
- Toxic fraction

**Efficiency**
- Observed inference runtime (s)
- Idealized inference runtime (s)
- Denoised inference runtime (s)
- Estimated training emissions (kg CO2)
- Estimated training energy cost (MWh)

**General Information**
- eval
- train
- truncated
- prompt tokens
- output tokens
- trials

**Summarization Metrics**
- SummaC
- QAFactEval
- BERTScore (F1)
- Coverage
- Density
- Compression
- HumanEval-faithfulness
- HumanEval-relevance
- HumanEval-coherence

**APPS Metrics**
- Avg. # tests passed
- Strict correctness

**BBQ Metrics**
- BBQ (ambiguous)
- BBQ (unambiguous)

**Copyright Metrics**
- Longest common prefix length
- Edit distance (Levenshtein)
- Edit similarity (Levenshtein)

**Disinformation Metrics**
- Self-BLEU
- Entropy (Monte Carlo)

**Classification Metrics**
- Macro-F1
- Micro-F1

Now out of this, I want to know how many of them are:
- statistics formula based
- model evaluation based
- API service based
For example, Toxic Fraction uses the Perspective API to get its scores. Similarly, I wanted to know this for the other metrics too. Is there any way to find this out right now?
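For reference, here is a minimal sketch of how per-generation Perspective API toxicity scores might be aggregated into a toxic-fraction statistic; the 0.5 threshold and the simple averaging are assumptions rather than HELM's exact definition:

```python
# Hypothetical sketch: turning per-generation Perspective API toxicity scores
# into a "toxic fraction" statistic. The threshold and aggregation are assumed;
# check HELM's toxicity metric implementation for the exact definition.
def toxic_fraction(toxicity_scores: list[float], threshold: float = 0.5) -> float:
    if not toxicity_scores:
        return 0.0
    toxic = sum(1 for score in toxicity_scores if score >= threshold)
    return toxic / len(toxicity_scores)

print(toxic_fraction([0.1, 0.7, 0.3, 0.9]))  # 0.5
```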
Additional question
I also wanted to know how the structuring or categorization of the metrics is done. I am seeing a lot of other frameworks that categorize the same benchmark or dataset into different categories, and the same dataset can end up with different metrics. This is confusing me, and I am not seeing a standardized set of specs here. Any thoughts on that?
Also, a big thanks for always replying so fast. Highly appreciate that.