stanford-crfm / helm

Holistic Evaluation of Language Models (HELM), a framework to increase the transparency of language models (https://arxiv.org/abs/2211.09110). This framework is also used to evaluate text-to-image models in HEIM (https://arxiv.org/abs/2311.04287) and vision-language models in VHELM (https://arxiv.org/abs/2410.07112).
https://crfm.stanford.edu/helm
Apache License 2.0

compute basic aggregation so that we can rank models #966

Closed: percyliang closed this 1 year ago

percyliang commented 2 years ago

Proposal: for each model, scenario group, and metric, compute the ranking of that model, R(model, scenario group, metric). We then define the quality of a model to be some weighted combination over scenario groups (probably a simple average) and metrics (some weighted average that we tune).

Need to compute this in summarize.py and show it in the frontend somehow.

Important note: we need to exclude all contaminated entries.
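A minimal sketch of what that computation could look like (the function names and data shapes below are hypothetical, not the actual summarize.py API, and it assumes higher scores are better for every metric):

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

def rank_models(
    scores: Dict[Tuple[str, str, str], float],   # (model, scenario group, metric) -> score
    contaminated: Set[Tuple[str, str]],          # (model, scenario group) pairs to exclude
) -> Dict[Tuple[str, str, str], int]:
    """Compute R(model, scenario group, metric); rank 1 = best. Assumes higher scores are better."""
    by_cell: Dict[Tuple[str, str], List[Tuple[str, float]]] = defaultdict(list)
    for (model, group, metric), score in scores.items():
        if (model, group) in contaminated:
            continue  # exclude contaminated entries
        by_cell[(group, metric)].append((model, score))

    ranks: Dict[Tuple[str, str, str], int] = {}
    for (group, metric), entries in by_cell.items():
        for rank, (model, _) in enumerate(sorted(entries, key=lambda e: -e[1]), start=1):
            ranks[(model, group, metric)] = rank
    return ranks

def model_quality(
    ranks: Dict[Tuple[str, str, str], int],
    metric_weights: Dict[str, float],            # tunable weights over metrics
) -> Dict[str, float]:
    """Weighted average rank over scenario groups and metrics (lower = better)."""
    totals: Dict[str, float] = defaultdict(float)
    weight_sums: Dict[str, float] = defaultdict(float)
    for (model, _group, metric), rank in ranks.items():
        w = metric_weights.get(metric, 1.0)
        totals[model] += w * rank
        weight_sums[model] += w
    return {model: totals[model] / weight_sums[model] for model in totals}
```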

rishibommasani commented 2 years ago

Below is my proposal for how to aggregate; it is reasonable enough, but even so I am not sold on it.

As I raised on Slack, my overall opinion is to not do this until we get it right, beyond the simple scores -> ranking for a fixed (scenario, metric) pair, which is fully unambiguous. The aggregate will explicitly simplify hill-climbing and hence be optimized against, and it is clear it will be the default thing people look at. For all these reasons I want it to be the right thing: I do not want the benchmark to contribute to people over-optimizing for the wrong thing and thereby degrading its reputation (i.e., standard Goodhart's law).

(Attached image 20220908_203301: sketch of the two aggregation routes discussed in the next comment.)

rishibommasani commented 2 years ago

The proposal itself considers two routes. Both are implicitly about providing probability distributions over the space of scenarios x metrics, with support, by construction, only on (s, m) pairs that exist (e.g., no mass on (XSUM, robustness)).

The left proposal uses the distribution to aggregate in the space of scores, whereas the right proposal uses the distribution to aggregate in the space of rankings. The right seems clearly better. But the remainder principally involves:

  1. Weighting (scenario, metric) pairs in this distribution: the weights I propose are not unreasonable, but I am not convinced they are the best default. One could even question how well-posed doing this from first principles is, as opposed to either some kind of preference inference in the aesthetic of the alignment folks or voting in the aesthetic of the politics/democracy folks.
  2. Mapping a ranking to a vector of scores (i.e., $\pi \in S_{\text{num\_models}} \to v \in \mathbb{R}^{\text{num\_models}}$, where $S_n$ is the set of permutations, i.e. the symmetric group), so that per-scenario, per-metric rankings can be aggregated; a hedged sketch of one such mapping follows this list. This choice will have a very clear and strong effect; I am especially not convinced by my proposal here, though this is a standard problem and there are probably good known defaults in various literatures/communities.
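A sketch of the right-hand route, using a Borda-style mapping from rankings to scores purely as one possible default (all names and the choice of mapping are illustrative, not part of the proposal itself):

```python
from typing import Dict, List, Tuple

def borda_scores(ranking: List[str]) -> Dict[str, float]:
    """One possible pi -> v mapping: Borda count (the best model gets num_models points)."""
    n = len(ranking)
    return {model: float(n - i) for i, model in enumerate(ranking)}

def aggregate_rankings(
    rankings: Dict[Tuple[str, str], List[str]],   # (scenario, metric) -> ranking, best model first
    weights: Dict[Tuple[str, str], float],        # distribution over (scenario, metric) pairs
) -> List[str]:
    """Aggregate per-(scenario, metric) rankings into one overall ranking."""
    totals: Dict[str, float] = {}
    for key, ranking in rankings.items():
        w = weights.get(key, 0.0)                 # no mass on (s, m) pairs that do not exist
        for model, score in borda_scores(ranking).items():
            totals[model] = totals.get(model, 0.0) + w * score
    return sorted(totals, key=lambda m: totals[m], reverse=True)
```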
percyliang commented 2 years ago

I haven't processed the proposals yet, but is it possible to get a strawperson in place so that we at least have something? Making it better would then use the same infrastructure, just tweaking the weights.

percyliang commented 1 year ago

We are clearly computing something right now for the paper (@dtsip ) - could we surface that on the website too?

dtsip commented 1 year ago

For the paper, we are ranking the models on each generic task and then aggregating this into a single ranking. I can definitely add this piece of logic and use it to sort the tables. Not sure if we have a more complex ranking in mind.
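A rough sketch of that average-rank aggregation (illustrative only, not the actual paper code):

```python
from typing import Dict, List

def sort_by_average_rank(per_task_rankings: List[List[str]]) -> List[str]:
    """Rank models within each task (best first), then sort models by their mean rank."""
    rank_sums: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for ranking in per_task_rankings:
        for rank, model in enumerate(ranking, start=1):
            rank_sums[model] = rank_sums.get(model, 0.0) + rank
            counts[model] = counts.get(model, 0) + 1
    return sorted(rank_sums, key=lambda m: rank_sums[m] / counts[m])
```

For example, `sort_by_average_rank([["A", "B"], ["B", "A"], ["A", "B"]])` returns `["A", "B"]`.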

rishibommasani commented 1 year ago

@percyliang Currently we are not aggregating across metrics, and I think what we do across scenarios (average rank) is a reasonable baseline. I am going to hand off this issue, on this narrower aggregation, to Dimitris; I think the release will not involve any further aggregation (and the paper already discusses this extensively in the future-work section).

percyliang commented 1 year ago

That's fine for now. I think whatever we do for the paper should be reflected on the website.

percyliang commented 1 year ago

What's the status of getting the paper results onto the website?

dtsip commented 1 year ago

Following up on the discussion on Slack, what we mainly want here is adding the various ranks to models.json by adding a stats field to each model and serializing some instance of

from dataclasses import dataclass
from typing import Dict

@dataclass
class ModelStats:
    model: str                 # model name
    costs: Dict[str, int]      # cost statistics (e.g., token counts)
    ranks: Dict[str, float]    # rank per scenario group
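A minimal sketch of serializing such an instance into models.json, building on the ModelStats dataclass above (the field contents and file name are illustrative):

```python
import json
from dataclasses import asdict

# Hypothetical example values; real entries would come from summarize.py.
stats = ModelStats(
    model="openai/davinci",
    costs={"total_tokens": 123456},
    ranks={"question_answering": 3.0, "summarization": 5.0},
)

with open("models.json", "w") as f:
    json.dump([asdict(stats)], f, indent=2)
```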
dtsip commented 1 year ago

Computed the ranks in https://github.com/stanford-crfm/helm/pull/1240. If we are happy with them, I can propagate them to some global ModelStats.

percyliang commented 1 year ago

Since we have the average win rate table, this is no longer urgent.
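For context, a mean win rate of that kind can be computed roughly as the fraction of head-to-head comparisons a model wins, averaged over (scenario, metric) cells; this is an illustrative sketch, not the actual summarize.py implementation:

```python
from typing import Dict, List

def mean_win_rate(cells: List[Dict[str, float]]) -> Dict[str, float]:
    """Average, over cells, of the fraction of other models each model beats.

    Each cell maps model -> score; assumes a higher score is better in every cell.
    """
    fractions: Dict[str, List[float]] = {}
    for cell in cells:
        models = list(cell)
        for m in models:
            others = [o for o in models if o != m]
            if not others:
                continue
            wins = sum(cell[m] > cell[o] for o in others)
            fractions.setdefault(m, []).append(wins / len(others))
    return {m: sum(fs) / len(fs) for m, fs in fractions.items()}
```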

dtsip commented 1 year ago

After chatting with Percy, it seems that we are happy with how rankings are currently displayed on the website and we don't want to create additional ranking views.