Closed percyliang closed 1 year ago
Below is my proposal for how to aggregate; it is reasonable enough, but even so I am not sold on it.
As I raised on Slack, my overall opinion is to not do this until we get it right. The aggregate will explicitly simplify hill-climbing and hence be optimized against, and it's clear it will be the default thing people look at. For all these reasons I would want it to be the right thing, since I do not want the benchmark to contribute to people over-optimizing for the wrong thing and thereby degrading the reputation of the benchmark (i.e. standard Goodhart's law). The exception is the simple scores -> ranking for a fixed (scenario, metric) pair, which is fully unambiguous.
The proposal itself considers two routes:
Both are implicitly about providing probability distributions over the space of scenarios x metrics, which have support only on (s, m) pairs that exist (e.g. no mass on (XSUM, robustness)) by construction.
The left proposal uses the distribution to aggregate in the space of scores, whereas the right proposal uses the distribution to aggregate in the space of rankings. The right seems clearly better. But the remainder principally involves choosing a ranking-to-score map (pi \in S^{num_models} -> v \in \mathbb{R}^{num_models}), such that per-scenario, per-metric rankings can be aggregated (where S is the set of permutations, i.e. the symmetric group). This will have a very clear and strong effect; I am especially not convinced by my proposal here, though this is a standard problem and there are probably good known defaults in various literatures/communities. I haven't processed the proposals yet, but is it possible to get a strawperson in place so we at least have something? Making it better will then use the same infrastructure, just tweaking the weights.
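To put at least a strawperson on record, here is a minimal sketch of one such ranking-to-score map: a Borda-style count, weighted by a distribution with support only on (scenario, metric) pairs that exist. All names, ranks, and weights below are illustrative placeholders, not decided values or HELM code:

```python
# Strawperson sketch: aggregate per-(scenario, metric) rankings with a
# Borda-style score map. rankings[(scenario, metric)][model] = rank (1 = best);
# only valid pairs appear, so the weight distribution has support on existing
# (s, m) pairs by construction. All values are made up for illustration.
rankings = {
    ("mmlu", "accuracy"): {"a": 1, "b": 2, "c": 3},
    ("mmlu", "robustness"): {"a": 2, "b": 1, "c": 3},
    ("xsum", "accuracy"): {"a": 3, "b": 1, "c": 2},
}
# Uniform distribution over existing pairs; the thing we would later tune.
weights = {pair: 1 / len(rankings) for pair in rankings}

num_models = 3

def borda(rank: int) -> float:
    """Map a rank to a score: rank 1 -> num_models - 1, last rank -> 0."""
    return float(num_models - rank)

def aggregate(rankings, weights):
    """Weighted average of Borda scores across (scenario, metric) pairs."""
    models = next(iter(rankings.values())).keys()
    return {
        m: sum(w * borda(rankings[pair][m]) for pair, w in weights.items())
        for m in models
    }

print(aggregate(rankings, weights))
```

Tweaking the distribution (e.g. up-weighting certain metrics) then changes the final ordering without changing any of this infrastructure.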
We are clearly computing something right now for the paper (@dtsip ) - could we surface that on the website too?
For the paper, we are ranking the models on each generic task and then aggregating this into a single ranking. I can definitely add this piece of logic and use it to sort the tables. Not sure if we have a more complex ranking in mind.
@percyliang Currently we are not aggregating across metrics, and I think what we do across scenarios (average rank) is a reasonable baseline. I am going to hand this issue on this narrower aggregation to Dimitris, and I think the release will not involve any further aggregation (the paper already discusses this extensively in the future-work section).
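For concreteness, the average-rank baseline across scenarios can be sketched as follows; the scenario names and scores are made up for illustration, not HELM results:

```python
from statistics import mean

# Hypothetical per-scenario scores (higher is better); illustrative only.
scores = {
    "model_a": {"mmlu": 0.7, "boolq": 0.8},
    "model_b": {"mmlu": 0.6, "boolq": 0.9},
    "model_c": {"mmlu": 0.5, "boolq": 0.7},
}

def rank_within_scenario(scores, scenario):
    """Rank models on one scenario: best score gets rank 1."""
    ordered = sorted(scores, key=lambda m: scores[m][scenario], reverse=True)
    return {model: rank for rank, model in enumerate(ordered, start=1)}

def mean_rank(scores):
    """Average each model's rank across all scenarios (lower is better)."""
    scenarios = next(iter(scores.values())).keys()
    per_scenario = [rank_within_scenario(scores, s) for s in scenarios]
    return {m: mean(r[m] for r in per_scenario) for m in scores}

print(mean_rank(scores))
```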
That's fine for now. I think whatever we do for the paper should be reflected on the website.
What's the status of getting the paper results onto the website?
Following up on the discussion on Slack, what we mainly want here is adding the various ranks to models.json by adding a stats field to each model and serializing some instance of:
```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class ModelStats:
    model: str
    costs: Dict[str, int]
    ranks: Dict[str, float]
```
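For reference, a minimal self-contained sketch (repeating the dataclass) of how such an instance might be serialized into models.json; the field values are made up:

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict

@dataclass
class ModelStats:
    model: str
    costs: Dict[str, int]
    ranks: Dict[str, float]

# Hypothetical contents, purely for illustration.
stats = ModelStats(
    model="example/model",
    costs={"num_tokens": 123},
    ranks={"accuracy": 2.0},
)

# dataclasses.asdict recurses into the nested dicts, so the result
# is directly JSON-serializable.
print(json.dumps(asdict(stats)))
```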
Computed the ranks in https://github.com/stanford-crfm/helm/pull/1240. If we are happy with it, I can propagate them to some global ModelStats.
Since we have the average win rate table, this is no longer urgent.
After chatting with Percy, it seems that we are happy with how rankings are currently displayed on the website and we don't want to create additional ranking views.
Proposal: for each model, scenario group, and metric, compute the ranking of that model, R(model, scenario group, metric). We then define the quality of a model to be some weighted combination over scenario groups (probably a simple average) and metrics (some weighted average that we tune).
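A hedged sketch of that proposal; the ranks, scenario groups, and metric weights below are placeholders to be tuned, not decided values:

```python
from statistics import mean

# Hypothetical ranks R(model, scenario_group, metric); 1 = best.
R = {
    "model_a": {"qa": {"accuracy": 1.0, "calibration": 2.0},
                "summarization": {"accuracy": 2.0, "calibration": 1.0}},
    "model_b": {"qa": {"accuracy": 2.0, "calibration": 1.0},
                "summarization": {"accuracy": 1.0, "calibration": 2.0}},
}
# Metric weights to be tuned; scenario groups get a simple average.
metric_weights = {"accuracy": 0.7, "calibration": 0.3}

def quality(model: str) -> float:
    """Weighted average over metrics, then simple average over scenario groups.
    Lower is better, since these are ranks."""
    per_group = [
        sum(w * R[model][group][metric] for metric, w in metric_weights.items())
        for group in R[model]
    ]
    return mean(per_group)

print({m: quality(m) for m in R})
```

Note that contaminated (model, scenario) entries would need to be dropped from R before this computation.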
Need to compute this in summarize.py and show it in the frontend somehow. Important note: we need to exclude all contaminated entries.