Mini-leaderboards show worst models first when lower_is_better: true

stanford-crfm / helm

Holistic Evaluation of Language Models (HELM), a framework to increase the transparency of language models (https://arxiv.org/abs/2211.09110). This framework is also used to evaluate text-to-image models in Holistic Evaluation of Text-to-Image Models (HEIM) (https://arxiv.org/abs/2311.04287).

https://crfm.stanford.edu/helm

Apache License 2.0

1.77k stars, 235 forks

Mini-leaderboards show worst models first when lower_is_better: true #2711

Closed: yifanmai closed this issue 1 month ago

yifanmai commented 1 month ago

For metrics with lower_is_better: true, the mini-leaderboard shows the worst models (those with the highest values) rather than the best models.
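To illustrate the expected behavior, here is a minimal sketch of a ranking that respects the flag. It is not HELM's actual code: the function name rank_models, the top_k parameter, and the sample model names and scores are all hypothetical. Only the lower_is_better flag comes from this issue.

```python
from typing import List, Tuple


def rank_models(
    scores: List[Tuple[str, float]],
    lower_is_better: bool = False,
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the top_k (model, score) pairs, best model first.

    Hypothetical helper for illustration; not part of HELM.
    """
    # Higher-is-better metrics sort descending; lower-is-better metrics
    # sort ascending, so the smallest (best) value comes first.
    ranked = sorted(scores, key=lambda pair: pair[1], reverse=not lower_is_better)
    return ranked[:top_k]


# Made-up scores for a metric where a lower value is better.
scores = [("model-a", 0.12), ("model-b", 0.45), ("model-c", 0.30)]
best_first = rank_models(scores, lower_is_better=True, top_k=2)
print(best_first)
```

Sorting in descending order regardless of the flag would put model-b, the worst model under this metric, at the top, which matches the behavior reported here.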