stanford-crfm / helm

Holistic Evaluation of Language Models (HELM), a framework to increase the transparency of language models (https://arxiv.org/abs/2211.09110). This framework is also used to evaluate text-to-image models in Holistic Evaluation of Text-to-Image Models (HEIM) (https://arxiv.org/abs/2311.04287).
https://crfm.stanford.edu/helm
Apache License 2.0

[enhancement] can we make this set of leaderboard into a control panel? #2041

Closed zhimin-z closed 2 months ago

zhimin-z commented 10 months ago

I found that synthetic efficiency has 60 sub-leaderboards based on `num_prompt_tokens`, `num_instances`, and `tokenizer`. For NLP practitioners, it is very hard to locate the needed leaderboard at first glance. Could you turn this table into a control panel so that they can retrieve the needed information quickly?

Alternatively, you could provide a way to retrieve all the leaderboards programmatically, so practitioners can check whichever ones they are interested in: https://github.com/stanford-crfm/helm/issues/2026
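Such a programmatic lookup could be sketched like this (the entry shape, attribute names, and `find_leaderboards` helper are illustrative assumptions, not HELM's actual schema or API):

```python
# Illustrative sketch: filter sub-leaderboard entries by adapter attributes.
# The entry dicts below are hypothetical stand-ins for whatever a real
# leaderboard index would expose; they are not HELM's actual schema.
LEADERBOARDS = [
    {"num_prompt_tokens": 512, "num_instances": 10, "tokenizer": "huggingface/gpt2"},
    {"num_prompt_tokens": 512, "num_instances": 10, "tokenizer": "ai21/j1"},
    {"num_prompt_tokens": 1024, "num_instances": 10, "tokenizer": "huggingface/gpt2"},
]

def find_leaderboards(entries, **attrs):
    """Return the entries whose attributes match every given key=value pair."""
    return [e for e in entries if all(e.get(k) == v for k, v in attrs.items())]

# Retrieve only the sub-leaderboards for a given tokenizer.
matches = find_leaderboards(LEADERBOARDS, tokenizer="huggingface/gpt2")
print(len(matches))  # -> 2
```

With an index like this, a control panel would reduce to a set of dropdowns that feed `find_leaderboards` with the selected attribute values.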

JosselinSomervilleRoberts commented 10 months ago

cc @farzaank

yifanmai commented 8 months ago

This is a quirk in `summarize.py`: we tried to include only the relevant adapter attributes, but it doesn't actually know which attributes are relevant and has to heuristically guess.
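One plausible heuristic (an illustrative sketch only, not the actual `summarize.py` logic) is to treat an attribute as relevant only if its value varies across the sub-leaderboards, since a constant attribute adds no distinguishing information:

```python
def varying_attributes(entries):
    """Return the attribute names whose values differ across entries.

    Illustrative sketch: an attribute that is constant everywhere
    (e.g. num_instances=10 in every entry) carries no distinguishing
    information and could be omitted from sub-leaderboard names.
    """
    if not entries:
        return []
    keys = set().union(*(e.keys() for e in entries))
    return sorted(k for k in keys if len({e.get(k) for e in entries}) > 1)

entries = [
    {"num_prompt_tokens": 512, "num_instances": 10, "tokenizer": "huggingface/gpt2"},
    {"num_prompt_tokens": 1024, "num_instances": 10, "tokenizer": "ai21/j1"},
]
print(varying_attributes(entries))  # -> ['num_prompt_tokens', 'tokenizer']
```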

If you'd like a more user-friendly UI, you could look at the v0.3.0 version of this page on the old frontend.

zhimin-z commented 8 months ago

> This is a quirk in `summarize.py`: we tried to include only the relevant adapter attributes, but it doesn't actually know which attributes are relevant and has to heuristically guess.
>
> If you'd like a more user-friendly UI, you could look at the v0.3.0 version of this page on the old frontend.

Thanks, but since `num_instances=10` is the default setting for all 60 leaderboards, why include it in the name of each leaderboard at all?

yifanmai commented 2 months ago

Classic is archived; won't fix.