After releasing GPT-4, OpenAI was met with a significant challenge: there weren't many benchmarks for LLMs focused on emergent capabilities like translation, reasoning, pattern identification, etc. So they created Evals, a crowdsourced open-source set of benchmarks for LLMs. While somewhat OpenAI-centric, since the submission rules prohibit adding tests that GPT-4 can already consistently pass, it remains a valuable tool for objective model evaluation.
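To make "a set of benchmarks" concrete: each eval in the repo is driven by a JSONL file of samples with chat-style `input` messages and an `ideal` answer, as described in the Evals documentation. Here is a minimal sketch in Python that writes such a file; the questions and the file name are just illustrative placeholders.

```python
import json

# Two toy samples in the format the basic "match" evals consume:
# chat-style "input" messages plus an "ideal" reference answer.
samples = [
    {
        "input": [
            {"role": "system", "content": "Answer with a single word."},
            {"role": "user", "content": "What is the capital of France?"},
        ],
        "ideal": "Paris",
    },
    {
        "input": [
            {"role": "system", "content": "Answer with a single number."},
            {"role": "user", "content": "How many legs does a spider have?"},
        ],
        "ideal": "8",
    },
]

# Write one JSON object per line (JSONL); the path is a placeholder.
with open("my_eval_samples.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")
```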
If different open-access LLM projects can switch to a well-designed common benchmark, we may finally get to objectively compare model quality, which I find essential for the future of local LLMs. For example, we could compare our model against WizardLM, raw Vicuna, or GPT-3.5.
For reference on testing non-OpenAI models with Evals, see the OpenAssistant model evals.
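The general mechanism for plugging in a non-OpenAI model is a completion function: an object Evals can call with a prompt and get a result exposing the generated text back. Below is a hedged, self-contained sketch of what such a wrapper around a local model might look like; `run_local_model` is a hypothetical stand-in for your model's actual inference call, and the exact interface should be checked against the Evals completion-fn docs.

```python
from typing import Any, Union


def run_local_model(prompt_text: str) -> str:
    """Hypothetical placeholder for a local model's inference call."""
    return "model output for: " + prompt_text


class LocalCompletionResult:
    """Holds generated text; Evals reads answers via get_completions()."""

    def __init__(self, text: str) -> None:
        self.text = text

    def get_completions(self) -> list[str]:
        return [self.text]


class LocalCompletionFn:
    """Callable wrapper that turns an Evals prompt into a local-model call."""

    def __call__(
        self,
        prompt: Union[str, list[dict[str, str]]],
        **kwargs: Any,
    ) -> LocalCompletionResult:
        # Chat-style prompts arrive as a list of {"role": ..., "content": ...}
        # messages; flatten them into plain text for the local model.
        if isinstance(prompt, list):
            prompt = "\n".join(m.get("content", "") for m in prompt)
        return LocalCompletionResult(run_local_model(prompt))
```

Registering a wrapper like this with the Evals registry (rather than calling it by hand) is what lets the same eval definitions run unchanged against OpenAI and non-OpenAI models alike.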