melodysdreamj / WizardVicunaLM

LLM that combines the principles of wizardLM and vicunaLM

Evaluate the model with OpenAI Evals #12

Open · walking-octopus opened this issue 1 year ago

walking-octopus commented 1 year ago

After releasing GPT-4, OpenAI faced a significant challenge: there weren't many benchmarks for LLMs focused on emergent capabilities like translation, reasoning, and pattern identification. So they created Evals, a crowdsourced, open-source set of benchmarks for LLMs. While somewhat OpenAI-centric, since the submission rules prohibit adding tests that GPT-4 can already consistently pass, it remains a valuable tool for objective model evaluation.

If different open-access LLM projects can switch to a well-designed common benchmark, we may finally be able to compare model quality objectively, which I find essential for the future of local LLMs. For example, we could compare this model against WizardLM, raw Vicuna, or GPT-3.5.

For reference on testing non-OpenAI models with Evals, see the OpenAssistant model evals.
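
To make this concrete, here is a minimal sketch of how a custom completion function for Evals could wrap a locally served WizardVicuna model. It follows the `CompletionFn` protocol from openai/evals (which may differ slightly between versions); the local endpoint URL, model name, and payload shape are assumptions, here modeled on an OpenAI-compatible local server such as FastChat:

```python
# Hypothetical completion function letting `oaieval` query a locally hosted
# WizardVicuna model instead of the OpenAI API. The endpoint and payload are
# assumptions; adapt them to whatever server actually hosts the model.
import requests


class WizardVicunaCompletionResult:
    def __init__(self, text: str):
        self.text = text

    def get_completions(self) -> list[str]:
        # Evals expects a list of completion strings.
        return [self.text]


class WizardVicunaCompletionFn:
    def __init__(self, api_base: str = "http://localhost:8000/v1", **kwargs):
        # Assumes an OpenAI-compatible local server (e.g. FastChat) is running.
        self.api_base = api_base

    def __call__(self, prompt, **kwargs) -> WizardVicunaCompletionResult:
        # Evals may pass a raw string or a chat-style list of messages.
        if isinstance(prompt, list):
            prompt = "\n".join(m.get("content", "") for m in prompt)

        resp = requests.post(
            f"{self.api_base}/completions",
            json={
                "model": "wizard-vicuna-13b",  # assumed model name on the local server
                "prompt": prompt,
                "max_tokens": 512,
            },
            timeout=120,
        )
        resp.raise_for_status()
        text = resp.json()["choices"][0]["text"]
        return WizardVicunaCompletionResult(text)
```

If I understand the registry correctly, this class would then be referenced from a YAML entry under `evals/registry/completion_fns/` and run with something like `oaieval wizard-vicuna test-match`, the same way the OpenAssistant evals plug in their own models.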