openai / evals

Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.

Updates on existing solvers and bugged tool eval #1506

Closed: ojaffe closed this 6 months ago

ojaffe commented 6 months ago

@JunShern will review this

Wrap solvers in completion functions for compatibility with pre-solver evals. This means every eval can now be executed with solvers. 49fd9ef
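For context, a minimal sketch of what such an adapter can look like. The class and method names here (`SolverCompletionFnAdapter`, `TaskState`, `SolverCompletionResult`) are illustrative approximations of the evals interfaces, not the exact code from commit 49fd9ef.

```python
# Illustrative sketch only: names approximate the evals Solver / CompletionFn
# interfaces and are not copied from the commit.
from dataclasses import dataclass, field


@dataclass
class Message:
    role: str
    content: str


@dataclass
class TaskState:
    task_description: str
    messages: list = field(default_factory=list)


class SolverCompletionResult:
    """Minimal CompletionResult-style wrapper around a solver's output."""

    def __init__(self, text: str):
        self._text = text

    def get_completions(self) -> list:
        return [self._text]


class SolverCompletionFnAdapter:
    """Expose a solver through the CompletionFn call signature so that
    evals written before the solver API can still be executed."""

    def __init__(self, solver):
        self.solver = solver

    def __call__(self, prompt, **kwargs) -> SolverCompletionResult:
        # Pre-solver evals pass either a raw string or a list of chat
        # messages; convert both into the TaskState that solvers expect.
        if isinstance(prompt, str):
            messages = [Message(role="user", content=prompt)]
        else:
            messages = [Message(role=m["role"], content=m["content"]) for m in prompt]
        task_state = TaskState(task_description="", messages=messages)
        result = self.solver(task_state)  # assumed to return a result with .output
        return SolverCompletionResult(result.output)
```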

Add context length information for gpt-4-turbo-preview and gpt-4-0125-preview. 9a0ab1c
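As a rough illustration, the new entries amount to a model-name-to-context-length mapping like the one below. The dictionary name is hypothetical, but both preview models do have a 128k-token context window.

```python
# Hypothetical excerpt of a model-name -> context-length mapping;
# the real table lives in the evals registry.
MODEL_N_CTX = {
    "gpt-4-turbo-preview": 128_000,
    "gpt-4-0125-preview": 128_000,
}
```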

Move the oai and together solvers into a providers/ subdirectory. 063bf4f

Update the default task descriptions for bugged tools. We added more detail for Gemini and open-source models, which were confused by the original descriptions. 0523dd4

Modified the default solver chain-of-thought prompt, as well as other custom chain-of-thought prompts used in some evals. The default CoTSolver prompts were misleading in some cases; we observed GeminiSolver trying hard to produce a final answer for the whole eval when it should only produce a response for the next turn. 287f3cf
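The gist of the prompt change can be sketched as follows; the constant names and wording are illustrative, not the exact strings from commit 287f3cf.

```python
# Illustrative sketch of the intent behind the prompt change; the actual
# templates in evals differ in wording.

# Old-style instruction: easy to read as "produce the final answer to the task".
COT_TEMPLATE_BEFORE = (
    "Reason step by step, then give your final answer."
)

# Clarified instruction: the solver should only produce its reply for the
# current turn of the conversation, not try to finish the whole eval.
COT_TEMPLATE_AFTER = (
    "Reason step by step about how to respond to the last message, "
    "then give your response for this turn only."
)
```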