openai / evals

Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.

Address sporadic hanging of evals on certain samples #1482

Closed thesofakillers closed 3 months ago

thesofakillers commented 3 months ago

As has been brought up before (#1384, #1292, https://github.com/openai/evals/pull/270), evals suffer from a hanging issue, where an evaluation run will hang for a very long time (if not indefinitely) near the end of a run (say, on the 99th sample out of 100).

This PR addresses the issue by removing a seemingly redundant single-use thread that was created for every request, nested inside the already multi-threaded eval loop. My impression is that this nested multithreading introduced enough overhead to cause the hanging that users experienced.
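For context, the pattern in question looks roughly like the sketch below. The names (`flaky_request`, `request_with_timeout_old`, `request_with_timeout_new`) are illustrative stand-ins, not the exact evals code: each request spawned its own thread just to enforce a timeout, even though the surrounding eval loop already runs samples in a thread pool. The change is essentially to drop the per-request thread and call the client directly.

```python
# Illustrative sketch only -- hypothetical names, not the actual evals source.
import concurrent.futures
import threading


def flaky_request(sample):
    # Stand-in for an API call that occasionally stalls.
    return f"completion for {sample}"


def request_with_timeout_old(sample, timeout=40):
    # Old pattern: wrap every request in its own single-use thread purely to
    # enforce a timeout, even though we are already inside a worker thread
    # of the eval loop.
    result = {}

    def target():
        result["value"] = flaky_request(sample)

    thread = threading.Thread(target=target)
    thread.start()
    thread.join(timeout)
    if thread.is_alive():
        raise TimeoutError(f"request for {sample} timed out")
    return result["value"]


def request_with_timeout_new(sample, timeout=40):
    # New pattern: call the client directly and rely on its own timeout /
    # retry handling -- no extra thread per request.
    return flaky_request(sample)


if __name__ == "__main__":
    samples = [f"sample-{i}" for i in range(10)]
    # The eval loop itself is already multi-threaded.
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        print(list(pool.map(request_with_timeout_new, samples)))
```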

I had also noticed this hanging issue in EVALS_SEQUENTIAL=1 mode, where it no longer occurs at the end of the run but instead at random points in the middle.

I was able to identify the source of the issue through debugging print statements, which ultimately pointed to the request_with_timeout function as the culprit.
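For anyone who wants to reproduce this kind of diagnosis, here is a small sketch of one way to see where a run is stuck. The debugging for this PR was done with plain print statements; the stdlib `faulthandler` module shown here is just a convenient alternative, not what the PR itself uses.

```python
# Periodically dump every thread's stack to stderr; if the run hangs, the
# repeated dumps show which frame (e.g. request_with_timeout) never returns.
import faulthandler
import sys

faulthandler.dump_traceback_later(120, repeat=True, file=sys.stderr)

# ... run the eval loop here ...

faulthandler.cancel_dump_traceback_later()
```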

We have tested the new request_with_timeout code on a fork, running multiple new and pre-existing evals (including with third-party solvers), and found no change in behaviour or errors, and a clear improvement on the hanging issue.