openai/evals

Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.

Eval-running often hangs on last sample #1384

Open sjadler2004 opened 11 months ago

sjadler2004 commented 11 months ago

Describe the bug

Relatively often, my eval run will reach, say, sample 199/200 and then hang on the last one for a very long time. It isn't clear to me why this occurs, but the hang can persist for an hour or more, at which point I generally terminate the command from my CLI and try again.

To Reproduce

Unfortunately, I'm not sure how to reproduce this reliably. It does seem more likely to happen on bigger sampling runs than on small ones, though.

Code snippets

No response

OS

macOS

Python version

Python v3.11

Library version

latest

sjadler2004 commented 11 months ago

Strangely, even after a KeyboardInterrupt, it often takes a while for my terminal to become responsive to normal commands again; not sure if that helps to pin down the problem.

LRudL commented 10 months ago

I also have this issue. It is not about rate limits: it happens even on datasets that are definitely below both the tokens-per-minute and requests-per-minute limits. However, it only seems to show up for large datasets.

An example of the error trace when I press Ctrl+C twice to exit after it gets stuck for a long time:

Traceback (most recent call last):
  File "/home/lrudl/miniconda3/envs/evalg2/lib/python3.10/multiprocessing/pool.py", line 856, in next
    item = self._items.popleft()
IndexError: pop from an empty deque

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
[...]
  File "/home/lrudl/[...]/evals/evals/cli/oaieval.py", line 223, in run
    result = eval.run(recorder)
  File "/home/lrudl/[...]/evals/evals/elsuite/modelgraded/classify.py", line 107, in run
    self.eval_all_samples(recorder, samples)
  File "/home/lrudl/[...]/evals/evals/eval.py", line 146, in eval_all_samples
    idx_and_result = list(tqdm(iter, total=len(work_items), disable=not show_progress))
  File "/home/lrudl/miniconda3/envs/evalg2/lib/python3.10/site-packages/tqdm/std.py", line 1182, in __iter__
    for obj in iterable:
  File "/home/lrudl/miniconda3/envs/evalg2/lib/python3.10/multiprocessing/pool.py", line 861, in next
    self._cond.wait(timeout)
  File "/home/lrudl/miniconda3/envs/evalg2/lib/python3.10/threading.py", line 320, in wait
    waiter.acquire()
KeyboardInterrupt

^CException ignored in: <module 'threading' from '/home/lrudl/miniconda3/envs/evalg2/lib/python3.10/threading.py'>
Traceback (most recent call last):
  File "/home/lrudl/miniconda3/envs/evalg2/lib/python3.10/threading.py", line 1537, in _shutdown
    atexit_call()
  File "/home/lrudl/miniconda3/envs/evalg2/lib/python3.10/concurrent/futures/thread.py", line 31, in _python_exit
    t.join()
  File "/home/lrudl/miniconda3/envs/evalg2/lib/python3.10/threading.py", line 1096, in join
    self._wait_for_tstate_lock()
  File "/home/lrudl/miniconda3/envs/evalg2/lib/python3.10/threading.py", line 1116, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):
KeyboardInterrupt: 

Often all I need to do is retry a few times for the run to eventually complete, but: (1) this massively increases the token cost, and (2) it makes it difficult to run many evals in sequence from a script, because you need to manually supervise the run and get it unstuck many times. This is a major time cost for big eval projects.

katsuya commented 10 months ago

It seems that this issue is related to a bug in tqdm, as discussed at https://github.com/tqdm/tqdm/issues/627. Applying the following patch, which drops the tqdm wrapper around the result iterator, significantly improved the situation (an alternative that keeps the progress bar is sketched after the patch).

diff -urN a/.venv/lib/python3.11/site-packages/evals/eval.py b/.venv/lib/python3.11/site-packages/evals/eval.py
--- a/.venv/lib/python3.11/site-packages/evals/eval.py  2023-11-29 12:55:58.214648049 +0900
+++ b/.venv/lib/python3.11/site-packages/evals/eval.py  2023-11-29 12:56:05.630671841 +0900
@@ -143,7 +143,8 @@
             else:
                 logger.info(f"Running in threaded mode with {threads} threads!")
                 iter = pool.imap_unordered(eval_sample, work_items)
-            idx_and_result = list(tqdm(iter, total=len(work_items), disable=not show_progress))
+            # idx_and_result = list(tqdm(iter, total=len(work_items), disable=not show_progress))
+            idx_and_result = list(iter)
         return [r for _, r in sorted(idx_and_result)]

     def get_samples(self):
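
If losing the progress bar is undesirable, the tqdm issue linked above also discusses disabling tqdm's monitor thread, the component suspected of deadlocking at interpreter shutdown. A minimal sketch of that alternative, assuming the monitor thread is in fact the culprit in these runs (untested against every evals version):

from tqdm import tqdm

# Disable tqdm's monitor thread before any progress bars are created.
# tqdm/tqdm#627 implicates this thread in hangs at interpreter shutdown;
# setting monitor_interval to 0 keeps the bar but never spawns the thread.
tqdm.monitor_interval = 0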

isc-Shiva-Gupta commented 8 months ago

I also had this issue. A workaround I found is to set the EVALS_THREADS_TIMEOUT environment variable when running the command. It specifies the time (in seconds) allowed for each sample sent to the model to run. It can be used as follows:

EVALS_THREADS_TIMEOUT=20 oaieval completion_fn eval_name
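
Combined with the timeout, the manual-supervision problem described above can be reduced with a simple retry wrapper. A minimal sketch, assuming oaieval exits with a non-zero status when a sample hits the timeout; the eval names and completion function below are placeholders:

import os
import subprocess
import time

# Placeholder eval names and completion function; substitute your own.
EVALS = ["eval_one", "eval_two"]
COMPLETION_FN = "gpt-3.5-turbo"

# Give each sample 40 seconds so a silent hang becomes a failure
# that the loop can retry instead of blocking forever.
env = dict(os.environ, EVALS_THREADS_TIMEOUT="40")

for eval_name in EVALS:
    while True:
        result = subprocess.run(["oaieval", COMPLETION_FN, eval_name], env=env)
        if result.returncode == 0:
            break
        print(f"{eval_name} failed or timed out; retrying...")
        time.sleep(5)

Note that each retry re-runs the whole eval, so this automates the manual loop rather than avoiding the repeated token cost.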