princeton-nlp / SWE-bench

[ICLR 2024] SWE-Bench: Can Language Models Resolve Real-world Github Issues?
https://www.swebench.com
MIT License

docker evaluation gets stuck #157

Open crhf opened 5 days ago

crhf commented 5 days ago

Describe the bug

Thanks for making the containerized evaluation environment, which will make evaluation easier and more accurate! However, while I was trying it out, the containerized evaluation always got stuck partway through. Did I miss something?

Steps/Code to Reproduce

python -m swebench.harness.run_evaluation --predictions_path /home/haifeng/projects/acr/experiment/06-22-lite/predictions_for_swebench.json --max_workers 28 --cache_level env --run_id princeton2

Expected Results

All the predictions get evaluated.

Actual Results

Evaluation got stuck in the middle:

Running 296 unevaluated instances...
Base image sweb.base.x86_64:latest already exists, skipping build.
Base images built successfully.
No environment images need to be built.
Running 296 instances...
 91%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                | 270/296 [23:25:03<2:15:18, 312.23s/it]

The progress bar hung here for tens of hours. Pressing Ctrl+C gave the following:

Traceback (most recent call last):
  File "/media/media0/haifeng/projects/SWE-bench-princeton/swebench/harness/run_evaluation.py", line 252, in run_instances
    for future in as_completed(futures):
  File "/home/haifeng/miniconda3/envs/swe-bench/lib/python3.9/concurrent/futures/_base.py", line 245, in as_completed
    waiter.event.wait(wait_timeout)
  File "/home/haifeng/miniconda3/envs/swe-bench/lib/python3.9/threading.py", line 581, in wait
    signaled = self._cond.wait(timeout)
  File "/home/haifeng/miniconda3/envs/swe-bench/lib/python3.9/threading.py", line 312, in wait
    waiter.acquire()
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/haifeng/miniconda3/envs/swe-bench/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/haifeng/miniconda3/envs/swe-bench/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/media/media0/haifeng/projects/SWE-bench-princeton/swebench/harness/run_evaluation.py", line 529, in <module>
    main(**vars(args))
  File "/media/media0/haifeng/projects/SWE-bench-princeton/swebench/harness/run_evaluation.py", line 493, in main
    run_instances(predictions, dataset, cache_level, clean, force_rebuild, max_workers, run_id, timeout)
  File "/media/media0/haifeng/projects/SWE-bench-princeton/swebench/harness/run_evaluation.py", line 262, in run_instances
    continue
  File "/home/haifeng/miniconda3/envs/swe-bench/lib/python3.9/concurrent/futures/_base.py", line 637, in __exit__
    self.shutdown(wait=True)
  File "/home/haifeng/miniconda3/envs/swe-bench/lib/python3.9/concurrent/futures/thread.py", line 235, in shutdown
    t.join()
  File "/home/haifeng/miniconda3/envs/swe-bench/lib/python3.9/threading.py", line 1060, in join
    self._wait_for_tstate_lock()
  File "/home/haifeng/miniconda3/envs/swe-bench/lib/python3.9/threading.py", line 1080, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):

I tried three times, and the evaluations all hung at different points.
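
For reference, the Ctrl+C traceback shows the main thread blocked in as_completed(futures), which waits indefinitely when no timeout is passed; the second traceback is just the executor's __exit__ calling shutdown(wait=True) and joining the worker threads that are still stuck. A minimal sketch of how a wall-clock deadline could turn the silent hang into a TimeoutError that names the unfinished instances (the futures-to-instance_id mapping below is an assumption for illustration, not the harness's actual bookkeeping):

    # Sketch only: put a deadline on as_completed() so a stuck worker surfaces as
    # an error instead of an indefinite hang. The futures -> instance_id mapping
    # is hypothetical; run_instances() keeps its own bookkeeping.
    from concurrent.futures import ThreadPoolExecutor, as_completed, TimeoutError

    def run_with_deadline(instances, run_one, max_workers=28, deadline_s=4 * 3600):
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            futures = {executor.submit(run_one, inst): inst["instance_id"] for inst in instances}
            try:
                for future in as_completed(futures, timeout=deadline_s):
                    future.result()
            except TimeoutError:
                stuck = [iid for fut, iid in futures.items() if not fut.done()]
                print(f"Still not finished after {deadline_s}s: {stuck}")
                # Note: leaving the with-block still calls shutdown(wait=True), so the
                # stuck Docker calls themselves would also need to be cancelled/killed.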

System Information

Ubuntu 20.04.6 LTS, Python 3.9, swebench 68d8059, more than 100 CPUs

klieret commented 5 days ago

Currently, the per-instance timeout only seems to apply to running eval.sh in the container. I wonder if some of the cleanup steps are hanging instead...
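
If that's the case, one option would be to bound the Docker cleanup calls themselves. A minimal sketch of what that could look like with docker-py; the client construction and the cleanup helper here are illustrative assumptions, not the actual swebench.harness code:

    # Sketch only: bound the cleanup calls, assuming the harness talks to Docker
    # via docker-py. Names below are hypothetical, not the harness's own helpers.
    import docker

    # Per-request API timeout (seconds): a stop()/remove() call that never returns
    # would raise a requests timeout error instead of blocking forever.
    client = docker.from_env(timeout=120)

    def cleanup_container(container_name):
        try:
            container = client.containers.get(container_name)
            container.stop(timeout=30)   # grace period before SIGKILL
            container.remove(force=True)
        except docker.errors.NotFound:
            pass  # already gone

A client-level timeout only covers individual API requests, so a call that is streaming output could still block; it would at least rule the stop/remove steps in or out.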

crhf commented 3 days ago

Does this work on your end? I still get the same problem.

john-b-yang commented 2 days ago

Hmm, that's a bit weird; we haven't seen this before.

There are some task instances that we know take a very long time to run (e.g. some of the scikit-learn ones). However, the longest we've seen one take is about 3 hours. Also, none of the very long-running instances are in the lite split, which looks like what you're evaluating on.

It's a bit difficult to diagnose without knowing which task instances the evaluation is stuck on. Would you happen to have this info? Or, given that 270/296 finished running, which are the 296 - 270 = 26 instances that haven't finished?
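
In case it helps narrow things down, a rough way to find the in-flight instances is to list the evaluation containers that are still running and/or diff the predictions against whatever per-instance reports have been written. The "sweb.eval" container-name prefix and the logs/run_evaluation/<run_id>/... layout below are assumptions, so adjust them to whatever docker ps and your logs/ directory actually show:

    # Sketch only: identify instances that haven't finished. The container-name
    # prefix and report layout are assumptions -- check `docker ps` and logs/.
    import json
    from pathlib import Path
    import docker

    client = docker.from_env()

    # 1) Evaluation containers that are still running.
    running = [c.name for c in client.containers.list() if c.name.startswith("sweb.eval")]
    print("still running:", running)

    # 2) Predictions with no per-instance report written yet (hypothetical layout).
    predictions = json.load(open("predictions_for_swebench.json"))
    log_root = Path("logs/run_evaluation/princeton2")
    done = {p.parent.name for p in log_root.glob("**/report.json")}
    missing = [p["instance_id"] for p in predictions if p["instance_id"] not in done]
    print("no report yet:", missing)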