princeton-nlp / SWE-bench

[ICLR 2024] SWE-Bench: Can Language Models Resolve Real-world Github Issues?
https://www.swebench.com

swe-bench can get badly stuck in `future.result()` #158

Closed · klieret closed this 3 months ago

klieret commented 3 months ago

Describe the bug

This might be similar or identical to #157, but unlike the report there, I cannot terminate the program with ^C (I have to kill the python process from a different shell).

SWE-bench seems to get stuck specifically in the call to future.result().

I'm seeing this for pydicom__pydicom-1192 (but have yet to reproduce it with a fresh run).
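For context, my suspicion of the failure mode: if the submitted callable never returns, future.result() blocks forever, and ^C alone doesn't end the process, because the ThreadPoolExecutor context manager joins its (non-daemon) worker threads on exit. A minimal standalone sketch, not SWE-bench code:

import threading
from concurrent.futures import ThreadPoolExecutor

def hang():
    # Stands in for a stuck container/build call; blocks forever.
    threading.Event().wait()

with ThreadPoolExecutor(max_workers=1) as executor:
    future = executor.submit(hang)
    # Main thread blocks here indefinitely. ^C raises KeyboardInterrupt,
    # but the with-block's shutdown(wait=True) then joins the hung
    # worker, so the process still has to be killed from another shell.
    future.result()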

Steps/Code to Reproduce

python -m swebench.harness.run_evaluation \
        --predictions_path ../trajectories/klieret/azure-gpt4__dev23__default__t-0.00__p-0.95__c-3.00__install-1/all_preds.jsonl \
        --max_workers 1 \
        --run_id test \
        --dataset_name ../data/dev23.json

Expected Results

...

Actual Results

Running 1 unevaluated instances...
Base image sweb.base.x86_64:latest already exists, skipping build.
Base images built successfully.
No environment images need to be built.
Running 1 instances...
pydicom__pydicom-1192
  0%|                                                                               | 0/1 [00:00<?, ?it/s]Completed. updating pbar
100%|███████████████████████████████████████████████████████████████████████| 1/1 [00:23<00:00, 23.22s/it]done
Getting result

I added a few more print statements:

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Create a future for running each instance
        futures = {
            executor.submit(
                run_instance,
                test_spec,
                predictions[test_spec.instance_id],
                should_remove(
                    test_spec.instance_image_key,
                    cache_level,
                    clean,
                    existing_images,
                ),
                force_rebuild,
                client,
                run_id,
                timeout,
            ): None
            for test_spec in test_specs
        }
        # Wait for each future to complete
        for future in as_completed(futures):
            print("Completed. updating pbar")
            pbar.update(1)
            try:
                print("Getting result")
                # Update progress bar, check if instance ran successfully
                future.result()
                print("Done")
            except EvaluationError as e:
                print(f"EvaluationError {e.instance_id}: {e}")
                continue
            except Exception as e:
                print("Other exception")
                traceback.print_exc()
                continue
print("All instances run.")
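A standard-library way to see where the worker thread is actually stuck, instead of adding more prints (a sketch; the registration would go near the top of run_evaluation.py):

import faulthandler
import signal

# Dump every thread's stack trace to stderr on SIGUSR1, so a hung run
# can be inspected from another shell with `kill -USR1 <pid>`.
faulthandler.register(signal.SIGUSR1, all_threads=True)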

System Information

Linux bitbop 6.5.0-1020-aws #20~22.04.1-Ubuntu SMP Wed May 1 16:10:50 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

klieret commented 3 months ago

Things I've tried to fix this:

# No effect
future.result(timeout=10)
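That is consistent with the documented API: as_completed() only yields futures that are already done, so result() on them returns without waiting and the timeout never fires; and even a firing timeout only raises in the waiting thread without stopping the worker. A sketch of that behaviour, reusing future from the snippet above, not a fix:

from concurrent.futures import TimeoutError as FuturesTimeoutError

try:
    # Returns immediately if the future is already done.
    future.result(timeout=10)
except FuturesTimeoutError:
    # Only raised while the future is still pending; even then the
    # worker thread keeps running and shutdown will still join it.
    pass
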
klieret commented 3 months ago

No image build log is available, so this issue most likely comes from building the instance image.