swe-bench eval stops running after a point

ssh-randy commented 1 month ago

Describe the bug

I'm trying to run SWE-bench remotely on a Google Compute Engine VM (an n2-highmem-4 using the common-cpu image). However, no matter how I configure it, it always seems to fail at around 170 test cases with this error:

subprocess.CalledProcessError: Command '. /workspace/GPT-4-Turbo/django/3.2/tmpu4_g7e7p/miniconda3/bin/activate djangodjango3.2 && echo 'activate successful' && pip install -r /workspace/GPT-4-Turbo/django/3.2/tmp91t3yumh/requirements.txt' returned non-zero exit status 1.

I'm running the latest version of SWE-bench, and I'm building it from the same image that the SWE-agent repository uses. Happy to post any logs as well if you'd like to try to debug this.

Steps/Code to Reproduce

    run_evaluation(
        predictions_path="{INPUT_PATH}",
        log_dir=".",
        swe_bench_tasks="princeton-nlp/SWE-bench_Lite",
        testbed=".",
        conda_link=None,
        log_suffix=None,
        skip_existing=True,
        timeout=1200,
        verbose=True,
        num_processes=4
    )

Note that I'm currently running this on a Ray cluster using the source image projects/deeplearning-platform-release/global/images/family/common-cpu on an x86 machine in GCE. I've tried running this both on a single machine and sharded across multiple nodes, and in either case I end up hitting this issue.

Ultimately, running the golden patches, I only get 169 resolved issues.

Expected Results

300 Resolved GT

Actual Results

169 Resolved GT

System Information

No response

ssh-randy commented 1 month ago

I've tried rerunning this on my local MacBook and see similar issues there as well. It's a 2019 x86 MBP.

Trying with a Docker image now to see if that works better.

ssh-randy commented 1 month ago

Ah, I'm seeing errors like these:

2024-05-14 08:06:19,824 - testbed_context_manager - ERROR - Error stderr: /bin/sh: 5: /home/swe-bench/harness/EVALS/output_gt/tb/GPT-4-Turbo/sphinx/5.0/tmphsxnpjxv/miniconda3/etc/conda/deactivate.d/deactivate-gxx_linux-64.sh: Syntax error: "(" unexpected

2024-05-14 08:06:19,837 - testbed_context_manager - ERROR - Error traceback: Traceback (most recent call last):
  File "/home/swe-bench/swebench/harness/context_manager.py", line 53, in call
    output = subprocess.run(cmd, **combined_args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '. /home/swe-bench/harness/EVALS/output_gt/tb/GPT-4-Turbo/sphinx/5.0/tmphsxnpjxv/miniconda3/bin/activate sphinx-docsphinx5.0 && conda install gxx_linux-64 gcc_linux-64 make -y' returned non-zero exit status 2.

This is running from within a Docker container, so I'm curious what may cause this.
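
One possible explanation, offered as an assumption rather than a confirmed diagnosis: the `/bin/sh: 5:` prefix in the stderr line shows the deactivate script being interpreted by `/bin/sh`. Python's `subprocess.run(..., shell=True)` uses `/bin/sh` by default on POSIX, and on Debian/Ubuntu-based images that is typically dash, which rejects bash-only syntax (for example the `function name() { ... }` form) with a `Syntax error: "(" unexpected` message. The sketch below is not SWE-bench code; `run_in_shell` and `BASH_ONLY_SNIPPET` are hypothetical names used only to show how the `executable` argument selects the shell.

```python
import os
import shutil
import subprocess

# Bash-style function definition. Strict POSIX shells such as dash do not
# recognise the `function` keyword and stop at the "(".
BASH_ONLY_SNIPPET = "function greet() { echo hello; }; greet"


def run_in_shell(cmd, shell_path=None):
    """Run `cmd` through a shell and capture its output.

    shell_path=None keeps subprocess's default shell (/bin/sh on POSIX);
    otherwise the given shell binary is used instead.
    """
    return subprocess.run(
        cmd,
        shell=True,
        executable=shell_path,
        capture_output=True,
        text=True,
    )


if __name__ == "__main__":
    # Show which binary /bin/sh actually points to on this machine.
    print("/bin/sh resolves to:", os.path.realpath("/bin/sh"))

    default = run_in_shell(BASH_ONLY_SNIPPET)
    print("default shell exit code:", default.returncode)
    print("default shell stderr:", default.stderr.strip())

    bash = shutil.which("bash")
    if bash:
        forced = run_in_shell(BASH_ONLY_SNIPPET, shell_path=bash)
        print("bash exit code:", forced.returncode)
        print("bash stdout:", forced.stdout.strip())
```

If the default shell fails on the snippet while bash succeeds, the container's `/bin/sh` is not bash. Whether the harness exposes a way to pass a different shell through to its `subprocess.run` call is not something this thread confirms.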

john-b-yang commented 2 weeks ago

Hi @ssh-randy, hmm, it's a bit difficult to diagnose specifically what's going on here and suggest a fix, as many of the failing calls don't look like they're caused by errors in the SWE-bench package itself.

This error looks like something went wrong while deactivating a conda environment:

Error stderr: /bin/sh: 5: /home/swe-bench/harness/EVALS/output_gt/tb/GPT-4-Turbo/sphinx/5.0/tmphsxnpjxv/miniconda3/etc/conda/deactivate.d/deactivate-gxx_linux-64.sh: Syntax error: "(" unexpected

I'm not quite sure how to resolve this or what to fix, unfortunately. However, we are going to release a revamp of the SWE-bench evaluation harness within the next 2 weeks that (1) integrates Docker containers into the evaluation process and (2) has been tested on ARM Macs.

I'm closing this issue for now, as I'm not quite sure what to do here. I can ping here when we release the new eval code and we can begin a new issue thread if you're still seeing problems then, if this is all right. Thanks so much again for the detailed error messages and sorry about the inconvenience.

ivan4722 commented 2 weeks ago

Did you find a fix for this by chance? I have the same error.
