princeton-nlp / SWE-bench

[ICLR 2024] SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
https://www.swebench.com
MIT License

Reproducer Docker image #113

Closed zygi closed 1 week ago

zygi commented 2 months ago

Describe the feature

Hi! Thanks for all the work. After the 04/15 patch I can now reproduce most of the SWE-bench instances using the default harness. However, I'm still having trouble with (at least) Flask and Scikit-Learn, where environment setup fails because of what I suspect is a Cython version mismatch. This fails even in a clean-slate Docker environment (example attached).

However, in your Repair Report (https://github.com/princeton-nlp/SWE-bench/blob/main/docs/20240415_eval_bug/README.md) you mention that you successfully reproduced evaluation over the whole dataset. So either I'm doing something uniquely wrong, or the process still depends on the host environment and the environment you're using is special in some way. I'd like to figure out which of these is the case :) It would be great if you could share more operational details about your test-running process - the environment, the exact scripts, or ideally even a Docker image that runs it.

Thanks!
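In case it helps narrow down whether the host matters, here is a minimal diagnostic sketch I could run on both machines and diff the output (all of this is my own code and naming, not part of the harness):

```python
# Minimal host-report sketch (not part of SWE-bench): capture the details most
# likely to explain environment-dependent build failures, so the output can be
# diffed between my Docker container and your evaluation machine.
import platform
import sys


def host_report():
    """Collect interpreter and OS facts relevant to building old C extensions."""
    return {
        "python": sys.version.split()[0],                 # interpreter driving the harness
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),                  # OS + kernel version
        "machine": platform.machine(),                    # e.g. x86_64 vs arm64
    }


if __name__ == "__main__":
    for key, value in host_report().items():
        print(f"{key}: {value}")
```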


Repro of my failing attempt to set up the harness for scikit-learn:

- test_script.py:
```python
from swebench.harness.context_manager import TestbedContextManager
from swebench.metrics.getters import get_eval_refs

if __name__ == "__main__":
    insts = get_eval_refs("princeton-nlp/SWE-bench")

    # only take scikit-learn for repro
    insts = {k: v for (k, v) in insts.items() if v["repo"].endswith("scikit-learn")}

    # simply create the context manager
    # Note: leaving both `conda_link` and `path_conda` empty to use the default logic, whatever it is
    tcm = TestbedContextManager(
        list(insts.values()),
        "/tmp/swebench_logs",
        testbed="/tmp/swebench_eval_dir/testbed",
    )

    # just enter it and print all tasks
    with tcm:
        distributed_task_list = tcm.get_distributed_tasks()
        for task_list in distributed_task_list:
            print(
                f"{task_list['testbed']}: {len(task_list['task_instances'])} instances"
            )
```

- Dockerfile:
```docker
FROM continuumio/miniconda3
WORKDIR /workdir

RUN git clone https://github.com/princeton-nlp/SWE-bench /workdir
RUN conda env create -f environment.yml
RUN echo "conda activate swe-bench" >> ~/.bashrc

# pre-cache the SWE-bench HF dataset to avoid re-downloading it every time
RUN conda run -n swe-bench python -c 'from swebench.metrics.getters import get_eval_refs; get_eval_refs("princeton-nlp/SWE-bench")'

COPY test_script.py test_script.py
```

- Error log:
```
× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

2024-05-01 23:48:09,346 - testbed - ERROR - Error traceback: Traceback (most recent call last):
  File "/workdir/swebench/harness/context_manager.py", line 82, in __call__
    output = subprocess.run(cmd, **combined_args)
  File "/opt/conda/envs/swe-bench/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '. /tmp/tmpy54mprlj/miniconda3/bin/activate scikit-learn__scikit-learn__0.20 && pip install numpy==1.19.2 scipy==1.5.2' returned non-zero exit status 1.

Traceback (most recent call last):
  File "/workdir/test_script.py", line 18, in <module>
    with tcm:
  File "/workdir/swebench/harness/context_manager.py", line 403, in __enter__
    self.exec(cmd, shell=True)
  File "/workdir/swebench/harness/context_manager.py", line 95, in __call__
    raise e
  File "/workdir/swebench/harness/context_manager.py", line 82, in __call__
    output = subprocess.run(cmd, **combined_args)
  File "/opt/conda/envs/swe-bench/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '. /tmp/tmpy54mprlj/miniconda3/bin/activate scikit-learn__scikit-learn__0.20 && pip install numpy==1.19.2 scipy==1.5.2' returned non-zero exit status 1.
```



### Potential Solutions

Would it be possible for you to include a full command/script that, when run in a clean environment, sets up each instance and confirms that the golden solution correctly solves it?
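For example, a script along these lines could turn the dataset itself into a "gold" predictions file for the evaluation harness. The field names below follow the SWE-bench predictions format (`instance_id`, `model_name_or_path`, `model_patch`); the demo instance id is a stand-in, and the exact harness invocation that consumes the file is left out:

```python
# Sketch: build a predictions file where every "model" patch is the instance's
# own gold patch, so a healthy setup should report every instance resolved.
# The dict keys follow the SWE-bench predictions format; the demo instance
# below is a placeholder, not a real dataset row.
import json


def gold_predictions(instances):
    """Map each task instance to a prediction that replays its gold patch."""
    return [
        {
            "instance_id": inst["instance_id"],
            "model_name_or_path": "gold",
            "model_patch": inst["patch"],
        }
        for inst in instances
    ]


if __name__ == "__main__":
    # Stand-in instance; real ones come from get_eval_refs("princeton-nlp/SWE-bench").
    demo = [{"instance_id": "example__example-1", "patch": "diff --git a/... b/..."}]
    with open("gold_preds.json", "w") as f:
        json.dump(gold_predictions(demo), f, indent=2)
```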
gnohgnailoug commented 1 week ago

I hit the same problem when reproducing the validation step. How can it be solved, and why does it happen?