princeton-nlp / SWE-bench

[ICLR 2024] SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
https://www.swebench.com
MIT License

Unable to replicate basic results #74

Closed: bytesuji closed this issue 2 weeks ago

bytesuji commented 3 months ago

I am attempting to run SWE-bench on a small 20-sample subset of the test dataset. The instance IDs in question are listed below (a sketch of how such a subset can be loaded follows the list):

django__django-11299
django__django-11618
django__django-12148
django__django-13347
django__django-14109
django__django-14334
django__django-15572
django__django-16873
matplotlib__matplotlib-23562
psf__requests-2873
pylint-dev__pylint-6556
scikit-learn__scikit-learn-10377
scikit-learn__scikit-learn-11315
scikit-learn__scikit-learn-12938
sphinx-doc__sphinx-9260
sympy__sympy-12881
sympy__sympy-13744
sympy__sympy-15542
sympy__sympy-15599
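
For reference, here is a minimal sketch of selecting such a subset, assuming the Hugging Face `datasets` library and the public `princeton-nlp/SWE-bench` dataset (the `instance_id` field name follows the dataset schema):

```python
# Minimal sketch: select a small subset of SWE-bench test instances by ID.
from datasets import load_dataset

INSTANCE_IDS = {
    "django__django-11299",
    "matplotlib__matplotlib-23562",
    "sympy__sympy-15599",
    # ... plus the remaining IDs from the list above ...
}

test = load_dataset("princeton-nlp/SWE-bench", split="test")
subset = test.filter(lambda row: row["instance_id"] in INSTANCE_IDS)
print(f"Selected {len(subset)} instances")
```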

I ran the evaluation using OpenDevin's Dockerfile and noticed a number of build issues in the logs, including the one mentioned in issue #57. The final report produced by the metric script is shown below.

gold_patch_test Evaluation Report:
        None:      0
        Generated: 20
        With Logs: 20
        Applied:   17
        Resolved:  0

Why, for the gold patches provided by the dataset itself, could only 17 be applied and zero be counted as resolved?
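
For context: a gold-patch run feeds the dataset's own reference patches back through the harness, so in principle every instance should apply and resolve. A minimal sketch of building such a predictions file, assuming the dataset's `patch` field holds the gold patch and that the harness expects `instance_id` / `model_name_or_path` / `model_patch` keys:

```python
# Hedged sketch: build a "gold" predictions file from the dataset's own
# reference patches, as a sanity check of the evaluation harness.
import json
from datasets import load_dataset

test = load_dataset("princeton-nlp/SWE-bench", split="test")
preds = [
    {
        "instance_id": row["instance_id"],
        "model_name_or_path": "gold",
        "model_patch": row["patch"],  # the reference patch from the dataset
    }
    for row in test
]
with open("gold_predictions.json", "w") as f:
    json.dump(preds, f)
```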

bytesuji commented 3 months ago

Tried running the same evaluation using SWE-agent's run_eval.sh script and got this:

❌ Evaluation failed: Command '. /home/ubuntu/SWE-agent/evaluation/testbed/predictions/sphinx-doc__sphinx/4.1/tmpe4u_b189/miniconda3/bin/activate sphinx-doc__sphinx__4.1 && 
conda install gxx_linux-64 gcc_linux-64 make -y' returned non-zero exit status 2.
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/swe-agent/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/ubuntu/miniconda3/envs/swe-agent/lib/python3.9/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/home/ubuntu/miniconda3/envs/swe-agent/lib/python3.9/site-packages/swebench/harness/engine_evaluation.py", line 167, in main
    setup_testbed(data_groups[0])
  File "/home/ubuntu/miniconda3/envs/swe-agent/lib/python3.9/site-packages/swebench/harness/engine_validation.py", line 90, in setup_testbed
    with TestbedContextManager(
  File "/home/ubuntu/miniconda3/envs/swe-agent/lib/python3.9/site-packages/swebench/harness/context_manager.py", line 364, in __enter__
    self.exec(cmd, shell=True)
  File "/home/ubuntu/miniconda3/envs/swe-agent/lib/python3.9/site-packages/swebench/harness/context_manager.py", line 59, in __call__
    raise e
  File "/home/ubuntu/miniconda3/envs/swe-agent/lib/python3.9/site-packages/swebench/harness/context_manager.py", line 51, in __call__
    output = subprocess.run(cmd, **combined_args)
  File "/home/ubuntu/miniconda3/envs/swe-agent/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '. /home/ubuntu/SWE-agent/evaluation/testbed/predictions/sphinx-doc__sphinx/4.1/tmpe4u_b189/miniconda3/bin/activate 
sphinx-doc__sphinx__4.1 && conda install gxx_linux-64 gcc_linux-64 make -y' returned non-zero exit status 2.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ubuntu/SWE-agent/evaluation/evaluation.py", line 59, in main
    run_evaluation(
  File "/home/ubuntu/miniconda3/envs/swe-agent/lib/python3.9/site-packages/swebench/harness/run_evaluation.py", line 203, in main
    pool.map(eval_engine, eval_args)
  File "/home/ubuntu/miniconda3/envs/swe-agent/lib/python3.9/multiprocessing/pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/ubuntu/miniconda3/envs/swe-agent/lib/python3.9/multiprocessing/pool.py", line 771, in get
    raise self._value
subprocess.CalledProcessError: Command '. /home/ubuntu/SWE-agent/evaluation/testbed/predictions/sphinx-doc__sphinx/4.1/tmpe4u_b189/miniconda3/bin/activate 
sphinx-doc__sphinx__4.1 && conda install gxx_linux-64 gcc_linux-64 make -y' returned non-zero exit status 2.

==================================
Log directory for evaluation run: results/predictions
== Evaluation Report ==
{'# Not Generated': 0, '# Generated': 20, '# Applied': 19, '# Resolved': 10, '# Install Fail': 4}
- Wrote per-instance scorecards to /home/ubuntu/OpenDevin/evaluation/SWE-bench/data/predictions/scorecards.json
- Wrote summary of run to /home/ubuntu/OpenDevin/evaluation/SWE-bench/data/predictions/results.json
Reference Report:
{'# Not Generated': 0, '# Generated': 20, '# Applied': 15, '# Resolved': 10, '# Install Fail': 4}
Exception ignored in: <function Pool.__del__ at 0x7f3e2d52b430>
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/swe-agent/lib/python3.9/multiprocessing/pool.py", line 268, in __del__
    self._change_notifier.put(None)
  File "/home/ubuntu/miniconda3/envs/swe-agent/lib/python3.9/multiprocessing/queues.py", line 377, in put
    self._writer.send_bytes(obj)
  File "/home/ubuntu/miniconda3/envs/swe-agent/lib/python3.9/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/home/ubuntu/miniconda3/envs/swe-agent/lib/python3.9/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/home/ubuntu/miniconda3/envs/swe-agent/lib/python3.9/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor
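
A hedged debugging sketch for failures like this: re-run the exact command from the traceback by hand and capture stderr, to see why `conda install gxx_linux-64 gcc_linux-64 make` exits with status 2 (the paths below come from the log above and will differ on other machines):

```python
# Re-run the failing testbed command outside the multiprocessing pool so the
# underlying conda error message is visible rather than swallowed by the
# pool's re-raised CalledProcessError. Paths are machine-specific.
import subprocess

cmd = (
    ". /home/ubuntu/SWE-agent/evaluation/testbed/predictions/"
    "sphinx-doc__sphinx/4.1/tmpe4u_b189/miniconda3/bin/activate "
    "sphinx-doc__sphinx__4.1 && "
    "conda install gxx_linux-64 gcc_linux-64 make -y"
)
result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
print("exit code:", result.returncode)
print(result.stderr)
```
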
john-b-yang commented 2 months ago

Hi @bytesuji, can you try running the evaluation again?

We have observed that a lot of people face similar challenges when setting up SWE-bench evaluation. We spent the last couple of weeks implementing and verifying a number of fixes and wrote them up in a report here.

The failures you're getting with sphinx not installing look very reminiscent of the failure modes we identified.

We'll also release the predictions and execution logs via the website later today, and you can use those to check your logs against what we're seeing.

john-b-yang commented 2 weeks ago

Closing due to inactivity. Thanks once again for the detailed issue report!

Just in case, the SWE-bench/experiments repository contains all the logs for validation of task instances + evaluation of model generations.

We're also planning to release a rigorously tested, dockerized version of the SWE-bench evaluation harness soon. If you're still interested in working on SWE-bench, I'd definitely encourage checking out the release later this week! 😄