Closed by bytesuji 2 weeks ago
Tried running the same evaluation using SWE-agent's run_eval.sh script and got this:
❌ Evaluation failed: Command '. /home/ubuntu/SWE-agent/evaluation/testbed/predictions/sphinx-doc__sphinx/4.1/tmpe4u_b189/miniconda3/bin/activate sphinx-doc__sphinx__4.1 && conda install gxx_linux-64 gcc_linux-64 make -y' returned non-zero exit status 2.
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/ubuntu/miniconda3/envs/swe-agent/lib/python3.9/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/home/ubuntu/miniconda3/envs/swe-agent/lib/python3.9/multiprocessing/pool.py", line 48, in mapstar
return list(map(*args))
File "/home/ubuntu/miniconda3/envs/swe-agent/lib/python3.9/site-packages/swebench/harness/engine_evaluation.py", line 167, in main
setup_testbed(data_groups[0])
File "/home/ubuntu/miniconda3/envs/swe-agent/lib/python3.9/site-packages/swebench/harness/engine_validation.py", line 90, in setup_testbed
with TestbedContextManager(
File "/home/ubuntu/miniconda3/envs/swe-agent/lib/python3.9/site-packages/swebench/harness/context_manager.py", line 364, in __enter__
self.exec(cmd, shell=True)
File "/home/ubuntu/miniconda3/envs/swe-agent/lib/python3.9/site-packages/swebench/harness/context_manager.py", line 59, in __call__
raise e
File "/home/ubuntu/miniconda3/envs/swe-agent/lib/python3.9/site-packages/swebench/harness/context_manager.py", line 51, in __call__
output = subprocess.run(cmd, **combined_args)
File "/home/ubuntu/miniconda3/envs/swe-agent/lib/python3.9/subprocess.py", line 528, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '. /home/ubuntu/SWE-agent/evaluation/testbed/predictions/sphinx-doc__sphinx/4.1/tmpe4u_b189/miniconda3/bin/activate sphinx-doc__sphinx__4.1 && conda install gxx_linux-64 gcc_linux-64 make -y' returned non-zero exit status 2.
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/ubuntu/SWE-agent/evaluation/evaluation.py", line 59, in main
run_evaluation(
File "/home/ubuntu/miniconda3/envs/swe-agent/lib/python3.9/site-packages/swebench/harness/run_evaluation.py", line 203, in main
pool.map(eval_engine, eval_args)
File "/home/ubuntu/miniconda3/envs/swe-agent/lib/python3.9/multiprocessing/pool.py", line 364, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/home/ubuntu/miniconda3/envs/swe-agent/lib/python3.9/multiprocessing/pool.py", line 771, in get
raise self._value
subprocess.CalledProcessError: Command '. /home/ubuntu/SWE-agent/evaluation/testbed/predictions/sphinx-doc__sphinx/4.1/tmpe4u_b189/miniconda3/bin/activate sphinx-doc__sphinx__4.1 && conda install gxx_linux-64 gcc_linux-64 make -y' returned non-zero exit status 2.
==================================
Log directory for evaluation run: results/predictions
== Evaluation Report ==
{'# Not Generated': 0, '# Generated': 20, '# Applied': 19, '# Resolved': 10, '# Install Fail': 4}
- Wrote per-instance scorecards to /home/ubuntu/OpenDevin/evaluation/SWE-bench/data/predictions/scorecards.json
- Wrote summary of run to /home/ubuntu/OpenDevin/evaluation/SWE-bench/data/predictions/results.json
Reference Report:
{'# Not Generated': 0, '# Generated': 20, '# Applied': 15, '# Resolved': 10, '# Install Fail': 4}
Exception ignored in: <function Pool.__del__ at 0x7f3e2d52b430>
Traceback (most recent call last):
File "/home/ubuntu/miniconda3/envs/swe-agent/lib/python3.9/multiprocessing/pool.py", line 268, in __del__
self._change_notifier.put(None)
File "/home/ubuntu/miniconda3/envs/swe-agent/lib/python3.9/multiprocessing/queues.py", line 377, in put
self._writer.send_bytes(obj)
File "/home/ubuntu/miniconda3/envs/swe-agent/lib/python3.9/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/home/ubuntu/miniconda3/envs/swe-agent/lib/python3.9/multiprocessing/connection.py", line 411, in _send_bytes
self._send(header + buf)
File "/home/ubuntu/miniconda3/envs/swe-agent/lib/python3.9/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor
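For reference, the two report dictionaries printed in the log above can be compared field by field. This is only a small sketch: the dictionaries are copied from the log, and `diff_reports` is a hypothetical helper name, not part of the SWE-bench harness.

```python
# Values copied verbatim from the "Evaluation Report" and "Reference Report"
# lines in the log above.
report = {'# Not Generated': 0, '# Generated': 20, '# Applied': 19,
          '# Resolved': 10, '# Install Fail': 4}
reference = {'# Not Generated': 0, '# Generated': 20, '# Applied': 15,
             '# Resolved': 10, '# Install Fail': 4}


def diff_reports(a, b):
    """Return {key: (value_in_a, value_in_b)} for every key whose values differ.

    Hypothetical helper for illustration; missing keys show up as None.
    """
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


differences = diff_reports(report, reference)
print(differences)
```

With these values, the only field that differs is `# Applied` (19 in the evaluation report, 15 in the reference report).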
Hi @bytesuji can you try running evaluation again?
We have observed that a lot of people are facing similar challenges when setting up SWE-bench evaluation. We spent the last couple of weeks implementing + verifying a lot of fixes and wrote them up in a report here.
The failures you're getting with sphinx not installing look very reminiscent of the failure modes we identify.
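The traceback in this thread reports only the exit status of the `conda install` step, not its output. One way to see why such a step fails is to re-run the command by hand and capture stderr. The sketch below assumes a POSIX shell; `run_and_capture` is a hypothetical helper, and the command shown is a stand-in that simply exits with status 2, not the real conda invocation.

```python
import subprocess


def run_and_capture(cmd):
    """Run a shell command and return (exit status, stdout, stderr).

    Hypothetical helper for illustration. Unlike check=True, this does not
    raise on a non-zero exit status, so the error output can be inspected.
    """
    proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return proc.returncode, proc.stdout, proc.stderr


# Stand-in for the failing step: writes to stderr and exits with status 2.
code, out, err = run_and_capture("echo 'simulated failure' >&2; exit 2")
print("exit status:", code)
print("stderr:", err.strip())
```

Substituting the full command quoted in the `CalledProcessError` message would show the actual conda error text, provided the temporary testbed directory still exists.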
We'll also release the predictions + execution logs via the website later today and you can use those to check your logs against what we're seeing.
Closing due to inactivity. Thanks once again for the detailed issue report!
Just in case, the SWE-bench/experiments repository contains all the logs for validation of task instances + evaluation of model generations.
We're also planning to release a dockerized version of the SWE-bench evaluation harness soon; it has been tested quite rigorously. If you're still interested in working on SWE-bench, I'd definitely encourage checking out the release later this week! 😄
I am attempting to run swe-bench on a small 20-sample subset of the test dataset. The instance IDs in question are:
I ran the evaluation using OpenDevin's Dockerfile and noticed a number of build issues in the logs including the one mentioned in issue #57. The final report obtained by the metric script is shown below.
Why is it that, for the gold patches provided by the dataset itself, only 17 could be applied and none were counted as resolved?