irthomasthomas / undecidability


Introducing SWE-bench Verified | OpenAI #933

Open ShellLM opened 2 weeks ago

ShellLM commented 2 weeks ago

Introducing SWE-bench Verified | OpenAI

Snippet

"Introducing SWE-bench Verified We're releasing a human-validated subset of SWE-bench that more reliably evaluates AI models' ability to solve real-world software issues. Download SWE-bench Verified(opens in a new window)"

Background on SWE-bench

Each sample in the SWE-bench test set is created from a resolved GitHub issue in one of 12 open-source Python repositories on GitHub. Each sample has an associated pull request (PR), which includes both the solution code and unit tests to verify code correctness. These unit tests fail before the solution code in the PR is added, but pass afterwards, and are therefore called FAIL_TO_PASS tests. Each sample also has associated PASS_TO_PASS tests, which pass both before and after the PR is merged, and are used to check that existing unrelated functionality in the codebase has not been broken by the PR.
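The per-sample structure described above can be sketched as a small data type. This is an illustrative schema, not the benchmark's exact field names:

```python
from dataclasses import dataclass, field

@dataclass
class SWEBenchSample:
    """One SWE-bench sample, built from a resolved GitHub issue (illustrative schema)."""
    repo: str                 # one of the 12 open-source Python repositories
    problem_statement: str    # original GitHub issue text, shown to the agent
    patch: str                # gold solution code from the associated PR (hidden from the agent)
    fail_to_pass: list[str] = field(default_factory=list)  # tests that fail before the PR, pass after
    pass_to_pass: list[str] = field(default_factory=list)  # tests that pass both before and after

# Hypothetical sample for illustration only:
sample = SWEBenchSample(
    repo="astropy/astropy",
    problem_statement="Unit conversion raises an unexpected error ...",
    patch="diff --git a/astropy/units/core.py ...",
    fail_to_pass=["test_units.py::test_conversion_regression"],
    pass_to_pass=["test_units.py::test_basic_conversion"],
)
```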

For each sample in SWE-bench, agents are provided with the original text from the GitHub issue, known as the problem statement, and are given access to the codebase. Given these, agents must edit the files in the codebase to resolve the issue. The tests are not shown to the agent.

A proposed edit is evaluated by running both the FAIL_TO_PASS and PASS_TO_PASS tests. If the FAIL_TO_PASS tests pass, this means the edit solves the issue. If the PASS_TO_PASS tests pass, then the edit has not inadvertently broken unrelated sections of the codebase. Both sets of tests are required to pass for the edit to fully resolve the original GitHub issue.
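The grading rule reduces to a conjunction over the two test sets. A minimal sketch, assuming per-test pass/fail results are already collected (the function and argument names are mine, not the harness's API):

```python
def is_resolved(fail_to_pass_results: dict[str, bool],
                pass_to_pass_results: dict[str, bool]) -> bool:
    """An edit resolves the issue only if every FAIL_TO_PASS test now passes
    (the issue is fixed) AND every PASS_TO_PASS test still passes
    (nothing unrelated broke)."""
    return all(fail_to_pass_results.values()) and all(pass_to_pass_results.values())

# The fix works, but it broke an unrelated test -> not resolved.
print(is_resolved({"test_fix": True}, {"test_other": False}))  # False
print(is_resolved({"test_fix": True}, {"test_other": True}))   # True
```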

Adapting SWE-bench as a Preparedness Evaluation

Given the potential relevance of SWE-bench for the Preparedness Framework, we aimed to find ways in which we could improve the robustness and reliability of the benchmark. We identified three major areas for improvement:

  1. The unit tests used to evaluate the correctness of a solution are often overly specific, and in some cases are even unrelated to the issue. This potentially causes correct solutions to be rejected.
  2. Many samples have an issue description that is underspecified, leading to ambiguity on what the problem is and how it should be solved.
  3. It is sometimes difficult to reliably set up the SWE-bench development environments for the agents, inadvertently causing unit tests to fail regardless of the solution. In such cases, perfectly valid solutions might be graded as incorrect.

SWE-bench Verified

To address these issues, we launched a human annotation campaign with professional software developers to screen each sample of the SWE-bench test set for appropriately scoped unit tests and well-specified issue descriptions.

Together with the authors of SWE-bench, we are releasing SWE-bench Verified: a subset of the original test set from SWE-bench, consisting of 500 samples verified to be non-problematic by our human annotators. This version supersedes the original SWE-bench and SWE-bench Lite test sets. Additionally, we are releasing our human annotations for all SWE-bench test samples.

We also collaborated with the SWE-bench authors to develop a new evaluation harness for SWE-bench, which uses containerized Docker environments to make evaluating on SWE-bench easier and more reliable.

On SWE-bench Verified, GPT-4o resolves 33.2% of samples using Agentless, the best-performing open-source scaffold, doubling its previous score of 16% on the original SWE-bench.

Our Approach

We worked with 93 software developers experienced in Python to manually screen SWE-bench samples for quality. We annotated 1,699 randomly selected samples from the SWE-bench test set to produce SWE-bench Verified; the following analysis is based on these 1,699 samples.

We annotate samples to capture:

  1. Whether the issue description is underspecified, and hence unfair to test on.
  2. Whether the FAIL_TO_PASS unit tests filter out valid solutions.

Each annotation criterion is labeled on a [0, 1, 2, 3] scale of increasing severity. Labels 0 and 1 are minor; labels 2 and 3 are severe and indicate that the sample is inadequate in some way and should be discarded.

Additionally, we rate the difficulty of each sample by having annotators estimate how long it would take for a developer to decide upon and implement the solution, assuming the sample is non-problematic. Finally, we provide a freeform input option to flag any other major issues with the sample.
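Under the labeling scheme above, the filtering step can be sketched as: keep a sample only if no criterion received a severe label. The criterion names and function below are illustrative, not the annotation pipeline's actual code:

```python
SEVERE_LABELS = {2, 3}  # labels that mark a sample as inadequate

def keep_sample(labels: dict[str, int]) -> bool:
    """Keep the sample only if every annotation criterion is labeled 0 or 1."""
    return not any(label in SEVERE_LABELS for label in labels.values())

print(keep_sample({"underspecified": 0, "unfair_tests": 1}))  # True: minor issues only
print(keep_sample({"underspecified": 0, "unfair_tests": 2}))  # False: severe, discard
```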

Annotation Results

We see that 38.3% of samples were flagged for underspecified problem statements, and 61.1% were flagged for unit tests that may unfairly mark valid solutions as incorrect. Overall, our annotation process resulted in 68.3% of SWE-bench samples being filtered out due to underspecification, unfair unit tests, or other issues.
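As a back-of-the-envelope check on the reported figures (my own arithmetic, not from the post): filtering 68.3% of the 1,699 annotated samples leaves roughly 539, which is consistent with a final verified set of 500.

```python
annotated = 1_699          # samples annotated from the SWE-bench test set
filtered_fraction = 0.683  # fraction filtered out (underspecification, unfair tests, etc.)

remaining = round(annotated * (1 - filtered_fraction))
print(remaining)  # 539 samples survive filtering; 500 make up SWE-bench Verified
```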

Performance on SWE-bench Verified

We found that GPT-4o's performance on the best-performing scaffold reaches 33.2% on SWE-bench Verified, more than doubling its score of 16% on the original SWE-bench. In general, this validates our initial suspicion that the original SWE-bench dataset underestimates agent abilities.

Discussion & Limitations

We believe in an empirical and scientific approach to tracking and protecting against catastrophic risk. Building and continually improving evaluations is a key element of this work. There remains much to be done, and we're eager to see more work from the community in contributing valuable benchmarks like SWE-bench.

Data downloads

Authors

Neil Chowdhury, James Aung, Chan Jun Shern, Oliver Jaffe, Dane Sherburn, Giulio Starace, Evan Mays, Rachel Dias, Marwan Aljubeh, Mia Glaese, Carlos E. Jimenez, John Yang, Kevin Liu, Aleksander Madry

Acknowledgements

We're grateful to Carlos Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan for developing the original SWE-bench benchmark; the Preparedness team for supporting this work; Tao Lin, who initially pointed out many of these issues; Ian Kivlichan and Sarah Schwettmann for feedback on an earlier version of this manuscript; and the many human annotators who helped create SWE-bench Verified.

Suggested labels

None

ShellLM commented 2 weeks ago

Related content

- #908 similarity score: 0.91
- #758 similarity score: 0.88
- #812 similarity score: 0.88
- #915 similarity score: 0.88
- #887 similarity score: 0.86
- #309 similarity score: 0.85