princeton-nlp / SWE-bench

[ICLR 2024] SWE-Bench: Can Language Models Resolve Real-world Github Issues?
https://www.swebench.com
MIT License

Upper bound score by skilled human? #72

Closed · paul-gauthier closed this issue 3 weeks ago

paul-gauthier commented 5 months ago

Have you made any attempt to characterize a "skilled human upper bound" score?

I randomly sampled a dozen SWE-bench tasks myself, and found that many were basically impossible for a skilled human to “solve”, mainly because the tasks were under-specified with respect to the hidden test cases that determine passing. The tests check implementation-specific details from the repo’s PR that weren't actually stated requirements of the task.

I asked this question in the current HN discussion about SWE-agent, but it doesn't look like you are participating there.

https://news.ycombinator.com/item?id=39910452

john-b-yang commented 4 months ago

Hi @paul-gauthier, we did not have the resources to determine this number when putting forth the original SWE-bench.

Given that SWE-bench issues are collected from real pull requests that have been reviewed and accepted by human collaborators, we believe to some degree that these task instances are difficult, but not impossible, as they were completed by human contributors.

However, we understand your angle, particularly how some issue descriptions could be under-specified at face value. We have ongoing work that aims to better understand how a software engineer without context or prior knowledge would perform on SWE-bench.

Thanks for the comment. It's a good point that we don't have data for at the moment, but we hope to have it soon!

paul-gauthier commented 4 months ago

I would suggest randomly sampling N of the tasks. Manually inspect the hidden test cases as compared to the issue description. Flag the issue if the tests require things not specified in the issue.

You should be able to estimate a reasonable bound after only N=~20 or so such inspections.

I did this myself informally, and found a large ratio of "unsolvable" tasks.
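For anyone who wants to repeat this spot check, here is a minimal sketch. It assumes the task fields exposed in the public Hugging Face copy of the dataset (e.g. `problem_statement` and `test_patch`); it is not an official SWE-bench tool, so adjust names to whatever copy of the data you have.

```python
# Sketch of the suggested spot check: sample ~20 tasks, read the issue text
# next to the hidden test patch, and hand-label whether the tests demand
# behavior the issue never specifies. Field names follow the public
# Hugging Face dataset and are assumptions, not an official SWE-bench tool.
import random

from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench", split="test")
indices = random.Random(0).sample(range(len(ds)), 20)

flagged = 0
for i in indices:
    task = ds[i]
    print("=" * 80)
    print(task["instance_id"])
    print("--- issue description ---")
    print(task["problem_statement"][:2000])
    print("--- hidden test patch ---")
    print(task["test_patch"][:2000])
    # Manual judgment call, recorded by hand:
    if input("Tests require unstated behavior? [y/N] ").strip().lower() == "y":
        flagged += 1

print(f"{flagged}/{len(indices)} sampled tasks flagged as under-specified")
```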

rawwerks commented 3 months ago

i see this as a foundational issue in the dataset. i sincerely hope the authors will take the responsibility to "do the right thing".

clearly there is a judgement call about how to define "unsolvable" and "under-specified".

the princeton nlp team has catalyzed a true movement! my fear is that in a post-devin world, we now have thousands of devs and hundreds of millions of dollars chasing 100% on a broken benchmark. it is a "mission impossible".

i have tremendous empathy for the resource limitations. clearly it doesn't make sense to redact the existing benchmark. i'm sure the next one will be better.

to be clear: my criticism is not about the presence of unsolvable tasks, but about the transparency and visibility of the limitations of this benchmark. it needs to come from the source. the authors have the responsibility to make the users aware of this.

a specific request: could we leave this issue marked open for visibility? the vast majority of the "open devin" community is unaware of this flaw in swe-bench, and it doesn't feel right to watch thousands of eager and smart people bang their heads against a wall.

(go tigers)

carlosejimenez commented 3 months ago

Hi @rawwerks, we're aware of this issue. We'll be updating this repository and future evaluations shortly with a solution that I think will be satisfying for everyone.

In the meantime, I believe that SWE-bench Lite is currently where most evaluations are being done, and it should not suffer from as many issues related to human solvability.

In our experiments, the best models can achieve rather high scores when given multiple tries, and I believe that the practical upper bound is much higher than the current SOTA for Lite.

I'll reopen this issue for transparency but we expect to resolve it very shortly.

moresearch commented 3 months ago

What range is meant by "multiple tries"?

psykhi commented 3 months ago

Regarding the question about the range of "multiple tries": they get 32% pass@6 (from the paper).
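(For concreteness, "resolved given multiple tries" is usually scored as the fraction of instances solved by at least one of k attempts. A minimal sketch of that bookkeeping follows; the per-run result format is hypothetical, not something from the SWE-bench harness.)

```python
# Sketch: score "resolved with multiple tries" as the union of per-run results.
# `runs` is a hypothetical list of k result sets, one per attempt, each a set
# of resolved instance_ids (e.g. parsed from per-run evaluation reports).
from typing import Sequence, Set


def resolved_at_k(runs: Sequence[Set[str]], total_instances: int) -> float:
    """Fraction of instances resolved by at least one of the k runs."""
    resolved_any = set().union(*runs)
    return len(resolved_any) / total_instances


# Toy example with made-up instance ids over a hypothetical 300-task split.
runs = [{"astropy-1", "django-2"}, {"django-2", "flask-3"}, {"flask-3"}]
print(resolved_at_k(runs, 300))  # 3 resolved instances / 300
```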

moresearch commented 3 months ago

Regarding "the best models can achieve rather high scores when given multiple tries, and I believe that the practical upper bound is much higher than the current SOTA for Lite": the paper itself says, "we find that 93.0% of resolved instances are submitted before exhausting their cost budget. For these reasons, we suspect that increasing the maximum budget or token limit are unlikely to lead to substantial increases in performance."

@carlosejimenez Could you please clarify whether there is an update on this? Thanks

john-b-yang commented 2 months ago

@moresearch what clarification are you requesting? I think the sentence is quite clear, but can you provide more context on what's unclear? Of the 12.47% of instances that were resolved, 93% were submitted, while 7% were auto-submitted due to exceeding the cost limit.

Also, just commenting here to indicate that we still have this issue in mind and are actively working on it. Human validation takes some time, so we anticipate a full report should be out sometime in late June or early July.

zhlmmc commented 1 month ago

Agreed with @paul-gauthier's suggestion above to randomly sample tasks and compare the hidden test cases against the issue descriptions. After inspecting the datasets and the repos, I found that a lot of the patches in SWE-bench Lite are not achievable without additional input.

Domiii commented 1 month ago

The problem here seems to be with the dataset's inclusion criteria.

Existing Inclusion Criteria

The paper itself states very lax criteria. In Appendix A.1, it says:

Task instance construction. We construct candidate task instances from PRs that satisfy three conditions.

First, the PR’s status must be Merged. A Merged status indicates that the PR’s associated code changes were accepted and incorporated into its parent repository.

Second, the PR resolves one or more issues in its repository. An issue is defined according to its canonical usage in GitHub as a digital ticket for tracking bugs, enhancements, or any general development goals for a software project. We scan a PR’s title, body, and commit messages for linked issues (i.e. “fixes #24”).

Third, the PR must introduce one or more new tests. A new test is counted when a PR’s code changes edits a file path containing a testing-related keyword (e.g. “test”, “testing”).
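In code, these three conditions amount to roughly the following filter. This is only a sketch: the PR record layout and the issue-linking pattern are my assumptions, not the authors' actual collection pipeline.

```python
import re

# Rough sketch of the three stated conditions. The `pr` dict layout is an
# assumption (title/body/commit messages plus changed file paths), not the
# authors' actual collection code.
LINKED_ISSUE = re.compile(r"\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\s+#\d+", re.IGNORECASE)
TEST_KEYWORDS = ("test", "testing")


def is_candidate(pr: dict) -> bool:
    # 1. The PR must be merged.
    merged = bool(pr.get("merged"))
    # 2. The PR must reference at least one issue (e.g. "fixes #24")
    #    in its title, body, or commit messages.
    text = " ".join([pr.get("title") or "", pr.get("body") or ""] + pr.get("commit_messages", []))
    links_issue = bool(LINKED_ISSUE.search(text))
    # 3. The PR must edit at least one file path containing a test keyword.
    edits_tests = any(
        any(kw in path.lower() for kw in TEST_KEYWORDS)
        for path in pr.get("changed_files", [])
    )
    return merged and links_issue and edits_tests
```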

A Suggestion

An issue should only be deemed "suitable" if an expert outside contributor who fully understands (i) the codebase, (ii) the docs, and (iii) the issue can actually solve it.

Some supplementary inclusion/exclusion criteria:

  1. Does the solution depend on any communication that is not directly referenced, or on other information that is not publicly and directly available?
  2. Does the solution implement an unspecified design?

Of course, such a dataset is much harder to build and curate, but it certainly feels worth striving for. I'd be interested to know whether (and where, if publicly available) you are currently working on this. That being said, I can also understand if this approach is deemed too costly for you to take on at this point in time.

Domiii commented 3 weeks ago

This issue can be closed now.

SWE-bench Verified is the answer.

carlosejimenez commented 3 weeks ago

As mentioned by @Domiii, evaluation on SWE-bench Verified should resolve these concerns; the potential human upper bound there should be near 100%. Closing this issue for now.

john-b-yang commented 3 weeks ago

As a follow-up: if you have questions concerning SWE-bench Verified, please feel free to open a new issue and we can address them there, as this thread has grown somewhat long.