princeton-nlp / SWE-bench

[ICLR 2024] SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
https://www.swebench.com
MIT License
1.8k stars, 312 forks

Distinguish Between Verified and Unverified Solutions #138

Closed (thisdotmatt closed this issue 3 months ago)

thisdotmatt commented 3 months ago

Hello,

I expect that many people, including myself, are beginning to look to the leaderboard website as a resource. I noticed that a number of the highest-scoring solutions on the leaderboard are unverified (and just as many are closed source). While I am encouraged that corporations and research groups alike are interested in this field, I wonder if this lack of distinction could lead to misinformation.

I believe the leaderboards should prioritize verified solutions, either by rank or by separating the two groups entirely. In addition, it would be helpful to outline the distinction on the front page of the website.

john-b-yang commented 3 months ago

Hi @thisdotmatt, thanks for the feedback. Yeah, this is something we're thinking about. I can definitely add a footnote to the website that clarifies what verified and open source mean.

Up to now, we've wanted to encourage submissions to the leaderboard. We're very happy with the activity, but it does potentially raise the question of whether the solutions are authentic.

So far, I'm fairly confident there haven't been any faked submissions. To back this up, here is a graphic showing, for each submission, which instances it resolved. If there were a submission that simply ablated gold patches randomly to simulate resolved instances, I'd expect that row to look pretty scattered in terms of its resolves, but none of the rows look like this.

https://x.com/jyangballin/status/1804212474121433375
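The thread does not say how the graphic was produced, so the following is only a rough sketch of the "scattered row" idea, not the maintainers' method. It assumes a binary submissions × instances matrix and a hypothetical `agreement_score` heuristic: a genuine submission should tend to resolve the instances that other submissions also resolve, while a randomly generated row should not. The submission names and resolve sets are made-up toy data.

```python
from statistics import mean

# Toy, made-up data: which instances each (hypothetical) submission resolved.
instance_ids = [f"i{k}" for k in range(10)]
resolved_by_submission = {
    "A": {"i0", "i1", "i2", "i3", "i4", "i5"},
    "B": {"i0", "i1", "i2", "i3", "i4"},
    "C": {"i0", "i1", "i2", "i3", "i5"},
    "D": {"i1", "i4", "i7", "i8", "i9"},  # deliberately scattered row
}


def resolve_matrix(resolved, ids):
    """Rows are submissions, columns are instances; 1 means resolved."""
    return {
        name: [1 if iid in hits else 0 for iid in ids]
        for name, hits in resolved.items()
    }


def agreement_score(name, matrix):
    """Mean solve rate among the *other* submissions on the instances this
    submission resolved, minus the same mean on the instances it did not.

    Clearly positive: the row follows the consensus on which instances are easy.
    Near zero or negative: the row looks scattered relative to the others.
    """
    row = matrix[name]
    others = [r for other, r in matrix.items() if other != name]
    rates = [mean(col) for col in zip(*others)]  # per-instance solve rate
    solved = [rate for hit, rate in zip(row, rates) if hit]
    unsolved = [rate for hit, rate in zip(row, rates) if not hit]
    if not solved or not unsolved:
        return 0.0  # score is undefined when a row is all ones or all zeros
    return mean(solved) - mean(unsolved)


matrix = resolve_matrix(resolved_by_submission, instance_ids)
for name in matrix:
    print(name, round(agreement_score(name, matrix), 3))
```

On this toy data the scattered row `D` scores below zero while `A`, `B`, and `C` score above it. A low score is at most a prompt to look more closely, not proof of a faked submission, since a genuinely different approach could also resolve an unusual set of instances.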

We're currently thinking about whether there need to be any changes to the leaderboard submission criteria going forward (e.g. if we get enough submissions, separating them by open source / verified). For now, we're going to keep our current practice, but there may be some updates within the next 2 months.

Thanks for raising this issue, it's an important one. If/when there are updates, I'll be sure to follow up here.

john-b-yang commented 3 months ago

Just added a couple of notes to the website that provide some clarification about what "verified" and "open" mean.

thisdotmatt commented 3 months ago

Thanks John, I appreciate the transparency and the update to the website. The graph you shared is very interesting - I look forward to seeing it filled up as the field develops.