princeton-nlp / SWE-bench

[ICLR 2024] SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
https://www.swebench.com
MIT License

Containerize SWE-bench evaluation #142

Closed by carlosejimenez 1 week ago

carlosejimenez commented 1 week ago

This PR revamps the swebench.harness module so that evaluation runs in fully containerized instances. See this report for further details.

Reference Issues/PRs

This change should address the following issues: closes #114, closes #113, closes #141, closes #77, closes #104, closes #136

Any other comments?

Thanks to OpenAI's preparedness team (including Oliver Jaffe, Chan Jun Shern, James Aung, Giulio Starace, Dane Sherburn, and Neil Chowdhury) for initiating this.

Thanks to @aorwall for their suggestion regarding this, and to Cognition Labs for providing some inspiration for these changes.

john-b-yang commented 1 week ago

LGTM! 💯 Thanks again to all the contributors 🙏🏼