This revamps the swebench.harness module to use fully containerized instances for evaluation.
See this report for further details.
Reference Issues/PRs
This change should address the following issues:
closes #114, closes #113, closes #141, closes #77, closes #104, closes #136
Any other comments?
Thanks to OpenAI's preparedness team (including Oliver Jaffe, Chan Jun Shern, James Aung, Giulio Starace, Dane Sherburn, and Neil Chowdhury) for initiating this.
Thanks to @aorwall for their suggestion regarding this.
And thanks to Cognition Labs for providing some inspiration for these changes too.
This revamps the swebench.harness module to use fully containerized instances for evaluation. See this report for further details.
Reference Issues/PRs
This change should address the following issues: closes #114, closes #113, closes #141, closes #77, closes #104, closes #136
Any other comments?
Thanks to OpenAI's preparedness team (including Oliver Jaffe, Chan Jun Shern, James Aung, Giulio Starace, Dane Sherburn, and Neil Chowdhury) for initiating this.
Thanks to @aorwall for their suggestion regarding this. And thanks to Cognition Labs for providing some inspiration for these changes too.