Containerize SWE-bench evaluation

This PR revamps the swebench.harness module so that evaluation runs in fully containerized instances. See this report for further details.
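
As a rough illustration of what evaluating one instance in its own container could look like, here is a minimal sketch. It is not the harness's actual code: the function name `build_eval_command`, the image tag, the container naming scheme, and the `/eval.sh` script path are all hypothetical, and the only real interface assumed is the standard `docker run` CLI with its `--rm` and `--name` flags.

```python
import shlex


def build_eval_command(image, instance_id, eval_script="/eval.sh"):
    """Build a `docker run` command for evaluating one instance.

    Hypothetical helper: the image is assumed to already contain the
    repository, its dependencies, and an evaluation script at `eval_script`.
    """
    return [
        "docker", "run",
        "--rm",                            # remove the container once it exits
        "--name", f"eval-{instance_id}",   # one container per instance
        image,
        "bash", eval_script,               # run the evaluation inside the container
    ]


# Hypothetical image tag and instance id, for illustration only.
cmd = build_eval_command("example/eval-image:latest", "demo-instance-1")
print(shlex.join(cmd))
```

The command is built as a list so it can be passed to `subprocess.run` without shell quoting issues; actually running it requires a local Docker daemon and an image that exists.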

Reference Issues/PRs

This change should address the following issues: closes #114, closes #113, closes #141, closes #77, closes #104, closes #136

Any other comments?

Thanks to OpenAI's preparedness team (including Oliver Jaffe, Chan Jun Shern, James Aung, Giulio Starace, Dane Sherburn, and Neil Chowdhury) for initiating this.

Thanks also to @aorwall for their suggestion regarding this, and to Cognition Labs for providing some inspiration for these changes.

princeton-nlp / SWE-bench

Containerize SWE-bench evaluation #142
