princeton-nlp / SWE-bench

[ICLR 2024] SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
https://www.swebench.com
MIT License

Dockerization of run_evaluation.py #114

Closed — aorwall closed this issue 1 week ago

aorwall commented 2 months ago

Describe the feature

I've been working on building Docker images for all testbeds used in SWE-bench. This works quite well, although I still have 18 benchmark instances that fail when I verify against the gold patches in SWE-bench Lite. It could be interesting to collaborate on this, as it might be a more stable and performant solution than using conda environments alone.

Potential Solutions

Check out this repo, where I've pushed all the Dockerfiles, a simplified version of the TaskEnvContextManager that I use inside the Docker container, and some scripts to run it all: https://github.com/aorwall/SWE-bench-docker
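To illustrate the per-instance container idea: a minimal sketch of how one might assemble the `docker run` invocation for a single testbed image. All names here (the image tag, the mount paths, the `INSTANCE_ID` environment variable) are hypothetical placeholders, not the actual conventions of SWE-bench-docker:

```python
import shlex


def build_docker_run_cmd(instance_id: str, image: str,
                         predictions_path: str, log_dir: str) -> list[str]:
    """Construct a `docker run` command that evaluates one SWE-bench
    instance inside a prebuilt testbed image (illustrative sketch only)."""
    return [
        "docker", "run", "--rm",
        # Mount predictions read-only and the log directory read-write.
        "-v", f"{predictions_path}:/predictions.json:ro",
        "-v", f"{log_dir}:/logs",
        # Tell the container which instance to evaluate.
        "-e", f"INSTANCE_ID={instance_id}",
        image,
    ]


cmd = build_docker_run_cmd(
    "pydata__xarray-4094",
    "swe-bench-docker/pydata-xarray:0.12",
    "/tmp/predictions.json",
    "/tmp/logs",
)
print(" ".join(shlex.quote(c) for c in cmd))
```

The point of the pattern is that each instance's dependencies are baked into an image once, so evaluation runs only pay container startup cost instead of rebuilding a conda environment per run.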

aorwall commented 1 month ago

I'm down to 2 failing tests now in pydata/xarray 0.12. I probably need to compare to logs from a successful run to fix those effectively.

I'm also testing testbeds for the regular (full) dataset now, using the check harness predictions.

paul-gauthier commented 1 month ago

I'll chime in that @aorwall's Docker images and run_evaluation.py script have worked very well for me. I was able to run ~all the "lite" tests without problems. With the original conda testbeds, by contrast, most tests of the gold patches failed to build or pass.

Also, the Docker testbeds launch and execute very quickly compared to rebuilding the conda testbeds.

PandelisZ commented 1 month ago

"~all" the lite tests, meaning not quite all? I've been struggling to get much to run.

aorwall commented 1 month ago

I got all except for pydata__xarray-4094 and pydata__xarray-4493 to run.

paul-gauthier commented 1 month ago

@PandelisZ sorry, I should have been clearer. I got 298 out of 300 test cases to work out of the box with @aorwall's SWE-bench-docker tooling. The 2 that fail are known not to work, so that was expected.

I only got a few of the test cases to work with the original/official conda testbeds, after half a day of trying.