Automatic benchmarking of gpt-engineer with swe-bench

gpt-engineer-org / gpt-engineer

Platform to experiment with the AI Software Engineer. Terminal based. NOTE: Very different from https://gptengineer.app

MIT License

52.02k stars, 6.77k forks

Automatic benchmarking of gpt-engineer with swe-bench #913

Open AntonOsika opened 9 months ago

AntonOsika commented 9 months ago

Feature description

We have a way to easily add benchmarks:

https://www.loom.com/share/206805143fbb4302b5455a5329eaab17?sid=f689608f-8e49-44f7-b55f-4c81e9dc93e6

This issue is about evaluating whether swe-bench is a good benchmark to add and, if so, adding a simple version of it.

ErikBjare commented 6 months ago

I'm tempted to prioritize this higher after the Devin announcement (just as @batwood001 did in #1062).

viborc commented 6 months ago

Makes sense. Let's figure it out this Thursday at our tech planning meeting, depending on people's availability.

Mohit-Dhawan98 commented 5 months ago

@viborc can you assign this to me?

viborc commented 5 months ago

> @viborc can you assign this to me?

Done!

viborc commented 4 months ago

This is mainly a general update for the community. Work on this issue is ongoing: @Mohit-Dhawan98 is working on it with @ATheorell's support. We'll likely have SWE-bench support in the near future!

viborc commented 2 months ago

Someone from OpenDevin suggested we look into their work here, learn from it, and possibly reuse parts of it if needed. Putting this here for our reference: https://github.com/OpenDevin/OpenDevin/tree/main/evaluation/swe_bench