princeton-nlp / SWE-bench

[ICLR 2024] SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
https://www.swebench.com
MIT License

Dataset field & set up reliable environment #99

Closed: Hodge931 closed this issue 2 weeks ago

Hodge931 commented 2 months ago

Describe the issue

Are the texts in the patch field guaranteed to be correct answers in princeton-nlp/SWE-bench_Lite_oracle? In other words, to check if my evaluation environments are reliable, should I ensure that the patch in the field patch of every example passes all test cases? Or should I just ensure that the model_patch passes all test cases? Thanks a lot!

Suggest an improvement to documentation

No response

john-b-yang commented 2 weeks ago

Hi @Hodge931, thanks for the question.

to check if my evaluation environments are reliable, should I ensure that the patch in the field patch of every example passes all test cases?

Yep! This is correct. We provide the reference solution in the patch field, so if you run SWE-bench evaluation using each instance's patch as its "prediction", you should get 100% of instances passing.
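For a concrete sanity check, something along these lines should work (a minimal sketch; the output filename is arbitrary, and model_name_or_path can be any label you like):

```python
# Sketch: build a "gold" predictions file by copying each instance's reference
# patch into the prediction slot, then run the standard evaluation on it.
import json

from datasets import load_dataset

# Oracle split of SWE-bench Lite; each example carries the reference patch.
dataset = load_dataset("princeton-nlp/SWE-bench_Lite_oracle", split="test")

predictions = [
    {
        "instance_id": example["instance_id"],
        "model_patch": example["patch"],  # reference solution used as the "prediction"
        "model_name_or_path": "gold",
    }
    for example in dataset
]

with open("gold_predictions.json", "w") as f:
    json.dump(predictions, f)
```

You can then pass gold_predictions.json as the predictions path to the evaluation script; if your environments are set up correctly, every instance should be reported as resolved.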

With that said, we have seen some inconsistencies in SWE-bench evaluation results across different machines. We're going to release a revamped SWE-bench evaluation harness that runs evaluation inside Docker containers, which should eliminate most of the inconsistencies that arise from running evaluation directly on the host machine.

Closing this as completed. Please feel free to re-open this issue or create a new one for any follow-up questions! Thanks again.