princeton-nlp / SWE-bench

[ICLR 2024] SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
https://www.swebench.com
MIT License

Dataset field & set up reliable environment #99

Closed: Hodge931 closed this issue 2 weeks ago

Hodge931 commented 2 months ago

Describe the issue

Are the texts in the patch field guaranteed to be correct answers in princeton-nlp/SWE-bench_Lite_oracle? In other words, to check if my evaluation environments are reliable, should I ensure that the patch in the field patch of every example passes all test cases? Or should I just ensure that the model_patch passes all test cases? Thanks a lot!

Suggest an improvement to documentation

No response

john-b-yang commented 2 weeks ago

Hi @Hodge931, thanks for the question.

to check if my evaluation environments are reliable, should I ensure that the patch in the field patch of every example passes all test cases?

Yep! This is correct. We provide the reference solution in the patch field, so if you run SWE-bench evaluation using each instance's patch as its "prediction", you should get 100% of instances passing.
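For a concrete sanity check, something along these lines should work (a minimal sketch; the output filename is arbitrary, and model_name_or_path can be any label you like):

```python
# Sketch: build a "gold" predictions file by copying each instance's reference
# patch into the prediction slot, then run the standard evaluation on it.
import json

from datasets import load_dataset

# Oracle split of SWE-bench Lite; each example carries the reference patch.
dataset = load_dataset("princeton-nlp/SWE-bench_Lite_oracle", split="test")

predictions = [
    {
        "instance_id": example["instance_id"],
        "model_patch": example["patch"],  # reference solution used as the "prediction"
        "model_name_or_path": "gold",
    }
    for example in dataset
]

with open("gold_predictions.json", "w") as f:
    json.dump(predictions, f)
```

You can then pass gold_predictions.json as the predictions path to the evaluation script; if your environments are set up correctly, every instance should be reported as resolved.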

With that said, we have seen some inconsistencies in SWE-bench evaluation results across different machines. We're going to release a revamped SWE-bench evaluation harness that runs evaluation inside Docker containers, which should eliminate most of the inconsistencies that arise from running evaluation directly on the host machine.

Closing this as completed. Please feel free to re-open this issue or create a new one for any follow-up questions! Thanks again.