Closed Hodge931 closed 2 weeks ago
Hi @Hodge931, thanks for the question.
> To check whether my evaluation environments are reliable, should I ensure that the patch in the `patch` field of every example passes all test cases?
Yep! This is correct. We provide the reference solution (the `patch` field), so if you run SWE-bench evaluation where all of the "predictions" are just the `patch` field, you should get 100% passing.
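As a rough sketch, a "gold" predictions file can be built by copying each example's `patch` field into `model_patch`. The helper below and the sample record are illustrative (a real record would come from loading `princeton-nlp/SWE-bench_Lite_oracle` with the `datasets` library); the `instance_id`/`model_name_or_path`/`model_patch` keys follow the prediction format the evaluation harness expects:

```python
import json

def make_gold_predictions(examples):
    """Turn dataset records into 'predictions' that reuse the gold patch.

    Each prediction copies the reference solution from the `patch`
    field into `model_patch`, so evaluating these predictions should
    report 100% passing if the environment is set up correctly.
    """
    return [
        {
            "instance_id": ex["instance_id"],
            "model_name_or_path": "gold",
            "model_patch": ex["patch"],
        }
        for ex in examples
    ]

# Illustrative record with only the fields used above; real records
# carry many more fields.
sample = [{"instance_id": "astropy__astropy-12907", "patch": "diff --git ..."}]

preds = make_gold_predictions(sample)
with open("gold_predictions.json", "w") as f:
    json.dump(preds, f, indent=2)
```

Pointing the evaluation harness at the resulting predictions file should then reproduce the 100%-passing check described above.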
With that said, we have had some inconsistencies related to SWE-bench evaluation on different machines. We're going to be releasing a revamped SWE-bench evaluation harness with Docker containers incorporated into the logic, which should eliminate a lot of the inconsistencies that arise from running evaluation directly on different machines.
Closing this as completed. Please feel free to re-open this issue or create a new one for any follow up questions! Thanks again.
Describe the issue
Are the texts in the `patch` field guaranteed to be correct answers in princeton-nlp/SWE-bench_Lite_oracle? In other words, to check whether my evaluation environments are reliable, should I ensure that the patch in the `patch` field of every example passes all test cases, or should I just ensure the `model_patch` here passes all test cases? Thanks a lot!

Suggest an improvement to documentation
No response