princeton-nlp / SWE-bench

[ICLR 2024] SWE-Bench: Can Language Models Resolve Real-world Github Issues?
https://www.swebench.com
MIT License

It seems that the current evaluation does not handle the patch apply failure case? #154

Open Hodge931 opened 6 days ago

Hodge931 commented 6 days ago

Describe the issue

As titled. Thanks!

Suggest an improvement to documentation

No response

john-b-yang commented 2 days ago

It should be recorded in the run_instance.log file via the logic here. Is that what you're referring to?

The generated report no longer explicitly prints the number of instances where applying the patch failed (those instances are included in the count of instances that were not resolved). However, the number of failed patch applies should be recoverable by parsing the logs, i.e. checking whether the APPLY_PATCH_FAIL string shows up in each instance's log.
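
A minimal sketch of that log parsing, in case it helps. It is not part of the SWE-bench harness: the function name `find_apply_failures` and the `log_root` argument are hypothetical, and it assumes each instance's run_instance.log sits somewhere under a single log directory. The marker is a parameter because the exact text written to the log may differ from the APPLY_PATCH_FAIL identifier itself.

```python
from pathlib import Path


def find_apply_failures(log_root, marker="APPLY_PATCH_FAIL",
                        log_name="run_instance.log"):
    """Return the sorted paths of log files under log_root that contain marker.

    Assumptions (not confirmed by the thread): logs are plain text, one
    file named `log_name` per instance, nested anywhere below `log_root`.
    """
    failed = []
    for log_path in sorted(Path(log_root).rglob(log_name)):
        # errors="replace" keeps the scan going if a log has stray bytes.
        text = log_path.read_text(encoding="utf-8", errors="replace")
        if marker in text:
            failed.append(log_path)
    return failed
```

Calling `len(find_apply_failures("path/to/logs"))` would then give the number of instances whose patch failed to apply, and the parent directory of each returned path identifies the instance, assuming one directory per instance.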