princeton-nlp / SWE-bench

[ICLR 2024] SWE-Bench: Can Language Models Resolve Real-world Github Issues?
https://www.swebench.com
MIT License

modify metrics #103

Closed icoderzqliu closed 3 months ago

icoderzqliu commented 5 months ago

Reference Issues/PRs

What does this implement/fix? Explain your changes.

In the current evaluation logic, a prediction is judged as "no apply" whenever either pred_try or pred_minimal_try fails. This is inconsistent with the purpose of the evaluation: as long as one of the two applies successfully, the subsequent evaluation should continue. I changed the logic so that a prediction is treated as "no apply" only if neither pred_try nor pred_minimal_try applies successfully.
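
A minimal sketch of the difference, assuming the outcome of each apply attempt is available as a boolean. The function and parameter names below are hypothetical and are not the names used in the evaluation script:

```python
def is_no_apply_old(pred_try_ok: bool, pred_minimal_try_ok: bool) -> bool:
    """Old behaviour (as described above): a failure of either attempt counts as 'no apply'."""
    return (not pred_try_ok) or (not pred_minimal_try_ok)


def is_no_apply_new(pred_try_ok: bool, pred_minimal_try_ok: bool) -> bool:
    """Proposed behaviour: 'no apply' only when both attempts fail."""
    return (not pred_try_ok) and (not pred_minimal_try_ok)


if __name__ == "__main__":
    # Hypothetical case: the full patch fails to apply but the minimal patch applies.
    print(is_no_apply_old(False, True))
    print(is_no_apply_new(False, True))
```

The two versions disagree only when exactly one of the attempts succeeds, which is the case this change affects.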

Any other comments?

🧡 Thanks for contributing!

john-b-yang commented 3 months ago

@icoderzqliu thanks so much for this catch! Apologies that it took a while to get around to this. Just merged, and I will update any logs in SWE-bench/experiments that were generated with the older script.