princeton-nlp / SWE-bench

[ICLR 2024] SWE-Bench: Can Language Models Resolve Real-world Github Issues?
https://www.swebench.com
MIT License

Skipped test cases #139

Closed: Hodge931 closed this issue 1 week ago

Hodge931 commented 1 week ago

Describe the issue

Regarding this line: https://github.com/princeton-nlp/SWE-bench/blob/main/swebench/metrics/getters.py#L122

Does this mean that if, during the evaluation of an instance, a skipped test case does not fall into the FAIL_TO_PASS or PASS_TO_PASS category, the instance is considered not resolved?

Suggest an improvement to documentation

No response

john-b-yang commented 1 week ago

@Hodge931 That is correct. If a test doesn't show up, we assume its outcome is a fail.
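
To make the rule above concrete, here is a minimal sketch of the "missing means fail" behaviour. It is not the actual code in `swebench/metrics/getters.py`: the names `outcome`, `is_resolved`, and `status_map`, and the `"PASSED"`/`"FAILED"` strings, are hypothetical and chosen only for illustration. The sketch assumes the evaluation log has already been parsed into a mapping from test name to status.

```python
from typing import Dict, Iterable

# Hypothetical status labels, used only in this sketch.
PASSED = "PASSED"
FAILED = "FAILED"


def outcome(test: str, status_map: Dict[str, str]) -> str:
    """Return the recorded status of a test.

    A test that does not appear in the parsed log (for example,
    because it was skipped) is assumed to have failed.
    """
    return status_map.get(test, FAILED)


def is_resolved(
    status_map: Dict[str, str],
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
) -> bool:
    """An instance counts as resolved only if every expected test passed."""
    expected = list(fail_to_pass) + list(pass_to_pass)
    return all(outcome(test, status_map) == PASSED for test in expected)


# Usage: "test_c" is expected but never shows up in the log.
status_map = {"test_a": PASSED, "test_b": PASSED}
all_present = is_resolved(status_map, ["test_a"], ["test_b"])
one_missing = is_resolved(status_map, ["test_a"], ["test_b", "test_c"])
print(all_present, one_missing)
```

In this sketch, a single expected test that is absent from the log is enough to mark the instance as not resolved, which matches the behaviour described in the reply.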