princeton-nlp / SWE-bench

[ICLR 2024] SWE-Bench: Can Language Models Resolve Real-world Github Issues?
https://www.swebench.com
MIT License
1.45k stars, 240 forks

Clarification Needed on Removal of Instances with Error Message Checks in SWE-bench Lite Dataset #128

Closed ramsey-coding closed 2 weeks ago

ramsey-coding commented 1 month ago

Describe the issue

In the SWE-bench lite dataset documentation, it is mentioned:

I do not understand what this statement refers to.

Every instance in the SWE-bench dataset consists of an issue and a test for validation.

Could you please clarify what is meant by removing instances that contain tests with error message checks?

Thank you for your assistance.

Suggest an improvement to documentation

No response

john-b-yang commented 2 weeks ago

Ah yes no problem, I'm assuming you're referring to one of the bullet points on the SWE-bench Lite info page.

This refers to task instances whose fail-to-pass (f2p) tests verify that a particular error was raised with a specific message. A pseudocode example:

def unit_test_1(...):
    [Some code here]
    assert ValueError was thrown with the specific message "You provided the wrong value"

This implies that the code fix must introduce some case like:

def tested_function():
    [Some code here]
    raise ValueError("You provided the wrong value")

The reason we removed these is that it is very difficult for a model to get the exact error message string correct. A human task worker, however, is certainly still capable of inferring the format of the message from the conventions of the codebase. These instances make up a small portion of the full SWE-bench test set. You can see the logic for identifying and filtering out such instances here.
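
To give a rough idea of what such a filter could look like, here is a hypothetical sketch. It is not the actual SWE-bench logic: the regex patterns, the function names, and the assumption that each instance is a dict carrying its test code under a "test_patch" key are all illustrative.

```python
import re

# Hypothetical patterns for common Python idioms that check an error message.
# This list is an illustrative assumption, not the one used by SWE-bench.
ERROR_MESSAGE_CHECK_PATTERNS = [
    re.compile(r"pytest\.raises\([^)]*\bmatch\s*="),  # pytest.raises(..., match="...")
    re.compile(r"\bassertRaisesRegex\s*\("),           # unittest
    re.compile(r"\bassertRaisesMessage\s*\("),         # Django test helper
]


def has_error_message_check(test_source):
    """Return True if the test source appears to assert on an error message."""
    return any(p.search(test_source) for p in ERROR_MESSAGE_CHECK_PATTERNS)


def filter_instances(instances):
    """Keep only instances whose test code has no error message check."""
    return [
        inst for inst in instances
        if not has_error_message_check(inst["test_patch"])
    ]
```

A heuristic like this would keep a test that only checks the exception type (for example a bare pytest.raises(ValueError)) and drop one that also matches on the message text. It can miss other styles of message check, so treat it as a starting point only.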

Hope this helps! Thanks for the great question.

john-b-yang commented 2 weeks ago

Marking this as complete, but please feel free to continue the thread with any follow up questions.