princeton-nlp / SWE-bench

[ICLR 2024] SWE-Bench: Can Language Models Resolve Real-world Github Issues?
https://www.swebench.com
MIT License

Confused about the usage of fields `test_patch`, `PASS_TO_PASS` and `FAIL_TO_PASS` #174

Closed DavdGao closed 2 months ago

DavdGao commented 2 months ago

Describe the issue

  1. What is test_patch used for? Is it applied during evaluation, before the unit tests are executed?
  2. Can test_patch, PASS_TO_PASS and FAIL_TO_PASS be used when generating the solution? That is, are the failing unit tests provided as input?
DavdGao commented 2 months ago

The reason I ask these questions is that, without the information in the failing unit tests, a single issue may be resolved by several different modifications.

Take the following problem statement (instance_id=django__django-15202) as an example:

URLField throws ValueError instead of ValidationError on clean
Description

forms.URLField( ).clean('////]@N.AN')
results in:
    ValueError: Invalid IPv6 URL
    Traceback (most recent call last):
     File "basic_fuzzer.py", line 22, in TestOneInput
     File "fuzzers.py", line 350, in test_forms_URLField
     File "django/forms/fields.py", line 151, in clean
     File "django/forms/fields.py", line 136, in run_validators
     File "django/core/validators.py", line 130, in __call__
     File "urllib/parse.py", line 440, in urlsplit

Could the following modification be applied at either line 130 in django/core/validators.py or line 136 in django/forms/fields.py?

try:
    ...  # the existing call that raises ValueError
except ValueError:
    raise ValidationError("Invalid IPv6 URL")
john-b-yang commented 2 months ago

@DavdGao the SWE-bench paper answers all the questions you asked.

  1. test_patch contains the modifications to tests introduced by the original PR, usually new tests or updates to existing ones.
  2. No, they cannot be used in the process to generate the solution. Only the codebase at the base_commit and the issue description can be used.
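
To make the split between inputs and held-out fields concrete, here is a minimal sketch. The field names (`repo`, `base_commit`, `problem_statement`, `test_patch`, `FAIL_TO_PASS`, `PASS_TO_PASS`) are those of the dataset; the helpers `model_inputs` and `is_resolved`, the `test_results` mapping, and the test names in the usage example are hypothetical and are not the harness's actual code. It assumes the test lists have already been decoded into Python lists of test names.

```python
from typing import Dict, List

# Fields a solver is allowed to see. test_patch, FAIL_TO_PASS and
# PASS_TO_PASS are held out and only used when scoring.
VISIBLE_FIELDS = ("repo", "base_commit", "problem_statement")


def model_inputs(instance: Dict[str, object]) -> Dict[str, object]:
    """Return only the fields that may be used to generate a patch."""
    return {key: instance[key] for key in VISIBLE_FIELDS}


def is_resolved(
    test_results: Dict[str, bool],
    fail_to_pass: List[str],
    pass_to_pass: List[str],
) -> bool:
    """An instance counts as resolved when every listed test passes.

    test_results maps a test name to True (passed) or False (failed),
    collected after applying the candidate patch and test_patch.
    A test missing from the results is treated as a failure.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(test_results.get(name, False) for name in required)


# Usage with made-up test names and a placeholder commit.
instance = {
    "repo": "django/django",
    "base_commit": "<commit sha>",
    "problem_statement": "URLField throws ValueError instead of ValidationError on clean",
    "test_patch": "<diff touching the test files>",
    "FAIL_TO_PASS": ["test_new_behaviour"],
    "PASS_TO_PASS": ["test_existing_behaviour"],
}
visible = model_inputs(instance)
results = {"test_new_behaviour": True, "test_existing_behaviour": True}
resolved = is_resolved(results, instance["FAIL_TO_PASS"], instance["PASS_TO_PASS"])
```

In this sketch, FAIL_TO_PASS stands for the tests that fail before the fix and should pass after it, and PASS_TO_PASS for the tests that must keep passing.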

For the situation you presented, yes, both of those could be valid solutions. The model-generated patch does not have to match the reference solution exactly; it is possible to write a novel, distinct solution that still resolves the issue.
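
As a toy illustration of why two different fixes can both count, here is a self-contained sketch. It does not use Django: `ValidationError`, `validate_url`, `clean_fix_a`, `clean_fix_b` and `passes_hidden_test` are hypothetical stand-ins, and the input `"http://[invalid"` is chosen only because `urllib.parse.urlsplit` rejects it with a ValueError.

```python
from urllib.parse import urlsplit


class ValidationError(Exception):
    """Stand-in for django.core.exceptions.ValidationError."""


def validate_url(value):
    # Unpatched validator: a ValueError from urlsplit escapes to the caller.
    urlsplit(value)


def validate_url_fixed(value):
    # Fix A: translate the error inside the validator itself.
    try:
        urlsplit(value)
    except ValueError:
        raise ValidationError("Invalid IPv6 URL")


def clean_fix_a(value):
    validate_url_fixed(value)
    return value


def clean_fix_b(value):
    # Fix B: leave the validator alone and translate the error in the caller.
    try:
        validate_url(value)
    except ValueError:
        raise ValidationError("Invalid IPv6 URL")
    return value


def clean_unpatched(value):
    validate_url(value)
    return value


def passes_hidden_test(clean):
    # Plays the role of a FAIL_TO_PASS test: malformed input must raise
    # ValidationError, not ValueError.
    try:
        clean("http://[invalid")
    except ValidationError:
        return True
    except ValueError:
        return False
    return False
```

Both `clean_fix_a` and `clean_fix_b` satisfy `passes_hidden_test`, while `clean_unpatched` does not, even though the two fixes change different functions.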