princeton-nlp / SWE-bench

[ICLR 2024] SWE-Bench: Can Language Models Resolve Real-world Github Issues?
https://www.swebench.com
MIT License

what's the difference between environment_setup_commit and base_commit? #125

Closed: ramsey-coding closed this issue 1 month ago

ramsey-coding commented 1 month ago

Describe the issue

I see this:

    {
        "repo": "pytest-dev/pytest",
        "instance_id": "pytest-dev__pytest-5227",
        "base_commit": "2051e30b9b596e944524ccb787ed20f9f5be93e3",
        "patch": "diff --git a/src/_pytest/logging.py b/src/_pytest/logging.py\n--- a/src/_pytest/logging.py\n+++ b/src/_pytest/logging.py\n@@ -15,7 +15,7 @@\n from _pytest.config import create_terminal_writer\n from _pytest.pathlib import Path\n \n-DEFAULT_LOG_FORMAT = \"%(filename)-25s %(lineno)4d %(levelname)-8s %(message)s\"\n+DEFAULT_LOG_FORMAT = \"%(levelname)-8s %(name)s:%(filename)s:%(lineno)d %(message)s\"\n DEFAULT_LOG_DATE_FORMAT = \"%H:%M:%S\"\n \n \n",
        "test_patch": "diff --git a/testing/logging/test_reporting.py b/testing/logging/test_reporting.py\n--- a/testing/logging/test_reporting.py\n+++ b/testing/logging/test_reporting.py\n@@ -248,7 +248,7 @@ def test_log_cli():\n             [\n                 \"test_log_cli_enabled_disabled.py::test_log_cli \",\n                 \"*-- live log call --*\",\n-                \"test_log_cli_enabled_disabled.py* CRITICAL critical message logged by test\",\n+                \"CRITICAL *test_log_cli_enabled_disabled.py* critical message logged by test\",\n                 \"PASSED*\",\n             ]\n         )\n@@ -282,7 +282,7 @@ def test_log_cli(request):\n     result.stdout.fnmatch_lines(\n         [\n             \"test_log_cli_default_level.py::test_log_cli \",\n-            \"test_log_cli_default_level.py*WARNING message will be shown*\",\n+            \"WARNING*test_log_cli_default_level.py* message will be shown*\",\n         ]\n     )\n     assert \"INFO message won't be shown\" not in result.stdout.str()\n@@ -523,7 +523,7 @@ def test_log_1(fix):\n     )\n     assert (\n         re.search(\n-            r\"(.+)live log teardown(.+)\\n(.+)WARNING(.+)\\n(.+)WARNING(.+)\",\n+            r\"(.+)live log teardown(.+)\\nWARNING(.+)\\nWARNING(.+)\",\n             result.stdout.str(),\n             re.MULTILINE,\n         )\n@@ -531,7 +531,7 @@ def test_log_1(fix):\n     )\n     assert (\n         re.search(\n-            r\"(.+)live log finish(.+)\\n(.+)WARNING(.+)\\n(.+)WARNING(.+)\",\n+            r\"(.+)live log finish(.+)\\nWARNING(.+)\\nWARNING(.+)\",\n             result.stdout.str(),\n             re.MULTILINE,\n         )\n@@ -565,7 +565,7 @@ def test_log_cli(request):\n     # fnmatch_lines does an assertion internally\n     result.stdout.fnmatch_lines(\n         [\n-            \"test_log_cli_level.py*This log message will be shown\",\n+            \"*test_log_cli_level.py*This log message will be shown\",\n             \"PASSED\",  # 'PASSED' on its own 
line because the log message prints a new line\n         ]\n     )\n@@ -579,7 +579,7 @@ def test_log_cli(request):\n     # fnmatch_lines does an assertion internally\n     result.stdout.fnmatch_lines(\n         [\n-            \"test_log_cli_level.py* This log message will be shown\",\n+            \"*test_log_cli_level.py* This log message will be shown\",\n             \"PASSED\",  # 'PASSED' on its own line because the log message prints a new line\n         ]\n     )\n@@ -615,7 +615,7 @@ def test_log_cli(request):\n     # fnmatch_lines does an assertion internally\n     result.stdout.fnmatch_lines(\n         [\n-            \"test_log_cli_ini_level.py* This log message will be shown\",\n+            \"*test_log_cli_ini_level.py* This log message will be shown\",\n             \"PASSED\",  # 'PASSED' on its own line because the log message prints a new line\n         ]\n     )\n",
        "problem_statement": "Improve default logging format\nCurrently it is:\r\n\r\n> DEFAULT_LOG_FORMAT = \"%(filename)-25s %(lineno)4d %(levelname)-8s %(message)s\"\r\n\r\nI think `name` (module name) would be very useful here, instead of just the base filename.\r\n\r\n(It might also be good to have the relative path there (maybe at the end), but it is usually still very long (but e.g. `$VIRTUAL_ENV` could be substituted therein))\r\n\r\nCurrently it would look like this:\r\n```\r\nutils.py                   114 DEBUG    (0.000) SELECT \"app_url\".\"id\", \"app_url\".\"created\", \"app_url\".\"url\" FROM \"app_url\" WHERE \"app_url\".\"id\" = 2; args=(2,)\r\nmultipart.py               604 DEBUG    Calling on_field_start with no data\r\n```\r\n\r\n\r\nUsing `DEFAULT_LOG_FORMAT = \"%(levelname)-8s %(name)s:%(filename)s:%(lineno)d %(message)s\"` instead:\r\n\r\n```\r\nDEBUG    django.db.backends:utils.py:114 (0.000) SELECT \"app_url\".\"id\", \"app_url\".\"created\", \"app_url\".\"url\" FROM \"app_url\" WHERE \"app_url\".\"id\" = 2; args=(2,)\r\nDEBUG    multipart.multipart:multipart.py:604 Calling on_field_start with no data\r\n```\n",
        "hints_text": "",
        "created_at": "2019-05-07T20:27:24Z",
        "version": "4.4",
        "FAIL_TO_PASS": "[\"testing/logging/test_reporting.py::test_log_cli_enabled_disabled[True]\", \"testing/logging/test_reporting.py::test_log_cli_default_level\", \"testing/logging/test_reporting.py::test_sections_single_new_line_after_test_outcome\"]",
        "PASS_TO_PASS": "[\"[100%]\", \"[\", \"[100%]------------------------------\", \"testing/logging/test_reporting.py::test_live_logging_suspends_capture[True]\", \"testing/logging/test_reporting.py::test_live_logging_suspends_capture[False]\", \"testing/logging/test_reporting.py::test_nothing_logged\", \"testing/logging/test_reporting.py::test_messages_logged\", \"testing/logging/test_reporting.py::test_root_logger_affected\", \"testing/logging/test_reporting.py::test_log_cli_level_log_level_interaction\", \"testing/logging/test_reporting.py::test_setup_logging\", \"testing/logging/test_reporting.py::test_teardown_logging\", \"testing/logging/test_reporting.py::test_disable_log_capturing\", \"testing/logging/test_reporting.py::test_disable_log_capturing_ini\", \"testing/logging/test_reporting.py::test_log_cli_enabled_disabled[False]\", \"testing/logging/test_reporting.py::test_log_cli_default_level_multiple_tests\", \"testing/logging/test_reporting.py::test_log_cli_default_level_sections\", \"testing/logging/test_reporting.py::test_live_logs_unknown_sections\", \"testing/logging/test_reporting.py::test_log_cli_level\", \"testing/logging/test_reporting.py::test_log_cli_ini_level\", \"testing/logging/test_reporting.py::test_log_cli_auto_enable[]\", \"testing/logging/test_reporting.py::test_log_cli_auto_enable[--log-level=WARNING]\", \"testing/logging/test_reporting.py::test_log_cli_auto_enable[--log-file-level=WARNING]\", \"testing/logging/test_reporting.py::test_log_cli_auto_enable[--log-cli-level=WARNING]\", \"testing/logging/test_reporting.py::test_log_file_cli\", \"testing/logging/test_reporting.py::test_log_file_cli_level\", \"testing/logging/test_reporting.py::test_log_level_not_changed_by_default\", \"testing/logging/test_reporting.py::test_log_file_ini\", \"testing/logging/test_reporting.py::test_log_file_ini_level\", \"testing/logging/test_reporting.py::test_log_file_unicode\", \"testing/logging/test_reporting.py::test_collection_live_logging\", 
\"testing/logging/test_reporting.py::test_collection_logging_to_file\", \"testing/logging/test_reporting.py::test_log_in_hooks\", \"testing/logging/test_reporting.py::test_log_in_runtest_logreport\", \"testing/logging/test_reporting.py::test_log_set_path\"]",
        "environment_setup_commit": "4ccaa987d47566e3907f2f74167c4ab7997f622f"
    }

What's the difference between environment_setup_commit and base_commit?

PandelisZ commented 1 month ago

From my understanding, environment_setup_commit is the commit the testbeds check out to install deps, and base_commit is the commit the patch is applied to.
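A rough sketch of that two-phase order, using the two commit hashes from the instance above. This is not the harness's actual code: `plan_checkout_steps`, the `pip install -e .` install command, and the `patch.diff` file name are all placeholders chosen for illustration.

```python
from typing import Dict, List


def plan_checkout_steps(
    instance: Dict[str, str],
    install_cmd: str = "pip install -e .",  # assumed; the real command depends on the repo/version
) -> List[str]:
    """Return shell commands, in order, for the two-phase setup described above."""
    return [
        # Phase 1: install dependencies at the commit used for environment setup.
        f"git checkout {instance['environment_setup_commit']}",
        install_cmd,
        # Phase 2: switch to the commit the task is defined against, then apply the patch.
        f"git checkout {instance['base_commit']}",
        "git apply patch.diff",  # hypothetical file holding the patch to evaluate
    ]


instance = {
    "base_commit": "2051e30b9b596e944524ccb787ed20f9f5be93e3",
    "environment_setup_commit": "4ccaa987d47566e3907f2f74167c4ab7997f622f",
}

steps = plan_checkout_steps(instance)
for step in steps:
    print(step)
```

The function only builds the command list and does not run anything, so it is safe to try outside a repository checkout.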

klieret commented 1 month ago

(Not an expert on swe-bench, so take this with a grain of salt), but I think the idea with installing the deps was that you mostly default to the latest release. So environment_setup_commit would probably point to the latest release commit before the gold patch merge, and base_commit to the parent commit, on the main branch, of the gold patch merge commit.

nora-doe commented 3 weeks ago

@klieret this doesn't seem to be true?

there are many instances in the dataset where the environment_setup_commit is dated after the gold patch merge.

for example, pydicom__pydicom-897:

how is the environment_setup_commit determined for each task?

klieret commented 3 weeks ago

Interesting. But the only point of the environment_setup_commit is to make sure the package installs, so in principle it could be an arbitrary commit, as it is usually not related to the task itself. Since the idea was mostly to use releases for the setup commits, and any changes of the installation instructions happen at some point in between the releases, it seems reasonable that some tasks pinned the following release rather than the previous one.

But let me ping @john-b-yang @carlosejimenez who know for sure

huyouare commented 3 days ago

@klieret did you get an answer?

john-b-yang commented 2 days ago

Ah ok so the environment_setup_commit serves the exact purpose described by @PandelisZ and @klieret.

The question that seems to remain is how this commit was actually selected, as pointed out by @nora-doe. I'll preface this by saying this is a strategy that worked for us empirically, and there's some rationale behind it, but there certainly may be better strategies.

The environment_setup_commit corresponds to the base_commit of the latest (a.k.a. most recent) task instance from that repo/version combination.

So as an example, if there are 10 (<- just an example, not necessarily the actual number) instances that fall under astropy/astropy version 1.5, the environment_setup_commit corresponds to the base_commit of the most recent task instance. Empirically, we found that the last instance of a repo/version tends to be a good reference for the installation requirements of all instances from that repo/version.

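As a result of this, the commit referenced by environment_setup_commit is at least as recent as the base_commit of every other instance from that repo/version, which is why it can be dated after the gold patch merge of an earlier instance.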

Code to show this:

    from datasets import load_dataset

    swebench = load_dataset('princeton-nlp/SWE-bench', split='test')

    # Map each repo/version pair to the `created_at` of the instance whose
    # base_commit is that pair's environment_setup_commit
    map_rv_to_date = {}
    for inst in swebench:
        if inst['base_commit'] == inst['environment_setup_commit']:
            map_rv_to_date[(inst['repo'], inst['version'])] = inst['created_at']

    # Check that every instance's `created_at` is no later than the date recorded
    # for its repo/version (ISO 8601 timestamps compare correctly as strings)
    print(all(
        inst['created_at'] <= map_rv_to_date[(inst['repo'], inst['version'])]
        for inst in swebench
        if (inst['repo'], inst['version']) in map_rv_to_date
    ))

Running this should give True.
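
The selection rule itself (take the base_commit of the most recent instance in each repo/version group) can be sketched without downloading the dataset. The helper name `derive_setup_commits` and the toy records below, including the repo name and commit strings, are made up for illustration; only the field names come from the dataset.

```python
from typing import Dict, Iterable, Tuple


def derive_setup_commits(
    instances: Iterable[Dict[str, str]],
) -> Dict[Tuple[str, str], str]:
    """For each (repo, version), return the base_commit of the latest instance."""
    latest: Dict[Tuple[str, str], Dict[str, str]] = {}
    for inst in instances:
        key = (inst["repo"], inst["version"])
        # `created_at` values are ISO 8601 UTC strings, so string comparison
        # orders them chronologically.
        if key not in latest or inst["created_at"] > latest[key]["created_at"]:
            latest[key] = inst
    return {key: inst["base_commit"] for key, inst in latest.items()}


# Toy records (not real dataset entries).
toy_instances = [
    {"repo": "example/repo", "version": "1.0",
     "created_at": "2019-01-01T00:00:00Z", "base_commit": "aaa111"},
    {"repo": "example/repo", "version": "1.0",
     "created_at": "2019-03-01T00:00:00Z", "base_commit": "bbb222"},
    {"repo": "example/repo", "version": "1.1",
     "created_at": "2019-06-01T00:00:00Z", "base_commit": "ccc333"},
]

setup_commits = derive_setup_commits(toy_instances)
print(setup_commits)
```

Both version 1.0 instances would share `bbb222` as their environment_setup_commit under this rule, including the earlier one whose own base_commit is `aaa111`.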