princeton-nlp / SWE-bench

[ICLR 2024] SWE-Bench: Can Language Models Resolve Real-world Github Issues?
https://www.swebench.com
MIT License

Issue with Django gold patch results not evaluating to "Resolved" when logs say "All tests passed" #80

Closed brombaut closed 3 months ago

brombaut commented 6 months ago

When I run the evaluation script against the gold patch of a Django task instance, the resulting log ends with "All Tests Passed", but when I hand that log to the reporting system (i.e., get_model_report, which under the hood uses the Django log parser), it reports the issue as not resolved. I am reproducing this with the task instance ID django__django-11039 and the provided gold patch, and I have included the files needed to reproduce it.
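To make the distinction concrete: the final "All Tests Passed" line reflects the overall result of the test run, whereas, as far as I understand it, the report decides "resolved" per test name. Here is a rough sketch of that kind of check. `is_resolved`, the `"PASSED"` status string, and the test names are hypothetical and only for illustration; this is not the actual `get_model_report` code.

```python
def is_resolved(status_map, fail_to_pass, pass_to_pass):
    """Hypothetical check: every expected test must be recorded as passed."""
    required = list(fail_to_pass) + list(pass_to_pass)
    # A name that is missing from status_map counts as not passed.
    return all(status_map.get(name) == "PASSED" for name in required)


# Made-up status maps keyed by test name, purely illustrative.
expected = ["test_a (pkg.Tests)"]
print(is_resolved({"test_a (pkg.Tests)": "PASSED"}, expected, []))
print(is_resolved({"test_b (pkg.Tests)": "PASSED"}, expected, []))
```

Under this assumption, a run in which every test passes can still be reported as unresolved if the parsed names do not line up with the names in FAIL_TO_PASS and PASS_TO_PASS.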

Tasks For Evaluation file (tasks_for_evaluation.json)
```json
[
  {
    "instance_id": "django__django-11039",
    "base_commit": "d5276398046ce4a102776a1e67dcac2884d80dfe",
    "version": "3.0",
    "repo": "django/django",
    "test_patch": "diff --git a/tests/migrations/test_commands.py b/tests/migrations/test_commands.py\n--- a/tests/migrations/test_commands.py\n+++ b/tests/migrations/test_commands.py\n@@ -536,7 +536,13 @@ def test_sqlmigrate_forwards(self):\n index_op_desc_unique_together = output.find('-- alter unique_together')\n index_tx_end = output.find(connection.ops.end_transaction_sql().lower())\n \n- self.assertGreater(index_tx_start, -1, \"Transaction start not found\")\n+ if connection.features.can_rollback_ddl:\n+ self.assertGreater(index_tx_start, -1, \"Transaction start not found\")\n+ self.assertGreater(\n+ index_tx_end, index_op_desc_unique_together,\n+ \"Transaction end not found or found before operation description (unique_together)\"\n+ )\n+\n self.assertGreater(\n index_op_desc_author, index_tx_start,\n \"Operation description (author) not found or found before transaction start\"\n@@ -553,10 +559,6 @@ def test_sqlmigrate_forwards(self):\n index_op_desc_unique_together, index_op_desc_tribble,\n \"Operation description (unique_together) not found or found before operation description (tribble)\"\n )\n- self.assertGreater(\n- index_tx_end, index_op_desc_unique_together,\n- \"Transaction end not found or found before operation description (unique_together)\"\n- )\n \n @override_settings(MIGRATION_MODULES={\"migrations\": \"migrations.test_migrations\"})\n def test_sqlmigrate_backwards(self):\n@@ -577,7 +579,12 @@ def test_sqlmigrate_backwards(self):\n index_drop_table = output.rfind('drop table')\n index_tx_end = output.find(connection.ops.end_transaction_sql().lower())\n \n- self.assertGreater(index_tx_start, -1, \"Transaction start not found\")\n+ if connection.features.can_rollback_ddl:\n+ self.assertGreater(index_tx_start, -1, \"Transaction start not found\")\n+ self.assertGreater(\n+ index_tx_end, index_op_desc_unique_together,\n+ \"Transaction end not found or found before DROP TABLE\"\n+ )\n self.assertGreater(\n index_op_desc_unique_together, index_tx_start,\n \"Operation description (unique_together) not found or found before transaction start\"\n@@ -595,10 +602,6 @@ def test_sqlmigrate_backwards(self):\n index_drop_table, index_op_desc_author,\n \"DROP TABLE not found or found before operation description (author)\"\n )\n- self.assertGreater(\n- index_tx_end, index_op_desc_unique_together,\n- \"Transaction end not found or found before DROP TABLE\"\n- )\n \n # Cleanup by unmigrating everything\n call_command(\"migrate\", \"migrations\", \"zero\", verbosity=0)\n@@ -616,6 +619,22 @@ def test_sqlmigrate_for_non_atomic_migration(self):\n self.assertNotIn(connection.ops.start_transaction_sql().lower(), queries)\n self.assertNotIn(connection.ops.end_transaction_sql().lower(), queries)\n \n+ @override_settings(MIGRATION_MODULES={'migrations': 'migrations.test_migrations'})\n+ def test_sqlmigrate_for_non_transactional_databases(self):\n+ \"\"\"\n+ Transaction wrappers aren't shown for databases that don't support\n+ transactional DDL.\n+ \"\"\"\n+ out = io.StringIO()\n+ with mock.patch.object(connection.features, 'can_rollback_ddl', False):\n+ call_command('sqlmigrate', 'migrations', '0001', stdout=out)\n+ output = out.getvalue().lower()\n+ queries = [q.strip() for q in output.splitlines()]\n+ start_transaction_sql = connection.ops.start_transaction_sql()\n+ if start_transaction_sql:\n+ self.assertNotIn(start_transaction_sql.lower(), queries)\n+ self.assertNotIn(connection.ops.end_transaction_sql().lower(), queries)\n+\n @override_settings(\n INSTALLED_APPS=[\n \"migrations.migrations_test_apps.migrated_app\",\n",
    "created_at": "2019-03-01T10:24:38Z",
    "FAIL_TO_PASS": [
      "test_sqlmigrate_for_non_transactional_databases (migrations.test_commands.MigrateTests)"
    ],
    "PASS_TO_PASS": [
      "--squashed-name also works if a start migration is omitted.",
      "--squashed-name specifies the new migration's name.",
      "Migration directories without an __init__.py file are allowed.",
      "Tests migrate --plan output.",
      "test_ambigious_prefix (migrations.test_commands.MigrateTests)",
      "test_app_without_migrations (migrations.test_commands.MigrateTests)",
      "test_failing_migration (migrations.test_commands.MakeMigrationsTests)",
      "test_files_content (migrations.test_commands.MakeMigrationsTests)",
      "test_makemigration_merge_dry_run (migrations.test_commands.MakeMigrationsTests)",
      "test_makemigration_merge_dry_run_verbosity_3 (migrations.test_commands.MakeMigrationsTests)",
      "test_makemigrations_app_name_specified_as_label (migrations.test_commands.AppLabelErrorTests)",
      "test_makemigrations_auto_now_add_interactive (migrations.test_commands.MakeMigrationsTests)",
      "test_makemigrations_check (migrations.test_commands.MakeMigrationsTests)",
      "test_makemigrations_conflict_exit (migrations.test_commands.MakeMigrationsTests)",
      "test_makemigrations_consistency_checks_respect_routers (migrations.test_commands.MakeMigrationsTests)",
      "test_makemigrations_default_merge_name (migrations.test_commands.MakeMigrationsTests)",
      "test_makemigrations_disabled_migrations_for_app (migrations.test_commands.MakeMigrationsTests)",
      "test_makemigrations_dry_run (migrations.test_commands.MakeMigrationsTests)",
      "test_makemigrations_dry_run_verbosity_3 (migrations.test_commands.MakeMigrationsTests)",
      "test_makemigrations_empty_connections (migrations.test_commands.MakeMigrationsTests)",
      "test_makemigrations_empty_migration (migrations.test_commands.MakeMigrationsTests)",
      "test_makemigrations_empty_no_app_specified (migrations.test_commands.MakeMigrationsTests)",
      "test_makemigrations_handle_merge (migrations.test_commands.MakeMigrationsTests)",
      "test_makemigrations_inconsistent_history (migrations.test_commands.MakeMigrationsTests)",
      "test_makemigrations_interactive_accept (migrations.test_commands.MakeMigrationsTests)",
      "test_makemigrations_interactive_by_default (migrations.test_commands.MakeMigrationsTests)",
      "test_makemigrations_interactive_reject (migrations.test_commands.MakeMigrationsTests)",
      "test_makemigrations_merge_dont_output_dependency_operations (migrations.test_commands.MakeMigrationsTests)",
      "test_makemigrations_merge_no_conflict (migrations.test_commands.MakeMigrationsTests)",
      "test_makemigrations_migration_path_output (migrations.test_commands.MakeMigrationsTests)",
      "test_makemigrations_migration_path_output_valueerror (migrations.test_commands.MakeMigrationsTests)",
      "test_makemigrations_migrations_announce (migrations.test_commands.MakeMigrationsTests)",
      "test_makemigrations_migrations_modules_nonexistent_toplevel_package (migrations.test_commands.MakeMigrationsTests)",
      "test_makemigrations_migrations_modules_path_not_exist (migrations.test_commands.MakeMigrationsTests)",
      "test_makemigrations_no_apps_initial (migrations.test_commands.MakeMigrationsTests)",
      "test_makemigrations_no_changes (migrations.test_commands.MakeMigrationsTests)",
      "test_makemigrations_no_changes_no_apps (migrations.test_commands.MakeMigrationsTests)",
      "test_makemigrations_no_common_ancestor (migrations.test_commands.MakeMigrationsTests)",
      "test_makemigrations_non_interactive_no_field_rename (migrations.test_commands.MakeMigrationsTests)",
      "test_makemigrations_non_interactive_no_model_rename (migrations.test_commands.MakeMigrationsTests)",
      "test_makemigrations_non_interactive_not_null_addition (migrations.test_commands.MakeMigrationsTests)",
      "test_makemigrations_non_interactive_not_null_alteration (migrations.test_commands.MakeMigrationsTests)",
      "test_makemigrations_nonexistent_app_label (migrations.test_commands.AppLabelErrorTests)",
      "test_makemigrations_order (migrations.test_commands.MakeMigrationsTests)",
      "test_makemigrations_unspecified_app_with_conflict_merge (migrations.test_commands.MakeMigrationsTests)",
      "test_makemigrations_unspecified_app_with_conflict_no_merge (migrations.test_commands.MakeMigrationsTests)",
      "test_makemigrations_with_custom_name (migrations.test_commands.MakeMigrationsTests)",
      "test_makemigrations_with_invalid_custom_name (migrations.test_commands.MakeMigrationsTests)",
      "test_migrate (migrations.test_commands.MigrateTests)",
      "test_migrate_app_name_specified_as_label (migrations.test_commands.AppLabelErrorTests)",
      "test_migrate_conflict_exit (migrations.test_commands.MigrateTests)",
      "test_migrate_fake_initial (migrations.test_commands.MigrateTests)",
      "test_migrate_fake_split_initial (migrations.test_commands.MigrateTests)",
      "test_migrate_inconsistent_history (migrations.test_commands.MigrateTests)",
      "test_migrate_initial_false (migrations.test_commands.MigrateTests)",
      "test_migrate_nonexistent_app_label (migrations.test_commands.AppLabelErrorTests)",
      "test_migrate_record_replaced (migrations.test_commands.MigrateTests)",
      "test_migrate_record_squashed (migrations.test_commands.MigrateTests)",
      "test_migrate_syncdb_app_label (migrations.test_commands.MigrateTests)",
      "test_migrate_syncdb_app_with_migrations (migrations.test_commands.MigrateTests)",
      "test_migrate_syncdb_deferred_sql_executed_with_schemaeditor (migrations.test_commands.MigrateTests)",
      "test_migrate_with_system_checks (migrations.test_commands.MigrateTests)",
      "test_regression_22823_unmigrated_fk_to_migrated_model (migrations.test_commands.MigrateTests)",
      "test_showmigrations_app_name_specified_as_label (migrations.test_commands.AppLabelErrorTests)",
      "test_showmigrations_list (migrations.test_commands.MigrateTests)",
      "test_showmigrations_no_migrations (migrations.test_commands.MigrateTests)",
      "test_showmigrations_nonexistent_app_label (migrations.test_commands.AppLabelErrorTests)",
      "test_showmigrations_plan (migrations.test_commands.MigrateTests)",
      "test_showmigrations_plan_app_label_no_migrations (migrations.test_commands.MigrateTests)",
      "test_showmigrations_plan_multiple_app_labels (migrations.test_commands.MigrateTests)",
      "test_showmigrations_plan_no_migrations (migrations.test_commands.MigrateTests)",
      "test_showmigrations_plan_single_app_label (migrations.test_commands.MigrateTests)",
      "test_showmigrations_plan_squashed (migrations.test_commands.MigrateTests)",
      "test_showmigrations_unmigrated_app (migrations.test_commands.MigrateTests)",
      "test_sqlmigrate_app_name_specified_as_label (migrations.test_commands.AppLabelErrorTests)",
      "test_sqlmigrate_backwards (migrations.test_commands.MigrateTests)",
      "test_sqlmigrate_for_non_atomic_migration (migrations.test_commands.MigrateTests)",
      "test_sqlmigrate_forwards (migrations.test_commands.MigrateTests)",
      "test_sqlmigrate_nonexistent_app_label (migrations.test_commands.AppLabelErrorTests)",
      "test_squashmigrations_app_name_specified_as_label (migrations.test_commands.AppLabelErrorTests)",
      "test_squashmigrations_initial_attribute (migrations.test_commands.SquashMigrationsTests)",
      "test_squashmigrations_invalid_start (migrations.test_commands.SquashMigrationsTests)",
      "test_squashmigrations_nonexistent_app_label (migrations.test_commands.AppLabelErrorTests)",
      "test_squashmigrations_optimizes (migrations.test_commands.SquashMigrationsTests)",
      "test_squashmigrations_squashes (migrations.test_commands.SquashMigrationsTests)",
      "test_squashmigrations_valid_start (migrations.test_commands.SquashMigrationsTests)",
      "test_ticket_23799_squashmigrations_no_optimize (migrations.test_commands.SquashMigrationsTests)",
      "test_unknown_prefix (migrations.test_commands.MigrateTests)"
    ]
  }
]
```
Predictions For Evaluation file (predictions_for_evaluation.json)
```json
[
  {
    "instance_id": "django__django-11039",
    "model_patch": "diff --git a/django/core/management/commands/sqlmigrate.py b/django/core/management/commands/sqlmigrate.py\n--- a/django/core/management/commands/sqlmigrate.py\n+++ b/django/core/management/commands/sqlmigrate.py\n@@ -55,8 +55,9 @@ def handle(self, *args, **options):\n migration_name, app_label))\n targets = [(app_label, migration.name)]\n \n- # Show begin/end around output only for atomic migrations\n- self.output_transaction = migration.atomic\n+ # Show begin/end around output for atomic migrations, if the database\n+ # supports transactional DDL.\n+ self.output_transaction = migration.atomic and connection.features.can_rollback_ddl\n \n # Make a plan that represents just the requested migrations and show SQL\n # for it\n",
    "model_name_or_path": "gpt-4-0125-preview"
  }
]
```
Evaluation Logs (django__django-11039.gpt-4-0125-preview.eval.log)
```bash
Task Metadata:
- Instance ID: django__django-11039
- Testbed: /swebench_workspace/experiments/swebench_lite_to_humaneval/swebench_evaluations_full_file_generation_v1/evaluation_testbed/gpt-4-0125-preview/django__django/3.0/tmp03lbund9/django__django__3.0
- Virtual Env.: django__django__3.0
- Evaluation Model: gpt-4-0125-preview
>>>>> Applied Patch (pred_try)
>>>>> Applied Patch (pred_try)
Installation Command: . /swebench_workspace/experiments/swebench_lite_to_humaneval/swebench_evaluations_full_file_generation_v1/evaluation_testbed/gpt-4-0125-preview/django__django/3.0/tmpnvzr539l/miniconda3/bin/activate django__django__3.0 && echo 'activate successful' && python -m pip install -e .
Std. Output:
activate successful
Obtaining file:///swebench_workspace/experiments/swebench_lite_to_humaneval/swebench_evaluations_full_file_generation_v1/evaluation_testbed/gpt-4-0125-preview/django__django/3.0/tmp03lbund9/django__django__3.0
Preparing metadata (setup.py): started
Preparing metadata (setup.py): finished with status 'done'
Requirement already satisfied: pytz in /swebench_workspace/experiments/swebench_lite_to_humaneval/swebench_evaluations_full_file_generation_v1/evaluation_testbed/gpt-4-0125-preview/django__django/3.0/tmpnvzr539l/miniconda3/lib/python3.11/site-packages (from Django==3.0.dev20190307150218) (2024.1)
Requirement already satisfied: sqlparse in /swebench_workspace/experiments/swebench_lite_to_humaneval/swebench_evaluations_full_file_generation_v1/evaluation_testbed/gpt-4-0125-preview/django__django/3.0/tmpnvzr539l/miniconda3/lib/python3.11/site-packages (from Django==3.0.dev20190307150218) (0.4.4)
Installing collected packages: Django
Running setup.py develop for Django
Successfully installed Django-3.0.dev20190307150218
Std. Error:
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
>>>>> Init Succeeded
>>>>> Applied Patch (test)
>>>>> Applied Patch (pred)
Test Script: . /swebench_workspace/experiments/swebench_lite_to_humaneval/swebench_evaluations_full_file_generation_v1/evaluation_testbed/gpt-4-0125-preview/django__django/3.0/tmpnvzr539l/miniconda3/bin/activate django__django__3.0 && echo 'activate successful' && ./tests/runtests.py --verbosity 2 migrations.test_commands;
Output:
activate successful
Testing against Django installed in '/swebench_workspace/experiments/swebench_lite_to_humaneval/swebench_evaluations_full_file_generation_v1/evaluation_testbed/gpt-4-0125-preview/django__django/3.0/tmp03lbund9/django__django__3.0/django' with up to 64 processes
Importing application migrations
Operations to perform:
Synchronize unmigrated apps: auth, contenttypes, messages, migrations, sessions, staticfiles
Apply all migrations: admin, sites
Synchronizing apps without migrations:
Creating tables...
Creating table django_content_type
Creating table auth_permission
Creating table auth_group
Creating table auth_user
Creating table django_session
Creating table migrations_modelwithcustombase
Creating table migrations_unmigratedmodel
Running deferred SQL...
Running migrations:
Applying admin.0001_initial... OK
Applying admin.0002_logentry_remove_auto_add... OK
Applying admin.0003_logentry_add_action_flag_choices... OK
Applying sites.0001_initial... OK
Applying sites.0002_alter_domain_unique... OK
Operations to perform:
Synchronize unmigrated apps: auth, contenttypes, messages, migrations, sessions, staticfiles
Apply all migrations: admin, sites
Synchronizing apps without migrations:
Creating tables...
Creating table django_content_type
Creating table auth_permission
Creating table auth_group
Creating table auth_user
Creating table django_session
Creating table migrations_modelwithcustombase
Creating table migrations_unmigratedmodel
Running deferred SQL...
Running migrations:
Applying admin.0001_initial... OK
Applying admin.0002_logentry_remove_auto_add... OK
Applying admin.0003_logentry_add_action_flag_choices... OK
Applying sites.0001_initial... OK
Applying sites.0002_alter_domain_unique... OK
System check identified no issues (0 silenced).
Creating test database for alias 'default' ('file:memorydb_default?mode=memory&cache=shared')...
Cloning test database for alias 'default' ('file:memorydb_default?mode=memory&cache=shared')...
Cloning test database for alias 'default' ('file:memorydb_default?mode=memory&cache=shared')...
Cloning test database for alias 'default' ('file:memorydb_default?mode=memory&cache=shared')...
Cloning test database for alias 'default' ('file:memorydb_default?mode=memory&cache=shared')...
Creating test database for alias 'other' ('file:memorydb_other?mode=memory&cache=shared')...
Cloning test database for alias 'other' ('file:memorydb_other?mode=memory&cache=shared')...
Cloning test database for alias 'other' ('file:memorydb_other?mode=memory&cache=shared')...
Cloning test database for alias 'other' ('file:memorydb_other?mode=memory&cache=shared')...
Cloning test database for alias 'other' ('file:memorydb_other?mode=memory&cache=shared')...
test_makemigrations_app_name_specified_as_label (migrations.test_commands.AppLabelErrorTests.test_makemigrations_app_name_specified_as_label) ... ok
test_makemigrations_nonexistent_app_label (migrations.test_commands.AppLabelErrorTests.test_makemigrations_nonexistent_app_label) ... ok
test_migrate_app_name_specified_as_label (migrations.test_commands.AppLabelErrorTests.test_migrate_app_name_specified_as_label) ... ok
test_migrate_nonexistent_app_label (migrations.test_commands.AppLabelErrorTests.test_migrate_nonexistent_app_label) ... ok
test_showmigrations_app_name_specified_as_label (migrations.test_commands.AppLabelErrorTests.test_showmigrations_app_name_specified_as_label) ... ok
test_showmigrations_nonexistent_app_label (migrations.test_commands.AppLabelErrorTests.test_showmigrations_nonexistent_app_label) ... ok
test_sqlmigrate_app_name_specified_as_label (migrations.test_commands.AppLabelErrorTests.test_sqlmigrate_app_name_specified_as_label) ... ok
test_sqlmigrate_nonexistent_app_label (migrations.test_commands.AppLabelErrorTests.test_sqlmigrate_nonexistent_app_label) ... ok
test_squashmigrations_app_name_specified_as_label (migrations.test_commands.AppLabelErrorTests.test_squashmigrations_app_name_specified_as_label) ... ok
test_squashmigrations_nonexistent_app_label (migrations.test_commands.AppLabelErrorTests.test_squashmigrations_nonexistent_app_label) ... ok
test_squashed_name_with_start_migration_name (migrations.test_commands.SquashMigrationsTests.test_squashed_name_with_start_migration_name)
--squashed-name specifies the new migration's name. ... ok
test_squashed_name_without_start_migration_name (migrations.test_commands.SquashMigrationsTests.test_squashed_name_without_start_migration_name)
--squashed-name also works if a start migration is omitted. ... ok
test_squashmigrations_initial_attribute (migrations.test_commands.SquashMigrationsTests.test_squashmigrations_initial_attribute) ... ok
test_squashmigrations_invalid_start (migrations.test_commands.SquashMigrationsTests.test_squashmigrations_invalid_start)
squashmigrations doesn't accept a starting migration after the ending migration. ... ok
test_squashmigrations_optimizes (migrations.test_commands.SquashMigrationsTests.test_squashmigrations_optimizes)
squashmigrations optimizes operations. ... ok
test_squashmigrations_squashes (migrations.test_commands.SquashMigrationsTests.test_squashmigrations_squashes)
squashmigrations squashes migrations. ... ok
test_squashmigrations_valid_start (migrations.test_commands.SquashMigrationsTests.test_squashmigrations_valid_start)
squashmigrations accepts a starting migration. ... ok
test_ticket_23799_squashmigrations_no_optimize (migrations.test_commands.SquashMigrationsTests.test_ticket_23799_squashmigrations_no_optimize)
squashmigrations --no-optimize doesn't optimize operations. ... ok
test_failing_migration (migrations.test_commands.MakeMigrationsTests.test_failing_migration) ... ok
test_files_content (migrations.test_commands.MakeMigrationsTests.test_files_content) ... ok
test_makemigration_merge_dry_run (migrations.test_commands.MakeMigrationsTests.test_makemigration_merge_dry_run)
makemigrations respects --dry-run option when fixing migration ... ok
test_makemigration_merge_dry_run_verbosity_3 (migrations.test_commands.MakeMigrationsTests.test_makemigration_merge_dry_run_verbosity_3)
`makemigrations --merge --dry-run` writes the merge migration file to ... ok
test_makemigrations_auto_now_add_interactive (migrations.test_commands.MakeMigrationsTests.test_makemigrations_auto_now_add_interactive)
makemigrations prompts the user when adding auto_now_add to an existing ... ok
test_makemigrations_check (migrations.test_commands.MakeMigrationsTests.test_makemigrations_check)
makemigrations --check should exit with a non-zero status when ... ok
test_makemigrations_conflict_exit (migrations.test_commands.MakeMigrationsTests.test_makemigrations_conflict_exit)
makemigrations exits if it detects a conflict. ... ok
test_makemigrations_consistency_checks_respect_routers (migrations.test_commands.MakeMigrationsTests.test_makemigrations_consistency_checks_respect_routers)
The history consistency checks in makemigrations respect ... ok
test_makemigrations_default_merge_name (migrations.test_commands.MakeMigrationsTests.test_makemigrations_default_merge_name) ... ok
test_makemigrations_disabled_migrations_for_app (migrations.test_commands.MakeMigrationsTests.test_makemigrations_disabled_migrations_for_app)
makemigrations raises a nice error when migrations are disabled for an ... ok
test_makemigrations_dry_run (migrations.test_commands.MakeMigrationsTests.test_makemigrations_dry_run)
`makemigrations --dry-run` should not ask for defaults. ... ok
test_makemigrations_dry_run_verbosity_3 (migrations.test_commands.MakeMigrationsTests.test_makemigrations_dry_run_verbosity_3)
Allow `makemigrations --dry-run` to output the migrations file to ... ok
test_makemigrations_empty_connections (migrations.test_commands.MakeMigrationsTests.test_makemigrations_empty_connections) ... ok
test_makemigrations_empty_migration (migrations.test_commands.MakeMigrationsTests.test_makemigrations_empty_migration)
makemigrations properly constructs an empty migration. ... ok
test_makemigrations_empty_no_app_specified (migrations.test_commands.MakeMigrationsTests.test_makemigrations_empty_no_app_specified)
makemigrations exits if no app is specified with 'empty' mode. ... ok
test_makemigrations_handle_merge (migrations.test_commands.MakeMigrationsTests.test_makemigrations_handle_merge)
makemigrations properly merges the conflicting migrations with --noinput. ... ok
test_makemigrations_inconsistent_history (migrations.test_commands.MakeMigrationsTests.test_makemigrations_inconsistent_history)
makemigrations should raise InconsistentMigrationHistory exception if ... ok
test_makemigrations_interactive_accept (migrations.test_commands.MakeMigrationsTests.test_makemigrations_interactive_accept)
makemigrations enters interactive mode and merges properly. ... ok
test_makemigrations_interactive_by_default (migrations.test_commands.MakeMigrationsTests.test_makemigrations_interactive_by_default)
The user is prompted to merge by default if there are conflicts and ... ok
test_makemigrations_interactive_reject (migrations.test_commands.MakeMigrationsTests.test_makemigrations_interactive_reject)
makemigrations enters and exits interactive mode properly. ... ok
test_makemigrations_merge_dont_output_dependency_operations (migrations.test_commands.MakeMigrationsTests.test_makemigrations_merge_dont_output_dependency_operations)
makemigrations --merge does not output any operations from apps that ... ok
test_makemigrations_merge_no_conflict (migrations.test_commands.MakeMigrationsTests.test_makemigrations_merge_no_conflict)
makemigrations exits if in merge mode with no conflicts. ... ok
test_makemigrations_migration_path_output (migrations.test_commands.MakeMigrationsTests.test_makemigrations_migration_path_output)
makemigrations should print the relative paths to the migrations unless ... ok
test_makemigrations_migration_path_output_valueerror (migrations.test_commands.MakeMigrationsTests.test_makemigrations_migration_path_output_valueerror)
makemigrations prints the absolute path if os.path.relpath() raises a ... ok
test_makemigrations_migrations_announce (migrations.test_commands.MakeMigrationsTests.test_makemigrations_migrations_announce)
makemigrations announces the migration at the default verbosity level. ... ok
test_makemigrations_migrations_modules_nonexistent_toplevel_package (migrations.test_commands.MakeMigrationsTests.test_makemigrations_migrations_modules_nonexistent_toplevel_package) ... ok
test_makemigrations_migrations_modules_path_not_exist (migrations.test_commands.MakeMigrationsTests.test_makemigrations_migrations_modules_path_not_exist)
makemigrations creates migrations when specifying a custom location ... ok
test_makemigrations_no_apps_initial (migrations.test_commands.MakeMigrationsTests.test_makemigrations_no_apps_initial)
makemigrations should detect initial is needed on empty migration ... ok
test_makemigrations_no_changes (migrations.test_commands.MakeMigrationsTests.test_makemigrations_no_changes)
makemigrations exits when there are no changes to an app. ... ok
test_makemigrations_no_changes_no_apps (migrations.test_commands.MakeMigrationsTests.test_makemigrations_no_changes_no_apps)
makemigrations exits when there are no changes and no apps are specified. ... ok
test_makemigrations_no_common_ancestor (migrations.test_commands.MakeMigrationsTests.test_makemigrations_no_common_ancestor)
makemigrations fails to merge migrations with no common ancestor. ... ok
test_makemigrations_no_init (migrations.test_commands.MakeMigrationsTests.test_makemigrations_no_init)
Migration directories without an __init__.py file are allowed. ... ok
test_makemigrations_non_interactive_no_field_rename (migrations.test_commands.MakeMigrationsTests.test_makemigrations_non_interactive_no_field_rename)
makemigrations adds and removes a possible field rename in ... ok
test_makemigrations_non_interactive_no_model_rename (migrations.test_commands.MakeMigrationsTests.test_makemigrations_non_interactive_no_model_rename)
makemigrations adds and removes a possible model rename in ... ok
test_makemigrations_non_interactive_not_null_addition (migrations.test_commands.MakeMigrationsTests.test_makemigrations_non_interactive_not_null_addition)
Non-interactive makemigrations fails when a default is missing on a ... ok
test_makemigrations_non_interactive_not_null_alteration (migrations.test_commands.MakeMigrationsTests.test_makemigrations_non_interactive_not_null_alteration)
Non-interactive makemigrations fails when a default is missing on a ... ok
test_makemigrations_order (migrations.test_commands.MakeMigrationsTests.test_makemigrations_order)
makemigrations should recognize number-only migrations (0001.py). ... ok
test_makemigrations_unspecified_app_with_conflict_merge (migrations.test_commands.MakeMigrationsTests.test_makemigrations_unspecified_app_with_conflict_merge)
makemigrations does not create a merge for an unspecified app even if ... ok
test_makemigrations_unspecified_app_with_conflict_no_merge (migrations.test_commands.MakeMigrationsTests.test_makemigrations_unspecified_app_with_conflict_no_merge)
makemigrations does not raise a CommandError when an unspecified app ... ok
test_makemigrations_with_custom_name (migrations.test_commands.MakeMigrationsTests.test_makemigrations_with_custom_name)
makemigrations --name generate a custom migration name. ... ok
test_makemigrations_with_invalid_custom_name (migrations.test_commands.MakeMigrationsTests.test_makemigrations_with_invalid_custom_name) ... ok
test_ambigious_prefix (migrations.test_commands.MigrateTests.test_ambigious_prefix) ... ok
test_app_without_migrations (migrations.test_commands.MigrateTests.test_app_without_migrations) ... ok
test_migrate (migrations.test_commands.MigrateTests.test_migrate)
Tests basic usage of the migrate command. ... ok
test_migrate_conflict_exit (migrations.test_commands.MigrateTests.test_migrate_conflict_exit)
migrate exits if it detects a conflict. ... ok
test_migrate_fake_initial (migrations.test_commands.MigrateTests.test_migrate_fake_initial)
--fake-initial only works if all tables created in the initial ... ok
test_migrate_fake_split_initial (migrations.test_commands.MigrateTests.test_migrate_fake_split_initial)
Split initial migrations can be faked with --fake-initial. ... ok
test_migrate_inconsistent_history (migrations.test_commands.MigrateTests.test_migrate_inconsistent_history)
Running migrate with some migrations applied before their dependencies ... ok
test_migrate_initial_false (migrations.test_commands.MigrateTests.test_migrate_initial_false)
`Migration.initial = False` skips fake-initial detection. ... ok
test_migrate_plan (migrations.test_commands.MigrateTests.test_migrate_plan)
Tests migrate --plan output. ... ok
test_migrate_record_replaced (migrations.test_commands.MigrateTests.test_migrate_record_replaced)
Running a single squashed migration should record all of the original ... ok
test_migrate_record_squashed (migrations.test_commands.MigrateTests.test_migrate_record_squashed)
Running migrate for a squashed migration should record as run ... ok
test_migrate_syncdb_app_label (migrations.test_commands.MigrateTests.test_migrate_syncdb_app_label)
Running migrate --run-syncdb with an app_label only creates tables for ... ok
test_migrate_syncdb_app_with_migrations (migrations.test_commands.MigrateTests.test_migrate_syncdb_app_with_migrations) ... ok
test_migrate_syncdb_deferred_sql_executed_with_schemaeditor (migrations.test_commands.MigrateTests.test_migrate_syncdb_deferred_sql_executed_with_schemaeditor)
For an app without migrations, editor.execute() is used for executing ... ok
test_migrate_with_system_checks (migrations.test_commands.MigrateTests.test_migrate_with_system_checks) ... ok
test_regression_22823_unmigrated_fk_to_migrated_model (migrations.test_commands.MigrateTests.test_regression_22823_unmigrated_fk_to_migrated_model)
Assuming you have 3 apps, `A`, `B`, and `C`, such that: ... ok
test_showmigrations_list (migrations.test_commands.MigrateTests.test_showmigrations_list)
showmigrations --list displays migrations and whether or not they're ... ok
test_showmigrations_no_migrations (migrations.test_commands.MigrateTests.test_showmigrations_no_migrations) ... ok
test_showmigrations_plan (migrations.test_commands.MigrateTests.test_showmigrations_plan)
Tests --plan output of showmigrations command ... ok
test_showmigrations_plan_app_label_no_migrations (migrations.test_commands.MigrateTests.test_showmigrations_plan_app_label_no_migrations) ... ok
test_showmigrations_plan_multiple_app_labels (migrations.test_commands.MigrateTests.test_showmigrations_plan_multiple_app_labels)
`showmigrations --plan app_label` output with multiple app_labels. ... ok
test_showmigrations_plan_no_migrations (migrations.test_commands.MigrateTests.test_showmigrations_plan_no_migrations)
Tests --plan output of showmigrations command without migrations ... ok
test_showmigrations_plan_single_app_label (migrations.test_commands.MigrateTests.test_showmigrations_plan_single_app_label)
`showmigrations --plan app_label` output with a single app_label. ... ok
test_showmigrations_plan_squashed (migrations.test_commands.MigrateTests.test_showmigrations_plan_squashed)
Tests --plan output of showmigrations command with squashed migrations. ... ok
test_showmigrations_unmigrated_app (migrations.test_commands.MigrateTests.test_showmigrations_unmigrated_app) ... ok
test_sqlmigrate_backwards (migrations.test_commands.MigrateTests.test_sqlmigrate_backwards)
sqlmigrate outputs reverse looking SQL. ... ok
test_sqlmigrate_for_non_atomic_migration (migrations.test_commands.MigrateTests.test_sqlmigrate_for_non_atomic_migration)
Transaction wrappers aren't shown for non-atomic migrations. ... ok
test_sqlmigrate_for_non_transactional_databases (migrations.test_commands.MigrateTests.test_sqlmigrate_for_non_transactional_databases)
Transaction wrappers aren't shown for databases that don't support ... ok
test_sqlmigrate_forwards (migrations.test_commands.MigrateTests.test_sqlmigrate_forwards)
sqlmigrate outputs forward looking SQL. ... ok
test_unknown_prefix (migrations.test_commands.MigrateTests.test_unknown_prefix) ... ok
----------------------------------------------------------------------
Ran 89 tests in 0.445s
OK
Destroying test database for alias 'default' ('file:memorydb_default?mode=memory&cache=shared')...
Destroying test database for alias 'default' ('file:memorydb_default?mode=memory&cache=shared')...
Destroying test database for alias 'default' ('file:memorydb_default?mode=memory&cache=shared')...
Destroying test database for alias 'default' ('file:memorydb_default?mode=memory&cache=shared')...
Destroying test database for alias 'default' ('file:memorydb_default?mode=memory&cache=shared')...
Destroying test database for alias 'other' ('file:memorydb_other?mode=memory&cache=shared')...
Destroying test database for alias 'other' ('file:memorydb_other?mode=memory&cache=shared')...
Destroying test database for alias 'other' ('file:memorydb_other?mode=memory&cache=shared')...
Destroying test database for alias 'other' ('file:memorydb_other?mode=memory&cache=shared')...
Destroying test database for alias 'other' ('file:memorydb_other?mode=memory&cache=shared')...
>>>>> All Tests Passed
```

Results when getting evaluation report: image

brombaut commented 6 months ago

So the test that should go from fail to pass is

"FAIL_TO_PASS": [
    "test_sqlmigrate_for_non_transactional_databases (migrations.test_commands.MigrateTests)"
],

You can see the 2 lines in the log file:

test_sqlmigrate_for_non_transactional_databases (migrations.test_commands.MigrateTests.test_sqlmigrate_for_non_transactional_databases)
Transaction wrappers aren't shown for databases that don't support ... ok

I could be wrong, but it doesn't look like the log parser shown below handles the case of there being 2 lines? Or am I reading this wrong?

# swebench/metrics/log_parsers.py
# ...
def parse_log_django(log: str) -> dict:
    """
    Parser for test logs generated with Django tester framework

    Args:
        log (str): log content
    Returns:
        dict: test case to test status mapping
    """
    test_status_map = {}
    lines = log.split("\n")
    for line in lines:
        line = line.strip()
        if line.endswith(" ... ok"):
            test = line.split(" ... ok")[0]
            test_status_map[test] = TestStatus.PASSED.value
        if " ... skipped" in line:
            test = line.split(" ... skipped")[0]
            test_status_map[test] = TestStatus.SKIPPED.value
        if line.endswith(" ... FAIL"):
            test = line.split(" ... FAIL")[0]
            test_status_map[test] = TestStatus.FAILED.value
        if line.startswith("FAIL:"):
            test = line.split()[1].strip()
            test_status_map[test] = TestStatus.FAILED.value
        if line.endswith(" ... ERROR"):
            test = line.split(" ... ERROR")[0]
            test_status_map[test] = TestStatus.ERROR.value
        if line.startswith("ERROR:"):
            test = line.split()[1].strip()
            test_status_map[test] = TestStatus.ERROR.value
    return test_status_map
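
One way the two-line case could be handled, as a minimal sketch and not the harness's actual fix: remember the most recent `name (path)` header and attach the next status marker to it. The function name `parse_log_django_two_line`, the `HEADER` and `STATUS` patterns, and the stand-in `TestStatus` values are all assumptions made for illustration. The `FAIL:` and `ERROR:` summary lines handled by the original parser are left out here.

```python
import re
from enum import Enum


class TestStatus(Enum):
    # Stand-in for the harness enum; these values are assumptions.
    PASSED = "PASSED"
    FAILED = "FAILED"
    SKIPPED = "SKIPPED"
    ERROR = "ERROR"


# A unittest header such as "test_x (pkg.mod.Class)" or "test_x (pkg.mod.Class.test_x)".
HEADER = re.compile(r"^(\w+) \(([\w.]+)\)")
# A status marker; "skipped" may be followed by a reason, the others end the line.
STATUS = re.compile(r" \.\.\. (ok$|FAIL$|ERROR$|skipped)")
RESULTS = {
    "ok": TestStatus.PASSED.value,
    "FAIL": TestStatus.FAILED.value,
    "ERROR": TestStatus.ERROR.value,
    "skipped": TestStatus.SKIPPED.value,
}


def parse_log_django_two_line(log: str) -> dict:
    """Map test headers to statuses, tolerating a docstring line before the status."""
    status_map = {}
    pending = None  # header seen earlier that is still waiting for its status
    for raw in log.split("\n"):
        line = raw.strip()
        header = HEADER.match(line)
        status = STATUS.search(line)
        if header:
            pending = header.group(0)
        if status and pending:
            # Status belongs to the header on this line or on a previous line.
            status_map[pending] = RESULTS[status.group(1)]
            pending = None
        elif status:
            # No header seen; fall back to the text before the marker.
            status_map[line[: status.start()]] = RESULTS[status.group(1)]
    return status_map
```

With this sketch, the two log lines quoted above would be recorded under the header `test_sqlmigrate_for_non_transactional_databases (migrations.test_commands.MigrateTests.test_sqlmigrate_for_non_transactional_databases)`, not under the docstring text.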

brombaut commented 6 months ago

#74 might be related to this issue.

brombaut commented 5 months ago

I found another case that leads to a false-negative resolved status with Django. The test name is omitted from the test module/name path in the tasks_for_evaluation.json file (which was built from the original dataset), but it is included in the generated log file, so the two do not match. See below.

"PASS_TO_PASS": [
    ...,
    "test_ambigious_prefix (migrations.test_commands.MigrateTests)",
    ...,
]

And the test status in the log file (notice the final test_ambigious_prefix suffix):

test_ambigious_prefix (migrations.test_commands.MigrateTests.test_ambigious_prefix) ... ok
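
A hypothetical workaround for this mismatch, sketched under the assumption that the only difference is the repeated method name at the end of the dotted path: strip that suffix from the parsed log key before comparing it with the dataset entry. `normalize_test_name` and `LONG_FORM` are illustrative names, not part of SWE-bench.

```python
import re

# "name (dotted.path.Class.name)": the method name is repeated at the end of the path.
LONG_FORM = re.compile(r"^(?P<name>\w+) \((?P<path>[\w.]+)\.(?P=name)\)$")


def normalize_test_name(test: str) -> str:
    """Rewrite 'name (path.Class.name)' as 'name (path.Class)'; leave other keys alone."""
    match = LONG_FORM.match(test)
    if match:
        return f"{match.group('name')} ({match.group('path')})"
    return test


log_key = "test_ambigious_prefix (migrations.test_commands.MigrateTests.test_ambigious_prefix)"
dataset_key = "test_ambigious_prefix (migrations.test_commands.MigrateTests)"
print(normalize_test_name(log_key) == dataset_key)
```

Keys that are already in the short form pass through unchanged, so the same function could be applied to both sides of the comparison.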

brombaut commented 5 months ago

I've narrowed this down to the Django version of the task instance. I'm using the SWE-Bench Lite dataset, and when I run the evaluation/metrics pipeline against the gold patches for Django version 5.0, they are all correctly marked as resolved. However, any version < 5.0 results in the evaluation log files being generated as I've described in this issue, so those instances are not marked as resolved because the log parser parses them incorrectly.

So I guess this is somehow an environment issue? But I assume this was handled properly because obviously this whole pipeline has been run successfully for all instances in the paper.

john-b-yang commented 5 months ago

Hi @brombaut, we just released a report on the fixes we've been working on to get SWE-bench evaluation to work reliably; you can read about it here.

Thanks so much for all the detail and follow-up analysis. We will respond more promptly going forward.

Based on what you listed above, I think what may have happened is what you've already alluded to, that something about the Django installation or Pytest may have changed since that task instance was originally created and run. Based on the execution logs, it does look like that test should be marked as a pass.

We added a bunch of installation specifications (e.g. explicitly specifying PyPI version, also setting the miniconda installation link explicitly) to reduce these types of problems.

Can you try running again and seeing what happens? Thanks.

brombaut commented 5 months ago

@john-b-yang With these new changes, I'm still kind of blocked until #77 gets resolved. I haven't found a workaround for that one that doesn't involve going back to earlier SWE-Bench versions.

yhzx233 commented 4 months ago

Well, I believe this issue is not just an environment problem. @john-b-yang

The key issue is that parse_log_django indeed cannot handle the case where a test produces two lines of output. In the latest Django test scripts, if a function contains a docstring, the content will be output after the test name during testing.
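
The two-line output can be reproduced with plain `unittest`, which Django's test runner builds on. The sketch below is only an illustration: the `DocstringDemo` class and the `run_verbose` helper are made up for this example and are not taken from Django or SWE-bench. It runs one test with a docstring and one without at verbosity 2 and returns the captured output.

```python
import io
import unittest


class DocstringDemo(unittest.TestCase):
    def test_with_docstring(self):
        """First line of the docstring."""
        self.assertTrue(True)

    def test_without_docstring(self):
        self.assertTrue(True)


def run_verbose() -> str:
    """Run the demo tests at verbosity 2 and return the captured runner output."""
    stream = io.StringIO()
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(DocstringDemo)
    unittest.TextTestRunner(stream=stream, verbosity=2).run(suite)
    return stream.getvalue()


if __name__ == "__main__":
    print(run_verbose())
```

For the test with a docstring, the test name and the status end up on different lines, with the first docstring line directly in front of ` ... ok`. A parser that keys on the text before ` ... ok` therefore records the docstring text and not the test name.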

https://github.com/swe-bench/experiments/blob/1b4da98f80a30cd6d3bab6bf46f01196a0c89ba8/evaluation/unofficial/20240402_devin/logs/django__django-15766.202404_devin.eval.log#L394

In the previously completed test logs, there is also such output, and I believe the test reports generated for such logs would also have issues.

john-b-yang commented 3 months ago

Thanks all for contributing to this issue! The discussion here has been very helpful. Last week, we released a revamped SWE-bench evaluation harness (report here) that should resolve a good number of the reproducibility issues (#142, #162) and Django parsing (#166).

I'm closing this issue now as it's a bit old and the new harness should resolve many of these problems. I'd recommend pulling the latest commit and running the evaluation as discussed in the root README (it's very easy!). If there are any problems, please feel free to post them as new issues, and we can continue the discussion there.