logic-star-ai / swt-bench

[NeurIPS 2024] Evaluation harness for SWT-Bench, a benchmark for evaluating LLM repository-level test-generation
https://openreview.net/forum?id=9Y8zUO11EQ&noteId=9Y8zUO11EQ
MIT License

Generated Test Cases by Different Agents in SWE-BENCH #1

Open nashid opened 1 month ago

nashid commented 1 month ago

Describe the issue

Congrats on your excellent work on Code Agents for automated test generation!

I am interested in gaining a deeper understanding of the results presented in your paper, especially regarding the performance of different agents at generating test cases.

Thank you again for sharing the artifact and for considering my request!

Suggest an improvement to documentation

No response

nielstron commented 4 weeks ago

Hi @nashid, Thank you for your inquiry. We have now published the raw results of our benchmark (i.e. the traces of the agents, the predicted patches and the logs of the evaluation harness) on our main README. You can find them here: https://github.com/logic-star-ai/swt-bench/tree/master?tab=readme-ov-file#evaluation-results

Below is a JSON dump containing all successfully resolved instances for each approach. Please refer to the patch prediction files or the harness logs (_approach_/_model_/_instance_/extracted_patch.diff) for the respective test cases.

Full list of resolved instances per approach:

```json
{
  "LIBRO": [
    "sympy__sympy-14317", "sympy__sympy-14774", "sympy__sympy-19007", "sympy__sympy-18189",
    "django__django-15400", "django__django-13757", "django__django-15498", "sympy__sympy-24066",
    "django__django-11099", "django__django-11039", "matplotlib__matplotlib-26011", "sympy__sympy-13895",
    "django__django-16255", "sympy__sympy-20442", "sympy__sympy-20212", "sympy__sympy-18621",
    "sympy__sympy-21614", "sympy__sympy-15011", "sympy__sympy-23117", "sympy__sympy-21847",
    "django__django-11133", "scikit-learn__scikit-learn-14894", "django__django-16595", "sympy__sympy-17139",
    "scikit-learn__scikit-learn-14087", "sympy__sympy-21055", "scikit-learn__scikit-learn-13439",
    "scikit-learn__scikit-learn-25500", "scikit-learn__scikit-learn-13779", "mwaskom__seaborn-3407",
    "django__django-15213", "sympy__sympy-20049", "matplotlib__matplotlib-22835", "sympy__sympy-18087",
    "sympy__sympy-17630", "scikit-learn__scikit-learn-14092", "sympy__sympy-24909", "django__django-12983",
    "matplotlib__matplotlib-25332"
  ],
  "AutoCodeRover": [
    "django__django-15347", "sympy__sympy-18189", "django__django-15851", "django__django-14915",
    "django__django-13757", "django__django-16910", "django__django-14672", "sympy__sympy-15011",
    "sympy__sympy-21847", "django__django-11133", "sympy__sympy-20154", "scikit-learn__scikit-learn-14894",
    "sympy__sympy-20322", "scikit-learn__scikit-learn-14087", "sympy__sympy-21055", "django__django-12700",
    "sympy__sympy-13647", "django__django-13551", "django__django-12286", "django__django-16873",
    "pytest-dev__pytest-5227", "scikit-learn__scikit-learn-12471", "django__django-12747",
    "sympy__sympy-22005", "sympy__sympy-12236"
  ],
  "Aider": [
    "django__django-10914", "matplotlib__matplotlib-23562", "django__django-11099", "matplotlib__matplotlib-26011",
    "sympy__sympy-22714", "sympy__sympy-20442", "matplotlib__matplotlib-23987", "sympy__sympy-20212",
    "scikit-learn__scikit-learn-25747", "sympy__sympy-13971", "sympy__sympy-21847", "django__django-11133",
    "sympy__sympy-24152", "sympy__sympy-20154", "sympy__sympy-20590", "django__django-15789",
    "django__django-16595", "sympy__sympy-20322", "sympy__sympy-17139", "scikit-learn__scikit-learn-14087",
    "scikit-learn__scikit-learn-13439", "scikit-learn__scikit-learn-25500", "sympy__sympy-21379",
    "django__django-13551", "scikit-learn__scikit-learn-10508", "django__django-12286", "pytest-dev__pytest-5692",
    "django__django-15213", "matplotlib__matplotlib-22835", "matplotlib__matplotlib-23913",
    "sympy__sympy-17630", "django__django-12983", "sympy__sympy-22005", "sympy__sympy-24102",
    "scikit-learn__scikit-learn-25570"
  ],
  "SWE-Agent+": [
    "scikit-learn__scikit-learn-15535", "sympy__sympy-14774", "pydata__xarray-4094", "sympy__sympy-18189",
    "matplotlib__matplotlib-25498", "matplotlib__matplotlib-23562", "sympy__sympy-24066",
    "scikit-learn__scikit-learn-13584", "django__django-13590", "matplotlib__matplotlib-26011",
    "sympy__sympy-18835", "sympy__sympy-22714", "sympy__sympy-13437", "django__django-16255",
    "sympy__sympy-20442", "matplotlib__matplotlib-25079", "sympy__sympy-20212", "sympy__sympy-21614",
    "sympy__sympy-23117", "sympy__sympy-21847", "sympy__sympy-20154", "sympy__sympy-20590",
    "scikit-learn__scikit-learn-14894", "django__django-15789", "scikit-learn__scikit-learn-13496",
    "sympy__sympy-20322", "sympy__sympy-17139", "scikit-learn__scikit-learn-14087", "django__django-13220",
    "django__django-12308", "sympy__sympy-21055", "scikit-learn__scikit-learn-13439",
    "scikit-learn__scikit-learn-25500", "scikit-learn__scikit-learn-13779", "sympy__sympy-21379",
    "matplotlib__matplotlib-23964", "sympy__sympy-13647", "mwaskom__seaborn-3407",
    "scikit-learn__scikit-learn-10508", "sympy__sympy-18057", "matplotlib__matplotlib-24149",
    "sympy__sympy-13480", "pytest-dev__pytest-5692", "matplotlib__matplotlib-22835",
    "scikit-learn__scikit-learn-12471", "scikit-learn__scikit-learn-14092", "sympy__sympy-22005",
    "sympy__sympy-24102", "matplotlib__matplotlib-25332", "scikit-learn__scikit-learn-25570",
    "sympy__sympy-14396"
  ],
  "SWE-Agent": [
    "scikit-learn__scikit-learn-15535", "pydata__xarray-4094", "sympy__sympy-18189",
    "matplotlib__matplotlib-25498", "matplotlib__matplotlib-23562", "sympy__sympy-24066",
    "scikit-learn__scikit-learn-13584", "django__django-13590", "matplotlib__matplotlib-26011",
    "sympy__sympy-22714", "sympy__sympy-13437", "sympy__sympy-20442", "matplotlib__matplotlib-25079",
    "sympy__sympy-20212", "sympy__sympy-21614", "sympy__sympy-23117", "pydata__xarray-5131",
    "sympy__sympy-20154", "sympy__sympy-20590", "scikit-learn__scikit-learn-14894", "django__django-15789",
    "django__django-13448", "sympy__sympy-17139", "scikit-learn__scikit-learn-14087",
    "scikit-learn__scikit-learn-13439", "scikit-learn__scikit-learn-25500", "scikit-learn__scikit-learn-13779",
    "matplotlib__matplotlib-23964", "mwaskom__seaborn-3407", "scikit-learn__scikit-learn-10508",
    "sympy__sympy-18057", "scikit-learn__scikit-learn-13142", "sympy__sympy-13480", "pytest-dev__pytest-5692",
    "matplotlib__matplotlib-22835", "scikit-learn__scikit-learn-12471", "matplotlib__matplotlib-23913",
    "scikit-learn__scikit-learn-14092", "sympy__sympy-18199", "sympy__sympy-22005",
    "matplotlib__matplotlib-25332", "django__django-12589", "scikit-learn__scikit-learn-25570",
    "sympy__sympy-14396"
  ]
}
```
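As a quick sketch of how this dump can be consumed: the snippet below loads the JSON and reports, per approach, how many instances were resolved and which instances every approach resolved. The filename `resolved_instances.json` is hypothetical; it assumes you saved the dump above to a local file.

```python
import json

def summarize_resolved(resolved):
    """Per-approach resolved counts, plus the instances resolved by every approach."""
    counts = {approach: len(ids) for approach, ids in resolved.items()}
    common = sorted(set.intersection(*(set(ids) for ids in resolved.values())))
    return counts, common

if __name__ == "__main__":
    # Hypothetical path: assumes the JSON dump above was saved locally.
    with open("resolved_instances.json") as f:
        counts, common = summarize_resolved(json.load(f))
    print(counts)
    print(common)
```

The intersection is useful for spotting "easy" instances that every agent handles, versus instances only one approach resolves.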
nashid commented 3 weeks ago

@nielstron thanks for your response. If I understand correctly, the JSON refers to the list of resolved instances per approach. However, I am more interested in identifying which GitHub issues had correct test cases generated by these agents. Could you provide a list of those issues and the specific test cases generated for each one?

nielstron commented 3 weeks ago

Hi, please look through the evaluation harness logs. Each one contains a detailed list of which test cases were predicted and how each test's status changed (P->F, P->P, etc.).

You can find these details for each run under _approach_/_model_/_instance_ at:

https://files.sri.inf.ethz.ch/swt-bench/run_instance_swt_logs/

For example, the file aider_gpt-4-1106-preview/aider_gpt-4-1106-preview/astropy__astropy-7746/report.json contains the following complete list of changed tests due to the prediction in tests_pred, where the FAIL_TO_PASS section would be the "correct test cases" you are referring to:


```json
{
    "astropy__astropy-7746": {
        "patch_is_None": false,
        "patch_exists": true,
        "patch_successfully_applied": true,
        "resolved": false,
        "coverage_pred": 0.75,
        "coverage_gold": 1.0,
        "coverage_base": 0.5,
        "coverage_delta_pred": 0.5,
        "coverage_delta_gold": 1.0,
        "added_f2p": 0,
        "tests_base": {
            "FAIL_TO_PASS": [],
            "PASS_TO_PASS": [
                "astropy/wcs/tests/test_wcs.py::test_inconsistent_sip",
                ...
                "astropy/wcs/tests/test_wcs.py::test_error_message",
                "astropy/wcs/tests/test_wcs.py::test_sip_broken",
                "astropy/wcs/tests/test_wcs.py::test_broadcasting"
            ],
            "FAIL_TO_FAIL": [],
            "PASS_TO_FAIL": [],
            "UNMATCHED": []
        },
        "tests_pred": {
            "FAIL_TO_PASS": [],
            "PASS_TO_PASS": [
                "astropy/wcs/tests/test_wcs.py::test_inconsistent_sip",
                ...
                "astropy/wcs/tests/test_wcs.py::test_sip_broken",
                "astropy/wcs/tests/test_wcs.py::test_broadcasting"
            ],
            "FAIL_TO_FAIL": [
                "astropy/wcs/tests/test_wcs.py::test_empty_world2pix",
                "astropy/wcs/tests/test_wcs.py::test_empty_pix2world_array",
                "astropy/wcs/tests/test_wcs.py::test_empty_world2pix_array",
                "astropy/wcs/tests/test_wcs.py::test_empty_pix2world"
            ],
            "PASS_TO_FAIL": [],
            "UNMATCHED": []
        },
        "tests_gold": {
            "FAIL_TO_PASS": [
                "astropy/wcs/tests/test_wcs.py::test_zero_size_input"
            ],
            "PASS_TO_PASS": [
                "astropy/wcs/tests/test_wcs.py::test_inconsistent_sip",
                "astropy/wcs/tests/test_wcs.py::test_passing_ImageHDU",
                ...
                "astropy/wcs/tests/test_wcs.py::test_sip_broken",
                "astropy/wcs/tests/test_wcs.py::test_broadcasting"
            ],
            "FAIL_TO_FAIL": [],
            "PASS_TO_FAIL": [],
            "UNMATCHED": []
        }
    }
}
```

That is, in the above case the model did not predict a single FAIL_TO_PASS test case, while the gold suite contains exactly one, namely test_zero_size_input. The predicted tests can be found in aider_gpt-4-1106-preview/aider_gpt-4-1106-preview/astropy__astropy-7746/extracted_patch.diff; in this case they are the cases shown under FAIL_TO_FAIL:

```diff
diff --git a/astropy/wcs/tests/test_wcs.py b/astropy/wcs/tests/test_wcs.py
index 85853e10e5..0fb7d22416 100644
--- a/astropy/wcs/tests/test_wcs.py
+++ b/astropy/wcs/tests/test_wcs.py
@@ -1093,3 +1093,31 @@ def test_keyedsip():
     assert isinstance( w.sip, wcs.Sip )
     assert w.sip.crpix[0] == 2048
     assert w.sip.crpix[1] == 1026
+def test_empty_pix2world():
+    # Test for passing empty lists/arrays to wcs_pix2world
+    wcs = WCS(get_pkg_data_filename('data/sip.fits'))
+    result = wcs.wcs_pix2world([], [], 0)
+    assert result == ([], [])
+
+def test_empty_world2pix():
+    # Test for passing empty lists/arrays to wcs_world2pix
+    wcs = WCS(get_pkg_data_filename('data/sip.fits'))
+    result = wcs.wcs_world2pix([], [], 0)
+    assert result == ([], [])
+
+def test_empty_pix2world_array():
+    # Test for passing empty numpy arrays to wcs_pix2world
+    wcs = WCS(get_pkg_data_filename('data/sip.fits'))
+    result = wcs.wcs_pix2world(np.array([]), np.array([]), 0)
+    assert result == (np.array([]), np.array([]))
+
+def test_empty_world2pix_array():
+    # Test for passing empty numpy arrays to wcs_world2pix
+    wcs = WCS(get_pkg_data_filename('data/sip.fits'))
+    result = wcs.wcs_world2pix(np.array([]), np.array([]), 0)
+    assert result == (np.array([]), np.array([]))
+
+from astropy.wcs import WCS
+import numpy as np
+from astropy.utils.data import get_pkg_data_filename
+
```

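To automate the lookup the original question asks about, the report.json files can be parsed programmatically. The sketch below assumes you have downloaded the logs locally with the _approach_/_model_/_instance_/report.json layout described above; the function and directory names are illustrative, not part of the harness.

```python
import json
from pathlib import Path

def predicted_fail_to_pass(report):
    """Map each instance id in a report.json to the FAIL_TO_PASS tests of the
    predicted suite -- the "correct test cases" in the sense discussed above."""
    return {iid: entry["tests_pred"]["FAIL_TO_PASS"] for iid, entry in report.items()}

def collect_from_logs(root):
    """Walk a downloaded log tree (approach/model/instance/report.json) and keep
    only instances where the prediction produced at least one FAIL_TO_PASS test."""
    found = {}
    for path in Path(root).glob("*/*/*/report.json"):
        with open(path) as f:
            for iid, tests in predicted_fail_to_pass(json.load(f)).items():
                if tests:
                    found[iid] = tests
    return found
```

Instances returned by `collect_from_logs` are those where the generated tests actually reproduce the issue; the corresponding test code is then in each instance's extracted_patch.diff.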
We also provide tooling for analyzing these logs in the figures folder; this is specifically the code used to compute the numbers for the tables and figures in the final version of the paper.

I hope this helps!

UPDATE: added more concrete examples