raymyers / swe-bench-util

Scripts for working with SWE-Bench, the AI coding agent benchmark
Apache License 2.0

Recall with Assistants #10

Closed: phact closed this 5 months ago

phact commented 5 months ago

In addition to cloning and indexing, the call now evaluates recall with assistants and writes the output to recall/results.json.

Some sample output:

[
  {
    "id": "astropy__astropy-11693",
    "hint_files": [
      "docs/wcs/examples/programmatic.py",
      "docs/wcs/relax.rst",
      ".github/labeler.yml",
      "astropy/wcs/tests/test_wcs.py",
      "docs/wcs/legacy_interface.rst",
      "CHANGES.rst",
      "astropy/wcs/wcsapi/tests/test_fitswcs.py",
      "astropy/wcs/include/astropy_wcs_api.h",
      "astropy/wcs/wcs.py"
    ],
    "patch_files": [
      "astropy/wcs/wcsapi/fitswcs.py"
    ],
    "test_patch_files": [
      "astropy/wcs/wcsapi/tests/test_fitswcs.py"
    ],
    "precision": 0.0,
    "recall": 0.0,
    "search_string": "astropy wcsapi fitswcs.py add quiet parameter to all_world2pix call"
  }
]
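
For readers interpreting the `precision` and `recall` fields, here is a minimal sketch of how such scores could be computed from one record. It is not the code from this PR: the function name `precision_recall` is hypothetical, and it assumes the retrieved `hint_files` are scored against `patch_files` only. The sample above is consistent with that assumption, since `test_fitswcs.py` appears in both `hint_files` and `test_patch_files` yet recall is still 0.0.

```python
def precision_recall(hint_files, patch_files):
    """Score retrieved files against the files the gold patch touches.

    Hypothetical helper: assumes set-based matching on exact paths and
    that test_patch_files are not counted as relevant.
    """
    hints = set(hint_files)
    patched = set(patch_files)
    hits = hints & patched  # retrieved files that the patch actually edits

    # Guard against empty sets so we never divide by zero.
    precision = len(hits) / len(hints) if hints else 0.0
    recall = len(hits) / len(patched) if patched else 0.0
    return precision, recall


# The record from the sample output above (hint list shortened for brevity).
record = {
    "hint_files": [
        "astropy/wcs/tests/test_wcs.py",
        "astropy/wcs/wcsapi/tests/test_fitswcs.py",
        "astropy/wcs/wcs.py",
    ],
    "patch_files": ["astropy/wcs/wcsapi/fitswcs.py"],
}

precision, recall = precision_recall(record["hint_files"], record["patch_files"])
print(precision, recall)
```

Under these assumptions the record scores 0.0 on both metrics because none of the retrieved paths is `astropy/wcs/wcsapi/fitswcs.py`, even though several are in the same package.
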
raymyers commented 5 months ago

Great, looking forward to seeing how this performs. Maybe we can also figure out which other methods are in play so we can get comparisons.