
SWE-bench Util

Scripts for working with SWE-bench, the AI coding agent benchmark.

If you are trying to beat Devin, see also the SWE-bench fork from OpenAgentsInc for running your agent.

Features

Setup

Install poetry if you don't have it

python3 -m pip install poetry

If using a feature that requires a vendor API, copy .env.example to .env and fill in the values.

Install the dependencies; this also installs swe_bench_util as an editable command:

poetry install

Run

swe_bench_util --help

This assumes the Poetry install put swe_bench_util on your PATH; otherwise you can use python -m swe_bench_util.
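
If the command is not on your PATH, either of these runs the same CLI from the project directory:

poetry run swe_bench_util --help

python -m swe_bench_util --help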

Save the first example case. This will download the full dataset on first run, caching it with the datasets library.

swe_bench_util get rows --split 'dev[0:1]'

Output:

File 'examples/sqlfluff__sqlfluff-4764.json' was saved
File 'examples/sqlfluff__sqlfluff-4764.md' was saved

Use jq to show a subset of the JSON.

jq '. | {repo, instance_id, base_commit, problem_statement}' examples/sqlfluff__sqlfluff-4764.json
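
To pull a single field as raw text, for example the problem statement (the fields match those listed under Data below):

jq -r '.problem_statement' examples/sqlfluff__sqlfluff-4764.json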

Save the Oracle (patched file list) for the dev subset.

swe_bench_util get oracle

Output:

File 'examples/oracle.json' was saved

Use jq to summarize the oracle data, for example listing the distinct repos or repo/commit pairs:

jq '.[] | .repo' examples/oracle.json | jq -s 'unique'

jq '.[] | {repo, base_commit}' examples/oracle.json | jq -s 'unique'

Git checkout the repo / base_commit of an example:

swe_bench_util checkout --id pydicom__pydicom-793
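
To sanity-check the checkout, inspect the cloned repo (the path below is a placeholder; use the location the tool reports):

cd path/to/checked-out/repo

git log -1 --oneline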

Index and run inference with astra-assistants:

Make sure you have your keys set up in .env:

cp .env.example .env

Then fill in your keys and run the index command:

swe_bench_util index astra-assistants

Output:

...
Files used in retrieval: ["test_wcs.py", "wcs.py", "test_utils.py", "test_transform_coord_meta.py", "CHANGES.rst", "test_images.py", "test_misc.py"]
...

Data

By default, most commands operate on the dev split, using the Hugging Face datasets API. You can select a different split or row range with --split, for instance, to take the first ten rows of the test split:
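
swe_bench_util get rows --split 'test[0:10]'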

You can also filter by repo or id. Filters are applied after the split is selected, so combining a narrow row range with a filter can leave you with no rows.
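
The filter flag names are not documented here; assuming they mirror the field names, a combined call might look like this (check swe_bench_util get rows --help for the real flags):

swe_bench_util get rows --split 'dev' --repo sqlfluff/sqlfluff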

Here is the shape of the data.

    dev: Dataset({
        features: ['repo', 'instance_id', 'base_commit', 'patch', 'test_patch', 'problem_statement', 'hints_text', 'created_at', 'version', 'FAIL_TO_PASS', 'PASS_TO_PASS', 'environment_setup_commit'],
        num_rows: 225
    })
    test: Dataset({
        features: ['repo', 'instance_id', 'base_commit', 'patch', 'test_patch', 'problem_statement', 'hints_text', 'created_at', 'version', 'FAIL_TO_PASS', 'PASS_TO_PASS', 'environment_setup_commit'],
        num_rows: 2294
    })
    train: Dataset({
        features: ['repo', 'instance_id', 'base_commit', 'patch', 'test_patch', 'problem_statement', 'hints_text', 'created_at', 'version', 'FAIL_TO_PASS', 'PASS_TO_PASS', 'environment_setup_commit'],
        num_rows: 19008
    })
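
For example, save the first test-split row and check which tests the fix must flip from fail to pass (the output file name follows the instance_id pattern; the one below is a placeholder):

swe_bench_util get rows --split 'test[0:1]'

jq '{instance_id, FAIL_TO_PASS, PASS_TO_PASS}' examples/<instance_id>.json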

Checks

make check

That is equivalent to:

python -m pytest

python -m ruff check --fix

python -m ruff format
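
A minimal Makefile target covering those steps might look like this (a sketch; the repo's actual Makefile may differ):

    check:
    	python -m pytest
    	python -m ruff check --fix
    	python -m ruff format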