raymyers / swe-bench-util

Scripts for working with SWE-Bench, the AI coding agent benchmark
Apache License 2.0

Benchmark Retrieval vs Oracle #3

Open raymyers opened 5 months ago

raymyers commented 5 months ago

Building on #1 and @aorwall's suggestion: create an easy way to test an agent against the oracle in terms of identifying the files to be modified.

I was thinking of a simple method interface, something like benchmark_retrieval(path: str, query: str, expected_files: list[str]). The output could be the position in the search results at which each expected file is found. Maybe also provide the line numbers of the changes in the files to verify the returned chunks. A next step could then be to let the method only verify that the exact files (or code snippets?) are returned. Then it would be possible to benchmark the Assistants API with it as well?

My plan is to test different combinations with LlamaIndex, since it makes it easy to switch between embedding models, transformers, and retrievers. LlamaIndex also has a CodeSplitter based on Sweep's parser, though it is missing some of the tweaks Sweep has in theirs. I'm also trying to figure out whether LlamaIndex has a cache for embeddings, like Sweep has in vector_db.
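A minimal sketch of that interface, under stated assumptions: the search_fn parameter and the rank-based return value are illustrative additions, not a settled API, and search_fn stands in for whichever retriever is being tested.

from typing import Callable

def benchmark_retrieval(
    path: str,
    query: str,
    expected_files: list[str],
    search_fn: Callable[[str, str], list[str]],
) -> dict[str, int | None]:
    """Return the 0-based rank of each expected file in the search results, or None if missing.

    `search_fn` is a placeholder for the retriever under test (e.g. a LlamaIndex
    retriever built over the repo at `path`); it returns an ordered list of file paths.
    """
    hits = search_fn(path, query)
    return {f: hits.index(f) if f in hits else None for f in expected_files}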

I'm not sure what to call the agent being compared, other than "oracle replacement".

raymyers commented 5 months ago

I think what we're doing here is benchmarking a recommendation engine, and therefore these standard classifier metrics should be useful: Precision and Recall.

Or in our domain:

patch_files        # set of files modified by the gold patch
recommended_files  # set of files recommended by the agent
true_positives = recommended_files & patch_files
precision = len(true_positives) / len(recommended_files)
recall = len(true_positives) / len(patch_files)
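A worked example with hypothetical file names: the gold patch touches two files and the agent recommends three, one of them irrelevant.

patch_files = {"src/parser.py", "src/utils.py"}
recommended_files = {"src/parser.py", "src/utils.py", "README.md"}

true_positives = recommended_files & patch_files           # {"src/parser.py", "src/utils.py"}
precision = len(true_positives) / len(recommended_files)   # 2 / 3 ≈ 0.67
recall = len(true_positives) / len(patch_files)            # 2 / 2 = 1.0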
raymyers commented 5 months ago

Added the calculations to swe_bench_util/file_hint_eval.py; now it needs to be easy to check an agent run against the oracle.
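A sketch of what that check could look like, assuming the agent run and the oracle are each available as a list of file paths. The function name, signature, and example paths are illustrative, not the actual file_hint_eval.py API.

def eval_file_hints(recommended: list[str], patch_files: list[str]) -> dict[str, float]:
    # Compare the files an agent recommended against the files the gold patch modifies.
    recommended_set, patch_set = set(recommended), set(patch_files)
    true_positives = recommended_set & patch_set
    precision = len(true_positives) / len(recommended_set) if recommended_set else 0.0
    recall = len(true_positives) / len(patch_set) if patch_set else 0.0
    return {"precision": precision, "recall": recall}

# Example usage with hypothetical paths:
print(eval_file_hints(["src/parser.py", "README.md"], ["src/parser.py", "src/utils.py"]))
# {'precision': 0.5, 'recall': 0.5}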