Open raymyers opened 5 months ago
I think what we're doing here is benchmarking a recommendation engine, and therefore these standard classifier metrics should be useful: Precision and Recall.
Or in our domain:
patch_files
recommended_files
true_positives = recommended_files & patch_files
precision = len(true_positives) / len(recommended_files)
recall = len(true_positives) / len(patch_files)
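A minimal runnable sketch of the calculation above, assuming both inputs are collections of file paths. The function name `file_hint_metrics` and the sample paths are hypothetical and not necessarily what `swe_bench_util/file_hint_eval.py` uses. Returning 0.0 for an empty set is one possible convention for avoiding division by zero.

```python
def file_hint_metrics(recommended_files, patch_files):
    """Return (precision, recall) for a single task instance.

    Hypothetical helper for illustration only.
    """
    recommended = set(recommended_files)
    patch = set(patch_files)
    true_positives = recommended & patch
    # Assumption: report 0.0 rather than raising when a set is empty.
    precision = len(true_positives) / len(recommended) if recommended else 0.0
    recall = len(true_positives) / len(patch) if patch else 0.0
    return precision, recall


# Made-up example: the agent recommends 4 files, 2 of which are in the patch of 3.
precision, recall = file_hint_metrics(
    ["src/a.py", "src/b.py", "docs/readme.md", "src/c.py"],
    ["src/a.py", "src/b.py", "src/d.py"],
)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

High precision means few irrelevant files were recommended; high recall means few patched files were missed.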
Calculations added to swe_bench_util/file_hint_eval.py; now it needs to be easy to check an agent run vs the oracle.
Building on #1 and @aorwall's suggestion, create an easy way to test an agent against the oracle in terms of identifying the files to be modified.
Not sure what to call the agent being compared other than oracle replacement.
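One way the agent-vs-oracle check could look, as a rough sketch: it assumes each run can be reduced to a mapping from instance ID to a list of file paths, and it macro-averages precision and recall across the oracle's instances. The names `score` and `compare_to_oracle`, the instance IDs, and the input shape are all hypothetical; nothing here reflects an existing API in the repo.

```python
from statistics import mean


def score(recommended_files, patch_files):
    """Precision and recall for one instance (0.0 when a set is empty)."""
    recommended, patch = set(recommended_files), set(patch_files)
    hits = recommended & patch
    precision = len(hits) / len(recommended) if recommended else 0.0
    recall = len(hits) / len(patch) if patch else 0.0
    return precision, recall


def compare_to_oracle(agent_files, oracle_files):
    """Macro-average precision/recall over every instance in the oracle.

    Both arguments are assumed to be dicts: instance_id -> list of paths.
    Instances the agent skipped count as an empty recommendation.
    """
    if not oracle_files:
        return 0.0, 0.0
    scores = [
        score(agent_files.get(instance_id, []), patch_files)
        for instance_id, patch_files in oracle_files.items()
    ]
    return mean(p for p, _ in scores), mean(r for _, r in scores)


# Made-up data for two instances.
oracle = {"inst-1": ["a.py", "b.py"], "inst-2": ["c.py"]}
agent = {"inst-1": ["a.py"], "inst-2": ["c.py", "d.py"]}
avg_precision, avg_recall = compare_to_oracle(agent, oracle)
print(f"avg precision={avg_precision:.2f} avg recall={avg_recall:.2f}")
```

Macro-averaging weights every instance equally; pooling true positives across instances (micro-averaging) would be an alternative if large patches should count for more.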