princeton-nlp / SWE-bench

[ICLR 2024] SWE-Bench: Can Language Models Resolve Real-world Github Issues?
https://www.swebench.com
MIT License

get_eval_refs doesn't work with a dataset that's been `save_to_disk`'d #107

Closed: waterson closed this issue 2 months ago

waterson commented 2 months ago

Describe the bug

get_eval_refs returns the FAIL_TO_PASS and PASS_TO_PASS entries as strings instead of lists when loading a dataset that has been saved with the HF save_to_disk API.

This means that if you try to run an eval using a dataset that you've saved to disk (e.g., because HF has been having a bad week), the eval runs but reports a zero solve rate because the FAIL_TO_PASS and PASS_TO_PASS entries are strings rather than lists of test names.

Steps/Code to Reproduce

import datasets
from swebench import get_eval_refs

# Get evalrefs from HF.
refs_a = get_eval_refs("princeton-nlp/SWE-bench_Lite")

# Save the HF dataset to a local file, then get evalrefs from that.
ds = datasets.load_dataset("princeton-nlp/SWE-bench_Lite", split="dev")
ds.save_to_disk("/tmp/swebenchlite-dev")

refs_b = get_eval_refs("/tmp/swebenchlite-dev")

# Are they the same?
a = refs_a["sqlfluff__sqlfluff-1625"]["FAIL_TO_PASS"]
b = refs_b["sqlfluff__sqlfluff-1625"]["FAIL_TO_PASS"]
print(f"same same? {a == b}")
print(f"loaded from HF datasets: {a!r}")
print(f"loaded from file: {b!r}")

Expected Results

Saving the dataset (1/1 shards): 100% 23/23 [00:00<00:00, 2538.86 examples/s]
same same? True
loaded from HF datasets: ['test/cli/commands_test.py::test__cli__command_directed']
loaded from file: ['test/cli/commands_test.py::test__cli__command_directed']

Actual Results

Saving the dataset (1/1 shards): 100% 23/23 [00:00<00:00, 2538.86 examples/s]
same same? False
loaded from HF datasets: ['test/cli/commands_test.py::test__cli__command_directed']
loaded from file: '["test/cli/commands_test.py::test__cli__command_directed"]'
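
The string in the output above looks like a JSON encoding of the expected list. As a minimal sketch of a workaround on the caller's side, the entry could be decoded when it arrives as a string. The helper name normalize_test_list is hypothetical and is not part of SWE-bench; the sketch also assumes the string is valid JSON, as it is in this output.

```python
import json


def normalize_test_list(value):
    """Return a list of test names from a list or from its JSON-encoded string.

    Hypothetical helper for illustration; not part of the swebench package.
    """
    if isinstance(value, str):
        # Assumption: the string form is a JSON array, as in the output above.
        value = json.loads(value)
    return list(value)


# The two values reported in "Actual Results".
from_hub = ['test/cli/commands_test.py::test__cli__command_directed']
from_disk = '["test/cli/commands_test.py::test__cli__command_directed"]'

print(normalize_test_list(from_hub) == normalize_test_list(from_disk))
```

With this in place, both values compare equal as lists, although it only hides the mismatch and does not explain why the two loading paths differ.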

System Information

Linux, Python 3.9, SWE-bench 1.1.0.

waterson commented 2 months ago

(Updated test.)

waterson commented 2 months ago

Actually...I'm not sure that this is valid. In particular, I think that I had some confusion between a Dataset and a DatasetDict. In this case I'm saving one and trying to load the other. Closing for now.