princeton-nlp / SWE-bench

[ICLR 2024] SWE-Bench: Can Language Models Resolve Real-world Github Issues?
https://www.swebench.com
MIT License

Is it possible to evaluate the train set? #116

Closed chriscremer closed 2 weeks ago

chriscremer commented 1 month ago

Describe the issue

Hello! I'd like to evaluate outputs coming from train set prompts. Is this possible?

From what I can see, it seems like we can only train on the train set, not evaluate it, right? To fix this, we would need to:

Is there anything else that needs to change? Is there a way of doing this besides manually going through each repo version and adding the relevant info?

Thanks for your time!

john-b-yang commented 2 weeks ago

Hi @chriscremer thanks for the great questions!

All of your observations are correct, with minor tweaks:

Hello! I'd like to evaluate outputs coming from train set prompts. Is this possible?

In the current SWE-bench repository, this is not possible.

From what I can see, it seems like we can only train on the train set, not evaluate it, right?

Yes

To fix this, we would need to... [steps]

Yes, these are all correct! We actually have a number of repositories cloned under the SWE-bench organization, but they are not public at the moment, simply because there hasn't been a need. If you are still interested in setting up execution-based evaluation for repositories in the training set, we can certainly coordinate, and I'd be happy to make public the repositories that we have so far.

Is there a way of doing this besides manually going through each repo version and adding the relevant info?

Unfortunately, this is the one step of SWE-bench that is still heavily manual. We can do automatic collection + validation, but defining the installation specifications is still a manual process at the moment. From a research angle, some teams have definitely been working on this problem, but I'm not aware of any system or agent that can reliably define installation specifications for a repository in an automatic way.
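
To make concrete what "defining the installation specifications" involves, here is a minimal sketch of a hand-written table mapping each repository version to its setup recipe. Every name in it (`InstallSpec`, `INSTALL_SPECS`, `build_setup_commands`, and the example repository and versions) is hypothetical and chosen for illustration; it is not the actual SWE-bench schema or harness code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstallSpec:
    """Hypothetical per-version installation recipe (not the SWE-bench schema)."""
    python: str            # Python version the environment should use
    install_cmd: str       # command that installs the repository itself
    packages: tuple = ()   # extra packages to install before the repository


# Hypothetical hand-maintained table: (repo, version) -> recipe.
# Entries like these are what currently has to be written manually.
INSTALL_SPECS = {
    ("example-org/example-repo", "1.0"): InstallSpec(
        python="3.8",
        install_cmd="pip install -e .",
        packages=("pytest",),
    ),
    ("example-org/example-repo", "2.0"): InstallSpec(
        python="3.10",
        install_cmd="pip install -e .",
    ),
}


def build_setup_commands(repo, version, env_name="testbed"):
    """Turn a recipe into the list of shell commands that would set up the environment."""
    key = (repo, version)
    if key not in INSTALL_SPECS:
        # No recipe yet: this repo version cannot be evaluated by execution.
        raise KeyError(f"no installation spec defined for {repo} version {version}")
    spec = INSTALL_SPECS[key]
    commands = [f"conda create -n {env_name} python={spec.python} -y"]
    if spec.packages:
        commands.append("pip install " + " ".join(spec.packages))
    commands.append(spec.install_cmd)
    return commands
```

Calling `build_setup_commands("example-org/example-repo", "1.0")` returns the commands for that version, while a repo version without an entry raises `KeyError`. The missing entries are the manual gap described above: each one has to be worked out by hand for every repo version in the train set.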

Thanks again for the great questions! I'll mark this as completed for now, but do feel free to re-open this issue or create a new issue for any additional questions.