Is it possible to evaluate the train set?

Describe the issue

Hello! I'd like to evaluate outputs coming from train set prompts. Is this possible?

From what I can see, it seems like we can only train on the train set, not evaluate it, right? To fix this, we would need to:

Be able to clone repos which are not part of the org's forked list. Related to this issue.
Add the list of required packages for each repo to MAP_VERSION_TO_INSTALL.
Add the test framework for each repo to MAP_REPO_TO_TEST_FRAMEWORK.
Write a parser for the logs of each repo: MAP_REPO_TO_PARSER.

Is there anything else that needs to change? Is there a way of doing this besides manually going through each repo version and adding the relevant info?

Thanks for your time!

Hi @chriscremer thanks for the great questions!

All of your observations are correct, with minor tweaks:

Hello! I'd like to evaluate outputs coming from train set prompts. Is this possible?

In the current SWE-bench repository, this is not possible.

From what I can see, it seems like we can only train on the train set, not evaluate it, right?

Yes.

To fix this, we would need to... [steps]

Yes, these are all correct! We do have a number of repositories cloned under the SWE-bench organization, but they are not public at the moment, simply because there hasn't been a need. If you are still interested in setting up execution-based evaluation for repositories in the training set, we can certainly coordinate, and I'd be happy to make public the repositories we have so far.
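For anyone following along, here is a rough sketch of what the three per-repo entries from the steps above might look like. The map names come from this thread; everything else is an assumption made for illustration and not the confirmed SWE-bench schema: the repository name `example-org/example-repo`, the version `"1.0"`, the field names inside the install entry, the test command, and the parser `parse_log_pytest_sketch`.

```python
from typing import Callable, Dict

# Hypothetical repository name, used only for illustration.
EXAMPLE_REPO = "example-org/example-repo"


def parse_log_pytest_sketch(log: str) -> Dict[str, str]:
    """Map test names to statuses from `pytest -rA` style summary lines.

    Hypothetical parser: it only recognizes lines that start with
    PASSED, FAILED, or ERROR followed by a test identifier.
    """
    statuses = ("PASSED", "FAILED", "ERROR")
    results: Dict[str, str] = {}
    for line in log.splitlines():
        parts = line.strip().split()
        # Summary lines look like "PASSED path/to/test.py::test_name".
        if len(parts) >= 2 and parts[0] in statuses:
            results[parts[1]] = parts[0]
    return results


# Assumed shape: repo -> version -> installation details.
MAP_VERSION_TO_INSTALL = {
    EXAMPLE_REPO: {
        "1.0": {
            "python": "3.9",                 # assumed field name
            "packages": "requirements.txt",  # assumed field name
            "install": "pip install -e .",   # assumed field name
        },
    },
}

# Assumed shape: repo -> command used to run its tests.
MAP_REPO_TO_TEST_FRAMEWORK = {EXAMPLE_REPO: "pytest -rA"}

# Assumed shape: repo -> function that turns a test log into results.
MAP_REPO_TO_PARSER: Dict[str, Callable[[str], Dict[str, str]]] = {
    EXAMPLE_REPO: parse_log_pytest_sketch,
}
```

Each new train-set repository would need one entry in all three maps, plus one install entry per version, which is where most of the manual effort goes.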

Is there a way of doing this besides manually going through each repo version and adding the relevant info?

Unfortunately, this is the one step of SWE-bench that is still heavily manual. Collection and validation can be automated, but defining the installation specifications is still a manual process at the moment. From a research angle, some teams have been working on this problem, but I'm not aware of any system or agent that can reliably define installation specifications for a repository automatically.
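Writing the specifications stays manual, but the amount of work can at least be scoped up front. Below is a minimal sketch that lists which (repo, version) pairs still lack an install entry. It assumes each train-set record is a dict with `repo` and `version` keys and that the install map is keyed by repo and then by version; both are assumptions, and the helper name `find_missing_specs` is hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def find_missing_specs(
    instances: Iterable[Dict[str, str]],
    install_map: Dict[str, Dict[str, dict]],
) -> List[Tuple[str, str]]:
    """Return sorted (repo, version) pairs that have no install entry."""
    missing = set()
    for inst in instances:
        # Assumed record keys: "repo" and "version".
        repo, version = inst["repo"], inst["version"]
        if version not in install_map.get(repo, {}):
            missing.add((repo, version))
    return sorted(missing)
```

Running this over the train set would give a checklist of pairs that still need a hand-written specification.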

Thanks again for the great questions! I'll mark this as completed for now, but do feel free to re-open this issue or create a new issue for any additional questions.

princeton-nlp / SWE-bench

Is it possible to evaluate the train set? #116