[NeurIPS 2024] SWE-agent takes a GitHub issue and tries to automatically fix it, using GPT-4, or your LM of choice. It can also be employed for offensive cybersecurity or competitive coding challenges.
First off, I'd like to say thank you so much for publishing SWE-bench and SWE-agent. I was wondering: is there anywhere that the logs from running the SWE-bench/SWE-agent evaluation are posted? I am working on some langchain scripts to categorize and group the bugs/features that are being used to evaluate the models/agents, and I'd like to dig into which issues failed or succeeded.
I noticed that on a previous issue @carlosejimenez provided links to generated results from Claude and the GPTs. Is there something similar available for the logs resulting from evaluation? Thanks so much in advance.
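For context, this is roughly the kind of grouping I'm after, as a minimal sketch in plain Python. The report file name and the per-instance `resolved` field are assumptions about what the evaluation logs might contain, not the actual SWE-bench output schema:

```python
import json
from collections import defaultdict

# Hypothetical evaluation report: a JSON object mapping each benchmark
# instance ID to a record with a boolean "resolved" flag. Both the file
# name and the field names are assumptions, not the real SWE-bench schema.
with open("evaluation_report.json") as f:
    report = json.load(f)

groups = defaultdict(list)
for instance_id, record in report.items():
    # Bucket each issue by whether the agent's patch resolved it.
    status = "resolved" if record.get("resolved") else "unresolved"
    groups[status].append(instance_id)

for status, ids in groups.items():
    print(f"{status}: {len(ids)} issues")
```

With the raw evaluation logs I could go a step further and categorize the failures themselves, not just count them.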