nus-apr / auto-code-rover

A project-structure-aware autonomous software engineer aiming for autonomous program improvement. Resolved 30.67% of tasks (pass@1) on SWE-bench lite and 38.40% of tasks (pass@1) on SWE-bench verified, with each task costing less than $0.7.

Question about Auto Code Rover SWE-bench data #17

Closed harrytormey closed 5 months ago

harrytormey commented 5 months ago

I am planning on writing an article on Auto Code Rover, and I was wondering if you could tell me about the format of the SWE-bench test results in: https://github.com/nus-apr/auto-code-rover/tree/main/results/swe-agent-results

How am I to interpret the results in this directory? For comparison, Devin formatted the diffs from their SWE-bench run into separate pass/fail directories: https://github.com/CognitionAI/devin-swebench-results/tree/main/output_diffs

How is this done for your results? Thanks in advance, and thanks for publishing your work.

-Harry

zhiyufan commented 5 months ago

There is a final_report.json file for each swe-agent replication. The "resolved" field in final_report.json lists the resolved task instances in SWE-bench lite. The .traj files record all actions taken by SWE-agent, along with the conversation history with GPT-4. At the end of a .traj file there is an "info" field, which contains the generated patch (in the form of a git diff) if one exists.

yuntongzhang commented 5 months ago

Closing this, @harrytormey feel free to let us know if you have more questions.