nus-apr / auto-code-rover

A project-structure-aware autonomous software engineer aiming for autonomous program improvement. Resolved 30.67% of tasks (pass@1) in SWE-bench Lite and 38.40% of tasks (pass@1) in SWE-bench Verified, with each task costing less than $0.7.

Evaluation of new models #36

Open elmoBG8 opened 5 months ago

elmoBG8 commented 5 months ago

Hello, I see you have added support for new models. Could you provide an evaluation of them on SWE-bench so that the results can be compared with the evaluations already done?

Thank you

qrdlgit commented 5 months ago

Yes, the entire engineering world is holding its breath :)

SWE-bench is probably the best eval out there right now, and this is one of the best ways to evaluate SWE-bench!