Claude 3 benchmarks - Githubissues

princeton-nlp / SWE-agent

[NeurIPS 2024] SWE-agent takes a GitHub issue and tries to automatically fix it, using GPT-4, or your LM of choice. It can also be employed for offensive cybersecurity or competitive coding challenges.

https://swe-agent.com

MIT License

13.67k stars, 1.38k forks

Claude 3 benchmarks #15

Closed: EwoutH closed this issue 7 months ago

EwoutH commented 7 months ago

It would be very useful to have some Claude 3 benchmarks using SWE-agent.

Since Claude 3 Opus performs better than GPT-4 using RAG, there’s a fair chance that will also be the case when using SWE-agent, right?

I'm also really curious how Claude 3 Sonnet and Haiku perform, since they are not that far behind Opus (and way cheaper).

ofirpress commented 7 months ago

We have some initial numbers on swebench.com. See https://twitter.com/OfirPress/status/1775226081575915661

We'll have more soon.