Closed EwoutH closed 7 months ago
It would be very useful to have some Claude 3 benchmarks using SWE-Agent.
Since Claude 3 Opus performs better than GPT-4 using RAG, there’s a fair chance that will also be the case when using SWE-Agent, right?
I'm also really curious how Claude 3 Sonnet and Haiku perform, since they are not that far behind Opus (and way cheaper).
We have some initial numbers on swebench.com. See https://twitter.com/OfirPress/status/1775226081575915661
We'll have more soon.