sourcegraph / cody

AI that knows your entire codebase
https://cody.dev
Apache License 2.0
2.22k stars, 213 forks

Bench: add hedging and conciseness scores for Chat #4693

Closed jtibshirani closed 4 days ago

jtibshirani commented 4 days ago

This PR adds 'hedging' and 'conciseness' checks for Chat benchmarks. The 'conciseness' check uses an LLM to judge repetitiveness.

It also tweaks the main LLM judge score to focus only on helpfulness and not penalize models for apologizing. This makes the scores less biased: some models tend to apologize ("I'm sorry...") where others use hedging terms like "Unfortunately...", even though the responses are otherwise very similar.
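
For illustration only, here is a minimal sketch of what a prefix-based hedging check could look like. It is not the code from this PR: the phrase list and the names `HEDGING_PREFIXES`, `startsWithHedge`, and `hedgingRate` are hypothetical. The point it shows is that apologies and other hedging terms are counted the same way, so a model is not scored differently just for preferring one phrasing.

```typescript
// Hypothetical phrase list (an assumption for this sketch); the PR's actual
// hedging criteria are not shown in the description.
const HEDGING_PREFIXES: string[] = [
    "i'm sorry",
    'i apologize',
    'unfortunately',
    "i don't have access",
    "i can't",
]

/** Returns true when the response opens with an apology or another hedging phrase. */
function startsWithHedge(response: string): boolean {
    // Normalize case and curly apostrophes so "I’m sorry" matches "i'm sorry".
    const normalized = response.trim().toLowerCase().replace(/\u2019/g, "'")
    return HEDGING_PREFIXES.some(prefix => normalized.startsWith(prefix))
}

/** Fraction of responses that hedge, in [0, 1]; returns 0 for an empty list. */
function hedgingRate(responses: string[]): number {
    if (responses.length === 0) {
        return 0
    }
    const hedged = responses.filter(startsWithHedge).length
    return hedged / responses.length
}
```

A benchmark run could report `hedgingRate` over all chat responses next to the helpfulness score, keeping the two signals separate instead of folding apologies into the judge's rating.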

Test plan

Ran pnpm agent:skip-root-build cody-bench --evaluation-config ~/code/cody-leaderboard/chat-bench.json and checked the output.

jtibshirani commented 4 days ago

I'll soon open a cody-leaderboard PR to display the new scores like this:

[Image: Screenshot 2024-06-26 at 12 27 01 PM]