This PR adds 'hedging' and 'conciseness' checks for Chat benchmarks. The 'conciseness' check uses an LLM to judge repetitiveness.
It also narrows the main LLM judge score to focus on helpfulness alone, so models are no longer penalized for apologizing. This reduces bias: some models tend to apologize ("I'm sorry...") while others use hedging terms like "Unfortunately...", even though the responses are otherwise very similar.
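A hedging check of this kind can be sketched as a simple pattern scan over the model's response. This is a hypothetical illustration only; the pattern list and function names below are assumptions, not the PR's actual implementation:

```typescript
// Hypothetical hedging phrases — the real check in this PR may use a
// different list or an LLM judge instead of regexes.
const HEDGING_PATTERNS: RegExp[] = [
    /\bI'?m sorry\b/i,
    /\bunfortunately\b/i,
    /\bI cannot\b/i,
    /\bas an AI\b/i,
]

/** Count how many distinct hedging patterns appear in a chat response. */
function countHedges(response: string): number {
    return HEDGING_PATTERNS.filter(p => p.test(response)).length
}
```

A check like this could then flag responses whose hedge count exceeds some threshold, while the separate helpfulness score ignores these phrases entirely.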
Test plan
Ran `pnpm agent:skip-root-build cody-bench --evaluation-config ~/code/cody-leaderboard/chat-bench.json` and checked the output.