This PR adds 'hedging' and 'conciseness' checks for Chat benchmarks. The 'conciseness' check uses an LLM to judge repetitiveness.
It also narrows the main LLM judge score to focus on helpfulness alone, so models are no longer penalized for apologizing. This reduces bias: some models tend to apologize ("I'm sorry...") while others use hedging terms like "Unfortunately...", even though the responses are otherwise very similar.
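A hedging check of this kind can be sketched as a simple pattern scan over the model's response. This is a hypothetical illustration only; the pattern list and function names below are assumptions, not the PR's actual implementation:

```typescript
// Hypothetical hedging phrases — the real check in this PR may use a
// different list or an LLM judge instead of regexes.
const HEDGING_PATTERNS: RegExp[] = [
    /\bI'?m sorry\b/i,
    /\bunfortunately\b/i,
    /\bI cannot\b/i,
    /\bas an AI\b/i,
]

/** Count how many distinct hedging patterns appear in a chat response. */
function countHedges(response: string): number {
    return HEDGING_PATTERNS.filter(p => p.test(response)).length
}
```

A check like this could then flag responses whose hedge count exceeds some threshold, while the separate helpfulness score ignores these phrases entirely.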
Test plan
Ran `pnpm agent:skip-root-build cody-bench --evaluation-config ~/code/cody-leaderboard/chat-bench.json` and checked the output.