Closed (abeatrix closed 2 days ago)
I like how the preamble is still simple and to-the-point! I think we need to apply it only to the Claude 3.5 models though. I dug through the leaderboard results, and noticed significantly more hallucination from all other models: https://github.com/sourcegraph/cody-leaderboard/pull/12#issuecomment-2203937606.
I think of this addition as a "confidence preamble" :) We should apply it only to models that demonstrate excessive hedging behavior, like Claude 3.5 (and previously Claude 2.1).
> apply it only to the Claude 3.5 models though
Ahh good point! Let me make that change, thanks for the suggestion @jtibshirani !
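For reference, gating the preamble by model could look roughly like the sketch below. This is only an illustration, not the actual change in this PR: the function names, the model ID patterns, and both preamble strings are hypothetical placeholders, and the real model IDs and preamble text may differ.

```typescript
// Hypothetical sketch: none of these names or strings come from the Cody codebase.

// Placeholder text, not the actual preamble wording from this PR.
const BASE_PREAMBLE = 'You are a coding assistant.'
const CONFIDENCE_PREAMBLE = 'Answer directly using the provided context.'

// Assumed substrings identifying models that hedge excessively.
// Only Claude 3.5 is listed here; other models could be added if evals show hedging.
const HEDGING_MODEL_PATTERNS: readonly string[] = ['claude-3-5', 'claude-3.5']

// Returns true when the model ID matches one of the assumed hedging-prone patterns.
function needsConfidencePreamble(modelID: string): boolean {
    const id = modelID.toLowerCase()
    return HEDGING_MODEL_PATTERNS.some(pattern => id.includes(pattern))
}

// Appends the confidence preamble only for hedging-prone models,
// so other models keep the unmodified base preamble.
function buildPreamble(modelID: string): string {
    return needsConfidencePreamble(modelID)
        ? `${BASE_PREAMBLE}\n\n${CONFIDENCE_PREAMBLE}`
        : BASE_PREAMBLE
}
```

Keeping the check in one small predicate means the list of affected models can be adjusted as new leaderboard results come in, without touching the preamble text itself.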
PART OF https://linear.app/sourcegraph/issue/CODY-2507/evaluate-claude-35-sonnet-as-new-default
Update the preamble so that it works across different models while minimizing hedging.
Test plan
Updated leaderboard locally with results from this change in https://github.com/sourcegraph/cody-leaderboard/pull/12
With the updated preamble, hedging occurrences have decreased (I was having issues running the evals, so I removed the Gemini models to make the run quicker):
The one question that failed looks like a false positive to me:
Before