Bench: add LLM judge for response scoring to chat strategy

abeatrix commented 5 days ago

Add llmJudgeChatTemplate to generate a prompt for evaluating LLM responses
Integrate LlmJudge into the evaluateChatStrategy to score each chat response
Store the numeric score in the EvaluationDocument for each chat response
Log the total score for the fixture at the end of the evaluation
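
For reviewers who want a picture of the flow described above, here is a minimal sketch of how a judge prompt, per-response score, and fixture total could fit together. This is not the code in this PR: `buildJudgePrompt`, `parseJudgeScore`, `scoreFixture`, the `JudgeParams` and `ScoredDocument` shapes, the `SCORE: <n>` reply format, and the 0 to 10 scale are all assumptions made for illustration. The judge itself is passed in as a plain function so the sketch runs without any LLM client.

```typescript
// Hypothetical shapes for illustration only; the real types in the PR may differ.
interface JudgeParams {
    question: string
    response: string
    expected: string
}

interface ScoredDocument {
    question: string
    response: string
    // undefined when the judge reply could not be parsed
    score: number | undefined
}

// Any function that sends a prompt to a model and returns its text reply.
type Judge = (prompt: string) => Promise<string>

// Builds the grading prompt (plays the role the PR assigns to llmJudgeChatTemplate).
function buildJudgePrompt(params: JudgeParams): string {
    return [
        'You are grading an answer from a coding assistant.',
        `Question: ${params.question}`,
        `Expected answer: ${params.expected}`,
        `Actual answer: ${params.response}`,
        'Reply with "SCORE: <n>" where n is an integer from 0 to 10.',
    ].join('\n')
}

// Extracts the numeric score from the judge reply.
// Returns undefined when no score is present or it falls outside 0..10.
function parseJudgeScore(reply: string): number | undefined {
    const match = reply.match(/SCORE:\s*(\d+)/i)
    if (!match) {
        return undefined
    }
    const score = Number.parseInt(match[1] ?? '', 10)
    return score >= 0 && score <= 10 ? score : undefined
}

// Scores every chat response in a fixture, keeps the per-response score,
// and accumulates the total. Unparseable replies count as 0 toward the total.
async function scoreFixture(
    cases: JudgeParams[],
    judge: Judge
): Promise<{ documents: ScoredDocument[]; total: number }> {
    const documents: ScoredDocument[] = []
    let total = 0
    for (const c of cases) {
        const reply = await judge(buildJudgePrompt(c))
        const score = parseJudgeScore(reply)
        documents.push({ question: c.question, response: c.response, score })
        total += score ?? 0
    }
    return { documents, total }
}
```

Keeping the parse step separate from the model call makes it easy to test the scoring logic with a stubbed judge, and an `undefined` score distinguishes "judge reply was malformed" from a genuine score of 0.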

There is still follow-up work we can do, including improving the prompt used for the LLM judge, but it currently works as intended and is IMO a good starting point for us to build from.

Test plan

Currently being used by https://github.com/sourcegraph/cody-leaderboard/pull/8

github-actions[bot] commented 5 days ago

‼️ Hey @sourcegraph/cody-security, please review this PR carefully as it introduces the usage of an unsafe_ function or abuses PromptString.

abeatrix commented 4 days ago

As discussed with @jtibshirani, I am running into issues re-running the tests, so we will merge what we have now and update the prompt and scoring in follow-up PRs.

sourcegraph / cody

Bench: add LLM judge for response scoring to chat strategy #4678
