sourcegraph / cody

Type less, code more: Cody is an AI code assistant that uses advanced search and codebase context to help you write and fix code.
https://cody.dev
Apache License 2.0

End-to-end Cody quality evaluation #537

Closed novoselrok closed 9 months ago

novoselrok commented 1 year ago

We need a way to evaluate the effect of our changes to Cody. This project aims to collect the various questions and queries we are using to test Cody's responses manually and consolidate them in a test suite.

Here is how we would define a test case:

addTestCase('Sourcegraph frontend feature flags', {
    codebase: 'github.com/sourcegraph/sourcegraph',
    context: 'embeddings',
    transcript: [
        {
            question: 'How are feature flags used in sourcegraph frontend?',
            facts: ['useFeatureFlag', 'FeatureFlagName', 'featureFlags.ts', '/site-admin/feature-flags'],
            answerSummary: 'Feature flags allow developers to ship new features that are hidden behind a flag.',
        },
        {
            question: 'How can I add a new one?',
            facts: ['FeatureFlagName'],
            answerSummary: 'Add a new case to the `FeatureFlagName` union type.',
        },
    ],
})

A test case consists of the following (see the example above):

- `codebase`: the repository the questions are asked against.
- `context`: the context source used to answer the questions (e.g. embeddings).
- `transcript`: a sequence of questions, each with the `facts` (identifiers, file names, or paths) the answer should mention and an `answerSummary` describing the expected answer.

No single signal (fact checks, hallucination detection, or an LLM judge) is enough on its own to verify the quality of Cody's answers. Hopefully, by combining them, we will get a clearer picture.
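A minimal sketch of what the per-turn fact check could look like. This is illustrative only (the suite's actual API may differ): `checkFacts` and `TurnResult` are hypothetical names, and the substring match is the simplest possible implementation.

```typescript
// Hypothetical sketch: score one transcript turn by checking which of the
// expected facts appear verbatim in Cody's answer. A real implementation
// might normalize whitespace or match on tokens instead of substrings.

interface TurnResult {
    present: string[] // facts found in the answer
    missing: string[] // facts the answer failed to mention
    recall: number    // fraction of expected facts present
}

function checkFacts(answer: string, facts: string[]): TurnResult {
    const present = facts.filter(fact => answer.includes(fact))
    const missing = facts.filter(fact => !answer.includes(fact))
    return {
        present,
        missing,
        recall: facts.length === 0 ? 1 : present.length / facts.length,
    }
}

// Usage: evaluate a hypothetical answer against the facts from the first
// turn of the example test case above.
const result = checkFacts(
    'Feature flags are defined in featureFlags.ts via the FeatureFlagName union type.',
    ['useFeatureFlag', 'FeatureFlagName', 'featureFlags.ts', '/site-admin/feature-flags'],
)
// result.recall === 0.5 (2 of the 4 expected facts are present)
```

Fact recall like this catches missing information but not fabricated information, which is why the hallucination detection and LLM-judge signals are needed alongside it.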

Tasks

Phase 1 (Foundations, CLI)

Phase 2 (Scale)

Phase 3 (Nice-to-have)

novoselrok commented 1 year ago

Phase 1 done in https://github.com/sourcegraph/cody/pull/590

github-actions[bot] commented 10 months ago

This issue is marked as stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed automatically in 5 days.