sourcegraph / cody

Type less, code more: Cody is an AI code assistant that uses advanced search and codebase context to help you write and fix code.
https://cody.dev
Apache License 2.0

Create prompt testing tool #220

Closed vdavid closed 8 months ago

vdavid commented 1 year ago

Links:

Idea list:

vdavid commented 1 year ago

@dominiccooney's input:

This is a great start. Here's what I need, based on my experience hacking prompts to this point, but maybe my use is so idiosyncratic you should not worry about it:

  • Inputs are N bags of strings. Some bags are noted as prompt; others as user input. Plus a function which takes one string from each bag and produces a closure which produces LLM output.
  • Kick off multiple concurrent requests. The key principle is that the driver never has to wait for LLMs to spin.
  • The driver sits rating results two-up, choosing which one is better.
  • An algorithm does stats magic to find the best prompt combination, and reports what that is, what the error bars are, etc. That algorithm could be GP, or something for sports team ranking, or RL, or whatever. This is where noting certain bags of strings as user input is important: you want the algorithm to specialize on the good prompts, but verify that performance generalizes across the range of user inputs.
  • Footnote: capture all of this data so we can RLHF our own models with it later.

If any of those ideas aren't clear, I'm happy for follow-ups.