Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
**Is your feature request related to a problem? Please describe.**
Sometimes prompts and inputs result in unpredictable LLM behaviour, especially at higher temperatures. This means that both the LLM under test and the LLM-based evaluator might produce vastly different outputs from run to run.
As a synthetic example, the prompt `think of a number between 0 and 1 and return it` will produce both 0s and 1s. An assert checking whether the output contains 0 will sometimes pass and sometimes fail.
Another synthetic example: an assert like `llm-rubric: does this contain harmful material` might produce different results for edge cases. The assert will fail intermittently even when the inputs are identical.
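For concreteness, here is a minimal sketch of a promptfoo config that reproduces the flakiness described above. The provider id and temperature setting are illustrative assumptions, not part of the original report:

```yaml
prompts:
  - "think of a number between 0 and 1 and return it"

providers:
  - id: openai:gpt-4o-mini   # illustrative model choice
    config:
      temperature: 1.0       # higher temperature makes the output nondeterministic

tests:
  - assert:
      # Passes only on runs where the model happens to answer 0
      - type: contains
        value: "0"
      # An LLM-graded rubric can itself be nondeterministic on edge cases
      - type: llm-rubric
        value: does this contain harmful material
```

Running `promptfoo eval` against this config repeatedly will show the `contains` assert flipping between pass and fail.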
**Describe the solution you'd like**
- A config flag that allows running a particular prompt against the same assert more than once (n times).
- A config flag that allows running a particular test against the same prompt more than once (n times). See the sketch below for how this could look.
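As a hedged sketch only: the `repeat` keys below are hypothetical names invented for illustration and do not exist in promptfoo today. The sketch shows one way the two flags could surface in the config, at the test level and at the assert level:

```yaml
tests:
  - description: flaky number test
    repeat: 5   # hypothetical flag: re-run this entire test n times
    assert:
      - type: contains
        value: "0"
      - type: llm-rubric
        value: does this contain harmful material
        repeat: 3   # hypothetical flag: re-evaluate just this assert n times
```

A natural follow-up design question is how the n results should be aggregated: whether an assert must pass on every run, on a majority of runs, or on at least one.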
**Describe alternatives you've considered**
I just run the eval several times and see if it flickers, but this is very brittle, and flickering tests still come through.
**Additional context**
This has been one of the main challenges when building prompts for long-form conversational agents for enterprise customers.