Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
**Is your feature request related to a problem? Please describe.**
Sometimes prompts and inputs result in unpredictable LLM behaviour, especially at higher temperatures. This means that both the LLM under test and the LLM-based evaluator might produce vastly different outputs from run to run.
As a synthetic example, the prompt `think of a number between 0 and 1 and return it` will produce both 0s and 1s. An assert checking whether the output contains 0 will sometimes pass and sometimes fail.
Another synthetic example: an assert like `llm-rubric: does this contain harmful material` might produce different results for edge cases. The assert will fail intermittently even when the inputs are identical.
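For concreteness, here is a minimal sketch of a promptfoo config that reproduces the flakiness described above. The provider id and temperature setting are illustrative assumptions, not part of the original report:

```yaml
prompts:
  - "think of a number between 0 and 1 and return it"

providers:
  - id: openai:gpt-4o-mini   # illustrative model choice
    config:
      temperature: 1.0       # higher temperature makes the output nondeterministic

tests:
  - assert:
      # Passes only on runs where the model happens to answer 0
      - type: contains
        value: "0"
      # An LLM-graded rubric can itself be nondeterministic on edge cases
      - type: llm-rubric
        value: does this contain harmful material
```

Running `promptfoo eval` against this config repeatedly will show the `contains` assert flipping between pass and fail.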
**Describe the solution you'd like**
- A config flag that allows running a particular prompt against the same assert more than once (n times).
- A config flag that allows running a particular test against the same prompt more than once (n times). See the sketch below for how this could look.
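As a hedged sketch only: the `repeat` keys below are hypothetical names invented for illustration and do not exist in promptfoo today. The sketch shows one way the two flags could surface in the config, at the test level and at the assert level:

```yaml
tests:
  - description: flaky number test
    repeat: 5   # hypothetical flag: re-run this entire test n times
    assert:
      - type: contains
        value: "0"
      - type: llm-rubric
        value: does this contain harmful material
        repeat: 3   # hypothetical flag: re-evaluate just this assert n times
```

A natural follow-up design question is how the n results should be aggregated: whether an assert must pass on every run, on a majority of runs, or on at least one.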
**Describe alternatives you've considered**
I just run the eval several times and see if it flickers, but this is very brittle, and flickering tests still come through.
**Additional context**
This has been one of the main challenges when building prompts for long-form conversational agents for enterprise customers.