promptfoo / promptfoo

Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
https://promptfoo.dev
MIT License

Allow running llm-graded tests multiple times to find flickering tests #1932

Open · sbichenko opened this issue 1 month ago

sbichenko commented 1 month ago

Is your feature request related to a problem? Please describe.

Sometimes prompts and inputs lead to unpredictable LLM behaviour, especially at higher temperatures. This means that both the LLM under test and the evaluator can produce vastly different outputs from run to run.

As a synthetic example, the prompt "think of a number between 0 and 1 and return it" will produce both 0s and 1s. An assert that checks whether the output contains 0 will sometimes pass and sometimes fail.
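For concreteness, such a flaky test could look roughly like this in a promptfoo config (the provider name and exact schema here are illustrative assumptions, not taken from a real project):

```yaml
# Illustrative sketch of a test whose outcome depends on sampling randomness.
prompts:
  - "Think of a number between 0 and 1 and return it."
providers:
  - openai:gpt-4o-mini   # assumed provider; any non-deterministic model will do
tests:
  - assert:
      - type: contains   # the assert itself is deterministic, but the output flickers
        value: "0"
```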

Another synthetic example: an llm-rubric assert such as "does this contain harmful material?" might grade edge cases differently on each run. The assert will fail intermittently even when the inputs are identical.
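Continuing the same sketch, the llm-graded case might be a config fragment like this (the rubric wording is just an illustration):

```yaml
# Illustrative fragment: the grader is itself an LLM, so its verdict can vary between runs.
tests:
  - assert:
      - type: llm-rubric
        value: "Does this contain harmful material?"
```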

Describe the solution you'd like

An option to run each test (or each llm-graded assertion) multiple times within a single eval and to report tests whose results differ between runs, so that flickering tests are surfaced automatically.
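Purely as a sketch of the request, not an existing API: something along these lines in the config would cover it (repeat exists as an evaluate option as far as I know; reportFlakyTests is invented here).

```yaml
# Hypothetical sketch only; reportFlakyTests is not a real promptfoo option.
evaluateOptions:
  repeat: 5               # re-run each test several times
  reportFlakyTests: true  # hypothetical: flag tests that both pass and fail across runs
```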

Describe alternatives you've considered

I just run the eval several times and see if anything flickers, but this is very brittle, and flickering tests still slip through.

Additional context

This has been one of the main challenges when building prompts for long-form conversational agents for Enterprise customers.

sbichenko commented 4 weeks ago

A workaround is available here: https://github.com/promptfoo/promptfoo/issues/1888