promptfoo / promptfoo

Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
https://promptfoo.dev

Silent failure: rubricPrompt not interpolating variables, or `value` being ignored #1866

Open · strentom opened this issue 3 days ago

strentom commented 3 days ago

Describe the bug Either the rubricPrompt is ignored or its variables are not interpolated; I can't tell which, because the logs (even with --verbose) are insufficient.

To reduce the likelihood that the error is mine, I searched existing issues and copied the rubricPrompt from a resolved issue (#823). Running eval on the same YAML (the only difference is the provider), it silently fails (does not produce the expected result), and from the LLM output I hypothesize that the model never received the full prompt. This can be reproduced even with simpler prompts (see the minimal sketch below), but I wanted to be sure the YAML is 100% correct.
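
A minimal sketch of such a simpler repro (hypothetical prompt and rubric text, chosen only for illustration; same structure as the full config below):

prompts: "Summarize in one sentence: {{ input }}"
providers: vertex:gemini-1.5-flash
defaultTest:
  options:
    rubricPrompt:
      - role: user
        content: >-
          Rate this one-sentence summary of "{{ input }}" on a scale of 0 to 10: {{ output }}
          Output your response in the following JSON format: {pass: bool, score: number, reason: string}
tests:
  - vars:
      input: "Happy families are all alike. And every unhappy family is unhappy in its own way."
    assert:
      - type: llm-rubric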

To Reproduce Take this YAML from #823 (only the provider is modified):

prompts: >-
  [{
    "role": "system",
    "content": "Translate text into {{ language }}."
  }, {
    "role": "user",
    "content": "{{ input }}"
  }]
providers: vertex:gemini-1.5-flash
defaultTest:
  options:
    rubricPrompt:
      - role: system
        content: >-
          Evaluate the quality of the translation provided by an AI assistant to the user input displayed below.
          Score the response on a scale of 0 to 10.
          Output your response in the following JSON format: {pass: bool, score: number, reason: string}
      - role: user
        content: >-
          [Input Start]
          {{ input }}
          [Input End]

          [Translation Start]
          {{ output }}
          [Translation End]

tests:
  - vars:
      input: "Happy families are all alike. \nAnd every unhappy family is unhappy in its own way."
      language: "Chinese"
    assert:
      - type: llm-rubric

and run `promptfoo eval` on it.
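
(For context: an `llm-rubric` assertion usually carries its grading criteria inline in a `value` field; the repro above omits it and relies on the shared rubricPrompt alone. A hypothetical variant with an inline rubric would look like:)

tests:
  - vars:
      input: "Happy families are all alike. \nAnd every unhappy family is unhappy in its own way."
      language: "Chinese"
    assert:
      - type: llm-rubric
        value: "Is an accurate and fluent Chinese translation of the input"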

Expected behavior The rubricPrompt should rate the translation quality. Instead, the grader responds: {"pass":false,"reason":"No rubric was provided","score":0,"tokensUsed":{"total":201,"prompt":179,"completion":22,"cached":0}}

Screenshots

Reading prompts from ["[{\n  \"role\": \"system\",\n  \"content\": \"Translate text into {{ language }}.\"\n}, {\n  \"role\": \"user\",\n  \"content\": \"{{ input }}\"\n}]"]
Inserting prompt 3675bb3253267004907c89fcb4eaa51cbc0f6e2d963b914b11fa7e1497969feb
Inserting dataset 14998238815e9383d52ac592abda36278c157fa1fe9ce420837809599f62d532
Coerced JSON prompt to Gemini format: [{"role":"system","content":"Translate text into Chinese."},{"role":"user","content":"Happy families are all alike. \nAnd every unhappy family is unhappy in its own way."}]
Preparing to call Google Vertex API (Gemini) with body: {"contents":{"role":"user","parts":{"text":"Translate text into Chinese.Happy families are all alike. \nAnd every unhappy family is unhappy in its own way."}},"generationConfig":{}}
Gemini API response: [{"candidates":[{"content":{"role":"model","parts":[{"text":"幸福"}]}}],"modelVersion":"gemini-1.5-flash"},{"candidates":[{"content":{"role":"model","parts":[{"text":"的家庭都一样。\n不幸的家庭各有各的不幸。 \n"}]},"safetyRatings":[{"category":"HARM_CATEGORY_HATE_SPEECH","probability":"NEGLIGIBLE","probabilityScore":0.17480469,"severity":"HARM_SEVERITY_NEGLIGIBLE","severityScore":0.083984375},{"category":"HARM_CATEGORY_DANGEROUS_CONTENT","probability":"NEGLIGIBLE","probabilityScore":0.040283203,"severity":"HARM_SEVERITY_NEGLIGIBLE","severityScore":0.02368164},{"category":"HARM_CATEGORY_HARASSMENT","probability":"NEGLIGIBLE","probabilityScore":0.25585938,"severity":"HARM_SEVERITY_NEGLIGIBLE","severityScore":0.111328125},{"category":"HARM_CATEGORY_SEXUALLY_EXPLICIT","probability":"NEGLIGIBLE","probabilityScore":0.087402344,"severity":"HARM_SEVERITY_NEGLIGIBLE","severityScore":0.04272461}]}],"modelVersion":"gemini-1.5-flash"},{"candidates":[{"content":{"role":"model","parts":[{"text":""}]},"avgLogprobs":"NaN"}],"modelVersion":"gemini-1.5-flash"},{"candidates":[{"content":{"role":"model","parts":[{"text":""}]},"finishReason":"STOP"}],"usageMetadata":{"promptTokenCount":24,"candidatesTokenCount":17,"totalTokenCount":41},"modelVersion":"gemini-1.5-flash"}]
Performing remote grading: {"task":"llm-rubric","rubric":"","output":"幸福的家庭都一样。\n不幸的家庭各有各的不幸。 \n","vars":{"input":"Happy families are all alike. \nAnd every unhappy family is unhappy in its own way.","language":"Chinese"}}
Got remote grading result: {"pass":false,"reason":"No rubric was provided","score":0,"tokensUsed":{"total":201,"prompt":179,"completion":22,"cached":0}}
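
Note the "rubric":"" in the remote grading request above: neither the rubricPrompt nor a `value` reached the grader. Presumably a correct request would carry the rendered rubric, roughly like this (illustrative only, with placeholder text; not actual promptfoo output):

{"task":"llm-rubric","rubric":"<rendered rubricPrompt with {{ input }} and {{ output }} substituted>","output":"幸福的家庭都一样。\n不幸的家庭各有各的不幸。 \n","vars":{"input":"Happy families are all alike. \nAnd every unhappy family is unhappy in its own way.","language":"Chinese"}}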

System information:

typpo commented 8 hours ago

Hi @strentom, definitely a bug. Just clarifying: are you running into this error with a redteam config? Based on the logs, you are hitting a redteam codepath, which is different from the example provided and from the behavior when I run it locally.

strentom commented 8 hours ago

Hi @typpo. I'm running `promptfoo eval -c script_above.yaml --verbose`. I have other files and configs in the folder (incl. redteam-related ones), but these should be ignored.

typpo commented 7 hours ago

Got it, this is helpful, thanks. #1877 should fix the immediate issue. There is a separate issue of your redteam config being picked up when it shouldn't be.