promptfoo / promptfoo

Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
https://promptfoo.dev
MIT License

Images are passed as text to Anthropic & Bedrock #1750

Open jlawman opened 4 days ago

jlawman commented 4 days ago

Describe the bug: When running evals on images with Anthropic or Bedrock, the encoded image (base64 string) is passed to Claude as text instead of as an image block.

To Reproduce: Run either of the Claude vision examples.

The easiest way to reproduce is to use the claude-vision example and replace the base64 string with that of a different image: https://github.com/promptfoo/promptfoo/blob/main/examples/claude-vision/prompt.json

The current example appears to work because Claude can actually decode the base64 when it arrives as plain text. Below is an example chat window where I asked Claude 3.5 Sonnet to read the base64 string and give a description (it gives the same description as running promptfoo eval with the claude-vision example).

Screenshot 2024-09-24 at 11 25 04

However, images larger than the small example image (e.g. 200 KB in size) don't work, either because of Claude's text token limits or because Claude is not designed to read base64 as text.
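
For a rough sense of scale, here is a back-of-the-envelope sketch (not something measured in promptfoo). Base64 encodes every 3 bytes as 4 characters, so the text version of an image is about a third larger than the file itself. The helper name `base64Length` is made up for illustration, and "200 KB" is assumed here to mean 204,800 bytes.

```typescript
// Hypothetical helper: length in characters of the padded base64
// encoding of `byteCount` bytes (4 output characters per 3 input bytes).
function base64Length(byteCount: number): number {
  return Math.ceil(byteCount / 3) * 4;
}

// Assumption: a "200 KB" image is 200 * 1024 = 204,800 bytes.
const imageBytes = 200 * 1024;
const encodedChars = base64Length(imageBytes);
console.log(`${imageBytes} bytes -> ${encodedChars} base64 characters`);
```

Under that assumption the image becomes a string of roughly 273,000 characters, all of which would be sent as ordinary prompt text when the block is not recognized as an image.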

Expected behavior: Content blocks labeled "image" with source type "base64" should be passed as blocks of type "image" (as opposed to the current behavior, where they appear to be parsed as regular message blocks of type "text").

Screenshots: Example comparison of processing the string as text vs. as an image block.

1) Image of juggling balls processed as text in Anthropic console: [Screenshot 2024-09-24 at 11 29 56]

2) Image of juggling balls processed in promptfoo eval: "The image appears to be a photograph of a person's face. The image shows a close-up view of a person's face, with their eyes, nose, and mouth visible."

3) Image of juggling balls processed as image block in Anthropic console (similar results possible with notebooks using non-text content blocks): [Screenshot 2024-09-24 at 11 30 56]

System information:

somogyijanos commented 4 days ago

I've run the example in https://github.com/promptfoo/promptfoo/tree/main/examples/claude-vision in verbose mode with promptfoo eval --verbose and checked the logs. Something is messed up, I think. Instead of the messages part (as defined by https://github.com/promptfoo/promptfoo/blob/main/examples/claude-vision/prompt.json) being nested directly into the request sent to the Anthropic API, the whole messages part is treated as text input and nested as a single text block inside the request's messages part. Here are the JSON objects from the logs, taken from the lines that begin with Calling Anthropic Messages API: <json object>:

What I'd expect:

{
    "model": "claude-3-haiku-20240307",
    "max_tokens": 1024,
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What’s in this image?"
                },
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/jpeg",
                        "data": "/9j/4AAQSkZJRgABAQEBLAEsAAD/4QBcRXhpZgAATU0AKgAAAAgAAYdpAAQAAA[...]Q=="  // shortened base64 string for better readability
                    }
                }
            ]
        }
    ],
    "stream": "false",
    "temperature": 0
}

Actual output:

{
    "model": "claude-3-haiku-20240307",
    "max_tokens": 1024,
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "[\n{\n\"role\": \"user\",\n\"content\": [\n{\n\"type\": \"text\",\n\"text\": \"What’s in this image?\"\n},\n{\n\"type\": \"image\",\n\"source\": {\n\"type\": \"base64\",\n\"media_type\": \"image/jpeg\",\n\"data\": \"/9j/4AAQSkZJRgABAQEBLAEsAAD/4QBcRXhpZgAATU0AKgAAAAgAAYdpAAQAAA[...]Q==\"\n}\n}\n]\n}\n]"  // shortened base64 string for better readability
                }
            ]
        }
    ],
    "stream": "false",
    "temperature": 0
}
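
To make the difference between the two requests concrete, here is a minimal sketch of the behavior I'd expect: if the rendered prompt is a JSON array of chat messages, use it as the messages array; otherwise wrap it in a single text block. This is not promptfoo's actual code. The function names `toMessages` and `isMessage` and the simplified types are assumptions made up for illustration, and only a small subset of the Anthropic content-block shapes is modeled.

```typescript
// Simplified, hypothetical types covering only the blocks seen in this issue.
type TextBlock = { type: "text"; text: string };
type ImageBlock = {
  type: "image";
  source: { type: "base64"; media_type: string; data: string };
};
type ContentBlock = TextBlock | ImageBlock;
type Message = { role: "user" | "assistant"; content: string | ContentBlock[] };

// Loose structural check: a role plus string or array content.
function isMessage(value: unknown): value is Message {
  if (typeof value !== "object" || value === null) return false;
  const m = value as Record<string, unknown>;
  return (
    (m.role === "user" || m.role === "assistant") &&
    (typeof m.content === "string" || Array.isArray(m.content))
  );
}

// Hypothetical helper: turn a rendered prompt string into a messages array.
function toMessages(prompt: string): Message[] {
  const trimmed = prompt.trim();
  if (trimmed.startsWith("[")) {
    try {
      const parsed: unknown = JSON.parse(trimmed);
      if (Array.isArray(parsed) && parsed.length > 0 && parsed.every(isMessage)) {
        // Already a chat-style messages array: pass it through so that
        // image blocks stay image blocks.
        return parsed;
      }
    } catch (_err) {
      // Not valid JSON: fall through and treat the prompt as plain text.
    }
  }
  // Fallback: wrap the whole prompt in one text block. Applying this branch
  // to a JSON prompt is what the "Actual output" above looks like.
  return [{ role: "user", content: [{ type: "text", text: prompt }] }];
}
```

With a prompt file like the one in the claude-vision example, the first branch would keep the image as a block of type "image", while a plain string prompt would still go out as a single text block.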

mldangelo commented 4 days ago

Hi @jlawman and @somogyijanos, thank you both for the detailed reports and for taking the time to investigate this issue! I’ll look into what’s going wrong with the handling of image content blocks and keep you updated on the progress.