jlawman opened 4 days ago
I've run the example in https://github.com/promptfoo/promptfoo/tree/main/examples/claude-vision in verbose mode with `promptfoo eval --verbose` and checked the logs. Something looks wrong: instead of nesting the `messages` part (as defined by https://github.com/promptfoo/promptfoo/blob/main/examples/claude-vision/prompt.json) into the request that is sent to the Anthropic API, the entire `messages` array is treated as plain text input and nested as such into the request's `messages` part. Here are the JSON bodies from the log lines that begin with `Calling Anthropic Messages API: <json object>`:
What I'd expect:
{
"model": "claude-3-haiku-20240307",
"max_tokens": 1024,
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What’s in this image?"
},
{
"type": "image",
"source": {
"type": "base64",
"media_type": "image/jpeg",
"data": "/9j/4AAQSkZJRgABAQEBLAEsAAD/4QBcRXhpZgAATU0AKgAAAAgAAYdpAAQAAA[...]Q==" // shortened base64 string for better readablity
}
}
]
}
],
"stream": "false",
"temperature": 0
}
Actual output:
{
"model": "claude-3-haiku-20240307",
"max_tokens": 1024,
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "[\n{\n\"role\": \"user\",\n\"content\": [\n{\n\"type\": \"text\",\n\"text\": \"What’s in this image?\"\n},\n{\n\"type\": \"image\",\n\"source\": {\n\"type\": \"base64\",\n\"media_type\": \"image/jpeg\",\n\"data\": \"/9j/4AAQSkZJRgABAQEBLAEsAAD/4QBcRXhpZgAATU0AKgAAAAgAAYdpAAQAAA[...]Q==\"\n}\n}\n]\n}\n]" # shortened base64 string for better readablity
}
]
}
],
"stream": "false",
"temperature": 0
}
Hi @jlawman and @somogyijanos, thank you both for the detailed reports and for taking the time to investigate this issue! I’ll look into what’s going wrong with the handling of image content blocks and keep you updated on the progress.
Describe the bug
When running evals on images with Anthropic or Bedrock, the encoded image (base64 string) is passed to Claude as text instead of as an image block.
To Reproduce
Run either of the claude-vision examples. The easiest way to reproduce is to take the claude-vision example and replace the base64 string in https://github.com/promptfoo/promptfoo/blob/main/examples/claude-vision/prompt.json with a different one.
The current example appears to work only because Claude can actually interpret the base64 string as text. Below is an example chat window where I asked Claude 3.5 Sonnet to read the base64 string and give a description (it produced the same description as running `promptfoo eval` with the claude-vision example).
However, images larger than the small example image (e.g. 200 kB in size) don't work, either because of Claude's text token limits or because Claude is not designed to read base64 as text.
Expected behavior
Content blocks labeled as "image" with source type "base64" should be passed as blocks of type "image" (as opposed to the current behavior, where they appear to be parsed as regular message blocks of type "text").
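A quick sanity check one could run against the logged request body to confirm this. The shape is assumed from the log excerpts in this thread, and the helper name is made up for illustration:

```typescript
// Hypothetical check: does a logged Anthropic request body contain
// at least one structured image block? (Body shape assumed from the
// log excerpts above, not taken from promptfoo internals.)
interface RequestBody {
  messages: { role: string; content: string | { type: string }[] }[];
}

function hasImageBlock(body: RequestBody): boolean {
  return body.messages.some(
    (m) => Array.isArray(m.content) && m.content.some((b) => b.type === 'image'),
  );
}
```

On the "expected" body above this returns true; on the "actual" body, where everything is stringified into a single text block, it returns false.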
Screenshots
Example comparison of processing the string as text vs. as an image block:
1) Image of juggling balls processed as text in the Anthropic console
2) Image of juggling balls processed in `promptfoo eval`: "The image appears to be a photograph of a person's face. The image shows a close-up view of a person's face, with their eyes, nose, and mouth visible."
3) Image of juggling balls processed as an image block in the Anthropic console (similar results are possible with notebooks using non-text content blocks)
System information: