phidatahq / phidata

Build AI Agents with memory, knowledge, tools and reasoning. Chat with them using a beautiful Agent UI.
https://docs.phidata.com
Mozilla Public License 2.0
15.57k stars 2.14k forks source link

Vision model not working #1348

Open jacobweiss2305 opened 1 month ago

jacobweiss2305 commented 1 month ago

Hey team, this is hallucinating. Same behavior with OpenAILike

from phi.agent import Agent
from phi.model.openai import OpenAIChat

agent = Agent(
    model=OpenAIChat(id="gpt-4o"),
    markdown=True,
)

# Single Image
agent.print_response(
    [
        {"type": "text", "text": "What's in this image, describe in 1 sentence"},
        {
            "type": "image_url",
            "image_url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
        },
    ]
)

# Multiple Images
agent.print_response(
    [
        {
            "type": "text",
            "text": "Is there any difference between these. Describe them in 1 sentence.",
        },
        {
            "type": "image_url",
            "image_url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
        },
        {
            "type": "image_url",
            "image_url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
        },
    ],
    markdown=True,
)
MANISH007700 commented 5 days ago

+1 tried both via loading an openimage link as well as passing an PIL image [modified code to support this] and also tried via base64 encoded images, modified code for this too : proposal issue # - 1460

in all the cases, the vision model was hallucinating. model gpt4o