haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

LLava always speaks of 2 images #1591

Open DeWolfRobin opened 1 week ago

DeWolfRobin commented 1 week ago

Describe the issue

Issue: I have LLaVA running via Ollama, with a Python script sending screenshots to it. It's meant to give my blind mother a description of what's on screen. Whenever I run the script, the model speaks of two images, both similar to the screenshot but with some discrepancies. Is this a hallucination?

Python code:

import keyboard
import pyttsx3
from PIL import ImageGrab
import requests
import base64
from io import BytesIO
import json

def capture_screenshot():
    return ImageGrab.grab()

def describe_image(image):
    # Encode the screenshot as a base64 JPEG, the format the "images" field expects.
    buffer = BytesIO()
    # JPEG has no alpha channel, so convert in case the grab returned RGBA.
    image.convert("RGB").save(buffer, format="JPEG")
    img_str = base64.b64encode(buffer.getvalue())
    payload = json.dumps({
        "model": "llava",
        "prompt": "This is a screenshot from a Windows PC. Your job is to describe the contents of the screenshot for the user. The description is for a visually impaired or blind person. The description should be like that of another person telling the blind person about what they see in front of them. You can ignore any Windows elements if they are not the main focus.",
        "images": [img_str.decode("utf-8")],
        "stream": False
    })
    r = requests.post('http://localhost:11434/api/generate', data=payload)
    # Fail loudly on an HTTP error instead of raising a confusing KeyError below.
    r.raise_for_status()
    return r.json()["response"]

def narrate_description(description):
    engine = pyttsx3.init()
    engine.say(description)
    engine.runAndWait()

def main():
    print("Press the 'home' key to capture a screenshot and get its description.")
    while True:
        # Block until the key is pressed instead of polling in a busy loop,
        # which would also re-trigger for as long as the key is held down.
        keyboard.wait('home')
        print("Home key pressed, capturing screenshot...")
        screenshot = capture_screenshot()
        print("Screenshot captured, describing the image...")
        description = describe_image(screenshot)
        narrate_description(description)
        print("Press the 'home' key to capture another screenshot.")

if __name__ == "__main__":
    main()

DeWolfRobin commented 1 week ago

I think I found the solution. I had been using "ollama run llava", which is the 7b model, version 1.6. The maximum resolutions there are 672x672, 336x1344, and 1344x336, so it probably split the screenshot into separate images.
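
If the resolution limit really is the cause, one possible workaround is to shrink the screenshot before sending it. The sketch below is untested against the model and only illustrates the idea: it scales an image down so that it fits inside 672x672 (the square limit quoted above) while keeping the aspect ratio. The helper names fit_size and downscale_screenshot are hypothetical, not part of the original script, and whether this actually stops the "two images" descriptions is an assumption that would need checking.

```python
def fit_size(width, height, max_w=672, max_h=672):
    """Return (new_width, new_height) that fits inside max_w x max_h.

    The aspect ratio is preserved and images that already fit are
    left at their original size (the scale factor is capped at 1.0).
    """
    scale = min(max_w / width, max_h / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


def downscale_screenshot(image, max_w=672, max_h=672):
    """Shrink a PIL image so it fits inside max_w x max_h.

    `image` is expected to be a PIL.Image.Image, such as the object
    returned by ImageGrab.grab() in the script above.
    """
    new_size = fit_size(image.size[0], image.size[1], max_w, max_h)
    if new_size == image.size:
        return image  # already small enough, nothing to do
    return image.resize(new_size)
```

In the original script this could be applied as describe_image(downscale_screenshot(capture_screenshot())). Small on-screen text will likely become harder for the model to read at this size, so this trades detail for a single-image input.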