Possible workaround for the issues with multimodal...

joakimeriksson commented 7 months ago

Multimodal requests currently seem to ignore the initial prompt: the multimodal example in this repo does not really work if you change the prompt. Here is a modified version that seems to work, but it relies on a workaround:

import random
import httpx
from ollama import chat

latest = httpx.get('https://xkcd.com/info.0.json')
latest.raise_for_status()
num = random.randint(1, latest.json().get('num'))
comic = httpx.get(f'https://xkcd.com/{num}/info.0.json')
comic.raise_for_status()

print(f'xkcd #{comic.json().get("num")}: {comic.json().get("alt")}')
print(f'link: https://xkcd.com/{num}')
print('---')

raw = httpx.get(comic.json().get('img'))
raw.raise_for_status()

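# Workaround: send the image in a first message with blank content,
# then put the actual prompt in a second message without images.
res = chat('llava', messages=[
  {
    'role': 'user',
    'images': [raw.content],
    'content': ' ',
  },
  {
    'role': 'user',
    'images': [],
    'content': 'what is to the left in the image?',
  },
], stream=False)
print(res['message']['content'])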

eliranwong commented 7 months ago

Thanks. I also noticed that the prompt was being ignored, but I hadn't thought of this workaround.

mxyng commented 6 months ago

This should be fixed upstream in ollama by prepending the image tag rather than appending it. The fix is in https://github.com/ollama/ollama/pull/2789, which was released in 0.1.28.
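
For anyone who has to support servers on both sides of 0.1.28, here is a minimal sketch of choosing between the two request shapes. It assumes the caller already knows the server version string (how to obtain it is not covered here), and it assumes that a single message carrying both the image and the prompt works once the fix is in place. The helper names `parse_version` and `build_messages` are hypothetical and not part of ollama-python.

```python
def parse_version(version):
    """Turn a string like '0.1.28' or 'v0.1.27-rc1' into a tuple of three ints."""
    parts = []
    for piece in version.lstrip('v').split('.')[:3]:
        digits = ''
        for ch in piece:
            if not ch.isdigit():
                break  # drop suffixes such as '-rc1'
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)  # pad short versions such as '0.2'
    return tuple(parts)


def build_messages(image, prompt, server_version):
    """Return a messages list for chat(); both helpers are hypothetical.

    Assumption: from 0.1.28 on, one message with image and prompt is enough.
    Older servers get the two-message workaround from the first comment.
    """
    if parse_version(server_version) >= (0, 1, 28):
        return [{'role': 'user', 'content': prompt, 'images': [image]}]
    return [
        {'role': 'user', 'content': ' ', 'images': [image]},
        {'role': 'user', 'content': prompt, 'images': []},
    ]
```

The result can be passed straight to the call from the example above, e.g. `chat('llava', messages=build_messages(raw.content, 'what is to the left in the image?', '0.1.28'), stream=False)`.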

ollama / ollama-python

Possible workaround for the issues with multimodal... #74