haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

[Usage] bug on https://llava.hliu.cc/ #105

Open jiangying000 opened 1 year ago

jiangying000 commented 1 year ago

When did you clone our code?

I cloned the code base after 5/1/23

Describe the issue

Issue:

In my second message I uploaded a turtle image, but the answer is about the first image.

Screenshots: image

jiangying000 commented 1 year ago

This issue is consistently reproducible.

jiangying000 commented 1 year ago

Sometimes the demo mixes up the content of the two images (screenshot attached).

haotian-liu commented 1 year ago

Hi @jiangying000, thank you for your interest in our work.

Because the model is currently trained with only a single image per conversation, we have not seen it do well at referring to or comparing multiple images. You may also refer to the discussion and examples in this thread: #57.

We are working on improving this aspect as well, stay tuned!
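
To illustrate the single-image limitation described above, one possible client-side workaround is to start a fresh conversation whenever a different image is uploaded, so the model never sees two images in one context. The following is only a minimal sketch in plain Python: `Conversation` and `add_user_turn` are hypothetical names made up for this example and are not part of the LLaVA codebase or the demo.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Conversation:
    """Hypothetical chat state: at most one image plus a list of (role, text) turns."""
    image: Optional[str] = None  # path or identifier of the single active image
    turns: List[Tuple[str, str]] = field(default_factory=list)


def add_user_turn(conv: Conversation, text: str, image: Optional[str] = None) -> Conversation:
    """Append a user message; start over if a different image is supplied."""
    if image is not None and conv.image is not None and image != conv.image:
        # A second image would put two images into one context, which the
        # model was not trained on, so the earlier history is dropped.
        conv = Conversation()
    if image is not None:
        conv.image = image
    conv.turns.append(("user", text))
    return conv


conv = Conversation()
conv = add_user_turn(conv, "What is in this picture?", image="cat.jpg")
conv = add_user_turn(conv, "What color is it?")  # same image, history is kept
conv = add_user_turn(conv, "And what about this one?", image="turtle.jpg")  # new image, history is reset
```

With this approach, the turtle question from the report would be asked in a conversation that contains only the turtle image. The trade-off is that earlier turns are lost, so it does not help when the goal is to compare two images.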