haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0
20.25k stars 2.24k forks

How to provide 2 or more images to the model and compare the beauty of the images? #874

Open xinsir6 opened 11 months ago

xinsir6 commented 11 months ago

Question

I have a problem when using LLaVA to process multiple images at the same time: for example, giving the model 2 or more images and asking it questions about them, such as "Which one do you like better?" The web demo and chat can't handle this, so could you provide a dedicated script for it?

mapluisch commented 11 months ago

You might try putting two images into one by separating them with a horizontal line and including that detail in your prompt. I haven't tested this method extensively, but it seemed to work quite well in a quick trial I did:

(attached image: fused_image)

If this works well enough for you, you could adjust the scripts to accept two images which then get merged in this way. I might create a quick CLI demo.
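A minimal sketch of that merging step, assuming Pillow is available. The function name `fuse_vertically` and its parameters are hypothetical and not taken from the LLaVA scripts; narrower images are centered on a white canvas rather than resized.

```python
from PIL import Image, ImageDraw


def fuse_vertically(top, bottom, line_px=8,
                    line_color=(0, 0, 0), background=(255, 255, 255)):
    """Stack two images with a horizontal separator line between them.

    Hypothetical helper for illustration: neither image is resized or cropped.
    """
    top = top.convert("RGB")
    bottom = bottom.convert("RGB")

    # The canvas is as wide as the wider image and tall enough for both plus the line.
    width = max(top.width, bottom.width)
    height = top.height + line_px + bottom.height
    canvas = Image.new("RGB", (width, height), background)

    # Paste the first image at the top, horizontally centered.
    canvas.paste(top, ((width - top.width) // 2, 0))

    # Draw the separator line directly below the first image.
    draw = ImageDraw.Draw(canvas)
    draw.rectangle([0, top.height, width - 1, top.height + line_px - 1],
                   fill=line_color)

    # Paste the second image below the separator, horizontally centered.
    canvas.paste(bottom, ((width - bottom.width) // 2, top.height + line_px))
    return canvas
```

The fused image can then be saved with `fuse_vertically(img_a, img_b).save("fused.jpg")` and passed to the model as a single image, with the prompt mentioning that the two pictures are separated by a horizontal line.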

xinsir6 commented 11 months ago

Good idea, I will try it. The only problem is that this method requires the two images to be resized to the same width/height.

mapluisch commented 11 months ago

I've made a simple proof of concept for you based on cli.py:

https://github.com/mapluisch/LLaVA-CLI-with-multiple-images

It doesn't resize (or truncate / crop) the images, just concatenates them vertically.

xinsir6 commented 11 months ago

(two images attached) I tried prompting the model the way you suggested, but the model refuses to reply. Is there any way to solve this?

mapluisch commented 11 months ago

You should play around with different prompts and temperatures. I'm using 4-bit quantization on the 13b model, and this prompt works:

Analyze the two images and tell me which one is better and why

yuejunpeng commented 5 months ago

@mapluisch Hello! It is true that you don’t need to resize the two images when you just concatenate them vertically, but I think the model does resize them to fit when feeding them to the CLIP image encoder.
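To get a rough feel for how much resolution each picture keeps after that resize, here is a small arithmetic sketch. It rests on two assumptions that are not confirmed in this thread: the encoder input is a 336×336 square, and the fused canvas is padded to a square before being scaled down (so the scale factor is set by its longer side). The function name `size_inside_encoder` is hypothetical.

```python
def size_inside_encoder(img_w, img_h, canvas_w, canvas_h, encoder_side=336):
    """Approximate pixel size one image occupies once the canvas is scaled.

    Assumes the canvas is padded to a square and resized to
    encoder_side x encoder_side; both are assumptions, not confirmed behavior.
    """
    long_side = max(canvas_w, canvas_h)
    # Every pixel of the canvas is scaled by encoder_side / long_side.
    return (round(img_w * encoder_side / long_side),
            round(img_h * encoder_side / long_side))


# A single 640x480 image fed on its own.
alone = size_inside_encoder(640, 480, 640, 480)

# The same image as one half of a 640x960 vertical stack (separator ignored).
stacked = size_inside_encoder(640, 480, 640, 960)

print(alone, stacked)
```

Under these assumptions each image in a vertical stack ends up noticeably smaller than it would be on its own, which may matter when the comparison depends on fine detail.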