Dinghaoxuan opened this issue 8 months ago
I'm also trying to send multiple images for a few-shot request via the pipeline. Thanks in advance
Hi, I want to know if any of you have found a solution?
I tried enclosing each question and answer in the prompt with delimiter symbols, but LLaVA's in-context learning ability is poor: the in-context answers disturb the answer to the query question.
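For reference, here is a minimal sketch of the kind of delimiter-based few-shot template I mean, assuming the same llava package used in the answer below. The build_few_shot_prompt helper and the "Example i:" / "Query:" delimiter strings are hypothetical, not anything LLaVA itself defines:

# Hypothetical few-shot prompt layout: one DEFAULT_IMAGE_TOKEN per image,
# with plain-text delimiters separating the in-context pairs.
from llava.constants import DEFAULT_IMAGE_TOKEN

def build_few_shot_prompt(examples, query_question):
    # examples: list of (question, answer) pairs, one image per example;
    # the images are supplied separately, in the same order, query image last.
    parts = []
    for i, (q, a) in enumerate(examples, start=1):
        parts.append(f"Example {i}:\n{DEFAULT_IMAGE_TOKEN}\nQuestion: {q}\nAnswer: {a}")
    parts.append(f"Query:\n{DEFAULT_IMAGE_TOKEN}\nQuestion: {query_question}\nAnswer:")
    return "\n\n".join(parts)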
Question
Hello, I want to give LLaVA some in-context examples, but I cannot find any guidance on how to insert images into the input prompt. Could you share some templates for multi-image input prompts? Thank you very much.
OK, I got it working with the changes below. This example is limited to two images, and I'm not sure how many images can be added in total.
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle
from PIL import Image
import requests
import copy
import torch
pretrained = "lmms-lab/llama3-llava-next-8b"
model_name = "llava_llama_3"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map) # Add any other thing you want to pass in llava_model_args
model.eval()
model.tie_weights()
url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)
image_tensor = process_images([image], image_processor, model.config)
image_tensor_list = [_image.to(dtype=torch.float16, device=device) for _image in image_tensor] # Jai: replace image_tensor with image_tensor_list so the second image can be appended
## Jai: add second image ["digitally altered image of a person standing in the water"]
image2 = Image.open("./LLaVA-NeXT/inputs/9f776e16-0d07-40d7-b2fd-45e23267f79b.jpg")
image_tensor2 = process_images([image2], image_processor, model.config)
for _image in image_tensor2:
    image_tensor_list.append(_image.to(dtype=torch.float16, device=device))
Instruction_COT = """There are two different images provided as an input, describe each of them independently"""
conv_template = "llava_llama_3" # Make sure you use correct chat template for different models
question = DEFAULT_IMAGE_TOKEN + DEFAULT_IMAGE_TOKEN + f"\n{Instruction_COT}\n" # Jai: add a second image token to the question; by default there is only one
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device) # move the input ids onto the same device as the model
image_sizes = [image.size, image2.size] # Jai: add second image size here.
cont = model.generate(
    input_ids,
    images=image_tensor_list,
    image_sizes=image_sizes,
    do_sample=False,
    temperature=0,
    max_new_tokens=2024,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
print(text_outputs)
Here's the output:
'\nThe image on the left appears to be a radar chart, also known as a spider chart or a web chart. This type of chart is used to display multivariate data in the form of a two-dimensional chart of three or more quantitative variables represented on axes starting from the same point. Each axis represents a different variable, and the values are plotted along each axis and connected to form a polygon.\n\nThe radar chart in the image is labeled with various acronyms such as "MM-Vet," "LLaVA-Bench," "SEED-Bench," "MMBench-CN," "MMBench," "TextVQA," "POPE," "BLIP-2," "InstructionBLIP," "Owen-VL-Chat," and "LLaVA-1.5." These labels likely represent different benchmarks or models used in a particular context, possibly in the field of natural language processing or a related area of artificial intelligence.\n\nThe radar chart is color-coded, with different colors representing different models or benchmarks. The chart is overlaid with a blue background that seems to be a stylized representation of water, giving the impression that the radar chart is underwater.\n\nThe image on the right shows a person standing in what appears to be a body of water, possibly a pool or a shallow sea, given the clear visibility and the presence of bubbles. The person is wearing a black shirt and dark pants, and they are looking directly at the camera with a neutral expression. The water around them is a bright blue, and there are bubbles visible, suggesting that the water is clear and possibly that the person is underwater. The image is a photograph and has a realistic style.'
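The two-image snippet should generalize to N images in the same way: one DEFAULT_IMAGE_TOKEN per image, one entry in image_tensor_list, and one entry in image_sizes. Here is a hedged sketch of that generalization (the prepare_multi_image helper is hypothetical, and how many images actually fit is bounded by the model's context length; I have not tested beyond two):

# Hypothetical helper generalizing the snippet above to a list of PIL images.
# Assumes the same llava utilities (process_images, DEFAULT_IMAGE_TOKEN) imported above.
def prepare_multi_image(images, instruction, image_processor, model_config, device="cuda"):
    image_tensor_list = []
    for img in images:
        # process_images may return several tensors per input image (e.g. crops); keep them all
        for t in process_images([img], image_processor, model_config):
            image_tensor_list.append(t.to(dtype=torch.float16, device=device))
    # one image token per image, then the text instruction
    question = DEFAULT_IMAGE_TOKEN * len(images) + f"\n{instruction}\n"
    image_sizes = [img.size for img in images]
    return image_tensor_list, question, image_sizes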