haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

[Usage] How can I implement few-shot learning on LLaVA #1202

Open htluandc2 opened 6 months ago

htluandc2 commented 6 months ago

Describe the issue

Hi there,

I have some images with custom explanations for them, and I want to use few-shot learning to generate summaries of my images.

This is my current implementation:

templates = [
    {
        "url": "",
        "explain": """""",
    },
    {
        "url": "",
        "explain": """""",
    },
    {
        "url": "",
        "explain": """"""
    },
    {
        "url": ",
        "explain": """"""
    },
    {
        "url": "",
        "explain": """"""
    },
]

My code to build prompt:

from PIL import Image
import cv2
import numpy as np
import requests

"""Make image summary"""
img_prompt = "User: <image>\n"+"\nASSISTANT:"

prompt = (
    "You are an assistant tasked with summarizing images for retrieval. "
    "These summaries will be embedded and used to retrieve the raw image. "
    "Give a concise summary of the image that is well optimized for retrieval."
)
print(prompt)

images = []

for i, temp in enumerate(templates):
    image_i = Image.open(requests.get(temp['url'], stream=True).raw)
    explain_i = temp["explain"]
    example_i = f"\nUser: <image{i}>" + "\nASSISTANT:" + explain_i + "\n"
    prompt += example_i
    images.append(image_i)

prompt += f"\nUser: <image{len(templates)}>"+"\nASSISTANT:"
print(prompt)
print('-'*100)
print("Examples:", len(images))

Inference:

target = Image.open("figures/figure-2-5.jpg")

out = model_multi_modals(
    images=images+[target],
    prompt=prompt,
    generate_kwargs={"max_new_tokens": 2048})

And my error:

ValueError: The input provided to the model are wrong. The number of image tokens is 0 while the number of image given to the model is 1. This prevents correct indexing and breaks batch generation.
leeyyi commented 6 months ago

In-context learning or fine-tuning

Nomiluks commented 6 months ago

That's an excellent question. OpenAI's GPT models can be steered with a few-shot approach; it would be fantastic if we could apply the same method to these pre-trained models. @haotian-liu

fisher75 commented 5 months ago

Has this been solved? I use SGLang for batch inference, and I also need this feature for ICL, multi-turn discussion, and few-shot prompting.

Debolena7 commented 3 months ago

> (quotes the original issue and error message above)

I think the error is because of the image token. In the prompt, the image token should be given as:

<image>

and not as an image id or index. I got a similar error in my own multi-prompt setup.
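
For reference, here is a minimal sketch of what that change looks like against the snippet above. It assumes model_multi_modals is the same image-to-text pipeline used in the original post and that templates has its URLs and explanations filled in; per the caveat below, there is no guarantee a multi-image prompt works well even once the token error is fixed.

from PIL import Image
import requests

prompt = (
    "You are an assistant tasked with summarizing images for retrieval. "
    "These summaries will be embedded and used to retrieve the raw image. "
    "Give a concise summary of the image that is well optimized for retrieval."
)

images = []
for temp in templates:
    image_i = Image.open(requests.get(temp["url"], stream=True).raw)
    # One literal "<image>" placeholder per few-shot example, no index suffix.
    prompt += "\nUser: <image>\nASSISTANT: " + temp["explain"] + "\n"
    images.append(image_i)

# One more "<image>" placeholder for the target image to summarize.
prompt += "\nUser: <image>\nASSISTANT:"
target = Image.open("figures/figure-2-5.jpg")

out = model_multi_modals(
    images=images + [target],
    prompt=prompt,
    generate_kwargs={"max_new_tokens": 2048})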

BTW, the model cannot really handle multiple images and prompts in a single pass, as is evident from the following conversations by the author and others:

https://discuss.huggingface.co/t/llava-multi-image-input-support-for-inference/68458

https://github.com/haotian-liu/LLaVA/issues/197#:~:text=Due%20to%20the%20current%20way%20of%20training%2C%20we%20do%20not%20observe%20the%20model%20having%20very%20good%20capability%20referring%20to%20/%20comparing%20with%20multiple%20images.%20We%20are%20working%20on%20improving%20this%20aspect%20as%20well%2C%20stay%20tuned!

https://github.com/haotian-liu/LLaVA/issues/57#:~:text=Due%20to%20the%20current%20way%20of%20training%2C%20we%20do%20not%20observe%20the%20model%20having%20very%20good%20capability%20referring%20to%20/%20comparing%20with%20multiple%20images.

https://huggingface.co/YouLiXiya/tinyllava-v1.0-1.1b-hf/discussions/1#:~:text=The%20training%20is%20based%20on%20a%20single%20image.%20Multiple%20images%20are%20not%20supported

ys-zong commented 1 month ago

Hi guys, you can use our codebase, which implements ICL for vision-language models: https://github.com/ys-zong/VL-ICL