lennartpollvogt / ollama-instructor

Python library for instructing and reliably validating structured outputs (JSON) of Large Language Models (LLMs) with Ollama and Pydantic. -> Deterministic work with LLMs.
MIT License

Few-shot format when using llava to generate different-language captions of an image #2

Closed svjack closed 4 months ago

svjack commented 5 months ago

Hi, thank you very much for providing such an attractive project. I am trying to use llava:13b to generate captions in two languages. I have the following approach:

!pip install datasets ollama-instructor

from datasets import load_dataset
ds = load_dataset("svjack/pokemon-blip-captions-en-zh")
ds = ds["train"]

from ollama_instructor.ollama_instructor_client import OllamaInstructorClient
from pydantic import BaseModel, Field
from enum import Enum
from typing import List

import base64
from io import BytesIO

def im_to_str(image):
    # Encode a PIL image as a base64 JPEG string for Ollama's "images" field.
    buffered = BytesIO()
    image.save(buffered, format="JPEG")
    img_str = base64.b64encode(buffered.getvalue()).decode("utf-8")
    return img_str

class Caption(BaseModel):
    en: str = Field(...,
            description="English caption of image"
        )
    zh: str = Field(...,
            description="Chinese caption of image"
        )

hist = []
for i in range(8):
    hist.append(
        str(
            {"en": ds[i]["en_text"],
             "zh": ds[i]["zh_text"]}
        )
    )
hist_str = "\n".join(hist)
print(hist_str)

client = OllamaInstructorClient()
response = client.chat_completion_with_stream(
    model='llava:13b',
    pydantic_model=Caption,
    messages=[
            {
                "content": f'''
                You are a image to caption transformer,
                Describe the image content in English and Chinese respectly.
                while adhering to the following JSON schema: {Caption.model_json_schema()}
                following are some samples you should give. :
                {hist_str}
                ''',
                "role": "system"
            }
            ,{
                "content": "Describe the image in English and Chinese",
                "role": "user",
                "images": [im_to_str(ds[-1]["image"])]
            }
    ]
)

from IPython.display import clear_output
for chunk in response:
    clear_output(wait = True)
    print(chunk['message']['content'])

The image is:

[image: im0]

And I get the output:

{'en': 'An angry Pokémon with claws out and eyes wide open, possibly preparing to battle or defend itself.', 'zh': '一只冷战蜘蛛,眼睛大张,可能准备进入战斗或防御状态。'}

This output is not accurate enough.

But when I use it in a zero-shot manner, it is accurate in meaning but does not meet the required semantic style or format. It outputs:

{'en': 'Crab', 'zh': '蟹'}

Can you help me with it? 😊

Also, could you open a Discord channel for this project so that we can improve it together? I'm also interested in the unfinished meeting examples in the example dir; if they end up covering group chat in a meeting, that would be amazing. 😊

lennartpollvogt commented 5 months ago

Hi @svjack

Looks interesting! I would like to help you with that. I was trying similar things last weekend 😅 I hope to get this done in the next few days. But to help you, I would like to have some questions answered:

For your provided image, what would be your desired output? Could you give an example and a description of what you expect?

I am not familiar with Discord but I will have a look at what the benefits are. And yes, the unfinished example was intended to provide a use case for meetings. As I am currently working on the async client, this will have to wait. But it is indeed one of my desired use cases 👍🏼

svjack commented 5 months ago

Thank you for your reply. The dataset I used is located at https://huggingface.co/datasets/svjack/pokemon-blip-captions-en-zh. I want to use the image column to produce the en_text and zh_text columns via few-shot inference, with the help of Ollama's llava:13b. The input and output format looks like the following:

Input: [image]

Output:

{'en': 'a red and white ball with an angry look on its face', 'zh': '一个红白相间的球,脸上带着愤怒的表情'}

The output format and syntax should be similar to these:

{'en': 'a drawing of a green pokemon with red eyes', 'zh': '红眼睛的绿色小精灵的图画'}
{'en': 'a green and yellow toy with a red nose', 'zh': '黄绿相间的红鼻子玩具'}
{'en': 'a red and white ball with an angry look on its face', 'zh': '一个红白相间的球,脸上带着愤怒的表情'}
{'en': "a cartoon ball with a smile on it's face", 'zh': '笑容满面的卡通球'}
{'en': 'a bunch of balls with faces drawn on them', 'zh': '一堆画着脸的球'}
{'en': 'a cartoon character with a potted plant on his head', 'zh': '一个头上戴着盆栽的卡通人物'}
{'en': 'a drawing of a pokemon stuffed animal', 'zh': '小精灵毛绒玩具的图画'}
{'en': 'a picture of a cartoon character with a sword', 'zh': '一张带剑的卡通人物图片'}
{'en': 'a drawing of a cartoon character laying on the ground', 'zh': '一个卡通人物躺在地上的图画'}

The captions should focus on item category and color, with some simple action descriptions. 😊

lennartpollvogt commented 5 months ago

So, I made some adjustments to the prompt and set a temperature option to make the model less "creative". Here is what worked well over a few iterations:

response = client.chat_completion(
    model='llava:13b',
    pydantic_model=Caption,
    messages=[
            {
                "content": f'''
                You are a highly accurate image-to-caption transformer.
                Describe the image content in English and Chinese respectively. Make sure to FOCUS on item CATEGORY and COLOR!
                Do NOT provide NAMES! KEEP it SHORT!
                Adhere to the following JSON schema: {Caption.model_json_schema()}
                The following are some samples of the captions you should give:
                {hist_str}
                ''',
                "role": "system"
            }
            ,{
                "content": "Describe the image in English and Chinese",
                "role": "user",
                "images": [im_to_str(ds[-1]["image"])]
            }
    ],
    options={
        "temperature": 0.4,
    }
)

Here are some outputs:

{'en': 'A cartoon crab with a surprised expression', 'zh': '一个惊讶的卡通蟹'}
{'en': 'A cartoon crab with a surprised expression.', 'zh': '一个惊讶的卡通蟹。'}
{'en': 'A cartoon crab with a surprised expression on its face.', 'zh': '一个惊讶的卡通蟹。'}
{'en': 'A cartoon depiction of a red and white Pokémon with a surprised expression.','zh': '一张插图,展示了一只红白相间的小精灵,表情吓人。'}
{'en': 'A red and white Pokemon with a surprised expression.','zh': '一只红色和白色的小精灵,表情惊讶。'}
{ 'en': 'A small, orange and white Pokémon with a surprised expression.','zh': '一个小型的橙色和白色宝可梦,表情吓了。'}

In some cases it responded with a too-long description:

{'en': "This is a Pokémon. It has orange and white colors on its body, with a red 
underside. The Pokémon's eyes are wide open and it seems to be making an angry or surprised 
expression.", 'zh': '这是一只宝可梦。它身上有橙色和白色的颜色,下面是红色。宝可梦的眼睛张开,看起来像在生气或惊讶。'}
{'en': "This is a Pokémon, specifically the character known as Crawdaunt. It's a 
crustacean-like creature with orange and white coloration. The Pokémon has two large claws on its 
front legs and appears to be in a defensive or aggressive posture.", 'zh': '这是一只宝可梦,具体是名为Crawdaunt的角虫类生物。其颜色是橙色和白色。这个宝可梦有两只大的前脚刺,看
起来像在进行防御或攻击。'}
{ 'en': 'A cartoon depiction of a crab with a surprised or shocked expression, sitting on 
its hind legs. It has a vibrant orange shell and large eyes that are wide open.', 'zh': '一个插画的螃蟹,它在后脚上坐着,表情吓唬或震惊。它有一个鲜艳的橙色外壳和大大的眼睛张开。'}

Does this help? You can try to play with the temperature or the system prompt.
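
If the outputs still vary a lot between runs, Ollama's options also accept a fixed seed. The following is just a hedged sketch of settings one could pass as the options argument above; whether this combination helps here is untested:

options = {
    "temperature": 0.2,  # lower values make the output less "creative"
    "seed": 42,          # fixed sampling seed for more repeatable outputs
}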

Another approach you could try: instruct an LLM (e.g. phi3) to validate whether the image description meets the instructions you provided within the system prompt, and to generate a processable output like:

from enum import Enum
from pydantic import BaseModel

class Length(Enum):
    short = 1
    medium = 2
    long = 3

class Relevance(Enum):
    TRUE = True
    FALSE = False

class ImageDescriptionClassifier(BaseModel):
    length: Length        # how long the caption is
    relevance: Relevance  # whether the caption follows the instructions
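
A minimal sketch of how such a validation step could look, reusing the chat_completion API from above; the classifier prompt wording and the example caption are illustrative assumptions, not tested code:

check = client.chat_completion(
    model='phi3',
    pydantic_model=ImageDescriptionClassifier,
    messages=[
        {
            "role": "system",
            "content": f'''You classify image captions by LENGTH and RELEVANCE
            (relevant = focused on item category and color, no names),
            adhering to this JSON schema: {ImageDescriptionClassifier.model_json_schema()}'''
        },
        {
            "role": "user",
            "content": "Classify this caption: 'A cartoon crab with a surprised expression'"
        }
    ]
)
# If length is "long" or relevance is FALSE, re-prompt llava for a new caption.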

Or you set a limit for the fields "en" and "zh". If the response for a field (e.g. "en") exceeds the limit, validation will raise an exception and trigger a retry (which is a feature of ollama-instructor).

svjack commented 5 months ago

Thanks for your reply. As you say, field constraints can also help, e.g. editing the schema as follows:

from pydantic import BaseModel, Field, constr

class Caption(BaseModel):
    en: constr(max_length=128) = Field(...,
            description="English caption of image"
        )
    zh: constr(max_length=64, pattern=r'精灵') = Field(...,
            description="Chinese caption of image"
        )

This gets the required results. 😊
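
To make the retry behaviour explicit, here is a hedged sketch of a call with this constrained model; the retries argument is an assumption based on the retry feature mentioned above, so check the ollama-instructor docs for the exact parameter:

# If a generated caption violates max_length or the pattern, Pydantic
# validation fails and ollama-instructor re-prompts the model.
response = client.chat_completion(
    model='llava:13b',
    pydantic_model=Caption,
    retries=3,  # assumed parameter name for the retry feature
    messages=[
        {
            "role": "user",
            "content": "Describe the image in English and Chinese",
            "images": [im_to_str(ds[-1]["image"])]
        }
    ]
)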

lennartpollvogt commented 5 months ago

Btw. would it be possible to use this thread and part of your provided code to create a guide within the docs folder for using multimodal models with ollama-instructor?

svjack commented 5 months ago

Of course. By the way, does ollama-instructor provide function calling and examples? 😊

lennartpollvogt commented 5 months ago

Thank you.

Do you mean by "function calling" what OpenAI is doing? Isn't that the same as asking the LLM to respond in a certain structure (like JSON) and processing the output within a function? Or more like letting the LLM choose which function(s) it should "call" for a certain request or within the context?

If you mean the second, then the answer is no. Not completely, anyway, but you could try to build it on top of ollama-instructor. Currently I am working on such an approach. 😉
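
Until then, a minimal sketch of how tool selection could be approximated on top of ollama-instructor: let the model fill a Pydantic model that names a tool and its arguments, then dispatch in Python. The tool names and the dispatch step are hypothetical illustrations, not part of the library:

from enum import Enum
from typing import Dict
from pydantic import BaseModel
from ollama_instructor.ollama_instructor_client import OllamaInstructorClient

class Tool(Enum):
    # hypothetical tools, for illustration only
    get_weather = "get_weather"
    search_web = "search_web"

class ToolCall(BaseModel):
    tool: Tool                 # which function the model "chose"
    arguments: Dict[str, str]  # arguments to pass to that function

client = OllamaInstructorClient()
response = client.chat_completion(
    model='phi3',
    pydantic_model=ToolCall,
    messages=[
        {
            "role": "system",
            "content": f"Pick the right tool for the user's request, adhering to this JSON schema: {ToolCall.model_json_schema()}"
        },
        {"role": "user", "content": "What is the weather in Berlin?"}
    ]
)
# The validated ToolCall in response['message']['content'] can then be
# dispatched to a real Python function.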