vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: vision chat completion output with odd Instruction/Output prompting. #5693

Closed pseudotensor closed 4 months ago

pseudotensor commented 4 months ago

Your current environment

git clone https://github.com/vllm-project/vllm.git
cd ~/vllm
conda create -n vllm -y
conda activate vllm
conda install python=3.10 -y
pip install -e .
pip install hf_transfer
pip install torchvision

Latest main, commit afed90a0344b1b0ce6aae46efc630adb489ec769

run:

export NCCL_IGNORE_DISABLED_P2P=1
export CUDA_VISIBLE_DEVICES=5
python -m vllm.entrypoints.openai.api_server --port=5063 \
      --host=0.0.0.0 --model microsoft/Phi-3-vision-128k-instruct \
      --tensor-parallel-size=1 --seed 1234 \
      --max-num-batched-tokens=8192        \
      --trust-remote-code \
      --tensor-parallel-size=1 \
      --max-num-batched-tokens=131072 --max-log-len=100 \
      --image-input-type=pixel_values \
      --image-token-id=32044 \
      --image-input-shape="1,3,1008,1344" \
      --image-feature-size=1921 \
      --download-dir=$HOME/.cache/huggingface/hub &> vllm_phi3_vision.log &

🐛 Describe the bug

from openai import OpenAI

client = OpenAI(base_url='http://localhost:5063/v1')

messages1 = [
    {
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'What do you see?'},
            {'type': 'image_url',
             'image_url': {
                'url': '',
                }
             },
        ],
    }
]

messages2 = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What’s in this image?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                },
            },
        ],
    }
]

messages = messages1

response = client.chat.completions.create(
    model="microsoft/Phi-3-vision-128k-instruct",
    messages=messages,
    max_tokens=300,
)

print(response.choices[0])

While the latter "messages2" works, the former does not. It leads to:

 .\n Instruction: How would you express the content of this image succinctly?\nOutput:  a long exposure shot of the big ben at night \n'

So it sees the image, but the response is all messed up in terms of prompting.

pseudotensor commented 4 months ago

It's possible I don't understand these things:

      --image-input-type=pixel_values \
      --image-token-id=32044 \
      --image-input-shape="1,3,1008,1344" \
      --image-feature-size=1921 \

it seems odd to have to specify these. Should be derived from the model, but vllm won't start without them.

I got these values from here: https://github.com/vllm-project/vllm/blob/afed90a0344b1b0ce6aae46efc630adb489ec769/examples/phi3v_example.py#L15

ywang96 commented 4 months ago

Hey @pseudotensor Thank you for trying out the vision API and raising this issue.

> it seems odd to have to specify these. Should be derived from the model, but vllm won't start without them.

Yeah, we're working on removing the need to specify these args as part of the next multi-modality refactoring milestone mentioned here

> So it sees the image, but the response is all messed up in terms of prompting.

Are the two images identical but in different format? If not, can you try uploading the first image to a public registry and use url to load it instead, so I can have a better idea where the bug might be?

pseudotensor commented 4 months ago

It has to do with the byte encoding aspect. If I just send the url there is no such issue.

from openai import OpenAI

client = OpenAI(base_url='http://localhost:5063/v1')

messages3 = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What’s in this image?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://h2o-release.s3.amazonaws.com/h2ogpt/bigben.jpg",
                },
            },
        ],
    }
]

messages = messages3

response = client.chat.completions.create(
    model="microsoft/Phi-3-vision-128k-instruct",
    messages=messages,
    max_tokens=300,
)

print(response.choices[0])

gives:

The image depicts a nighttime scene of a city with the iconic Big Ben clock tower illuminated and visible. In the foreground, there's a busy street with parked buses and cars. The traffic lights are glowing red and green, and the street is filled with the motion blur of traffic, creating a vibrant scene of urban life.

But with OpenAI or any of my own systems, that byte encoding version is fine.

E.g.

from openai import OpenAI

#client = OpenAI(base_url='http://localhost/v1')
client = OpenAI()

messages1 = [
    {
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'What do you see?'},
            {'type': 'image_url',
             'image_url': {
                'url': '',
                }
             },
        ],
    }
]

messages = messages1

response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    max_tokens=300,
)

print(response.choices[0])

gives:

The image shows a nighttime scene of the iconic Big Ben clock tower in London, UK, with the Palace of Westminster in the background. The photograph captures the clock tower illuminated with lights. The foreground features a street with blurry trails of light from moving vehicles, indicating that the photo was taken with a long exposure to create light streak effects. The overall scene portrays a vibrant and bustling atmosphere.

I.e., the same encoded payload does not work with vllm.

Or rather, it runs, but the response oddly exposes the structure of prompts, is weaker than expected, and is not close to the url version of the output as it should be. Yet it kind of "sees" what the image is, so I guess there is some bug.

ywang96 commented 4 months ago

Hmm.... how did you encode your image?

Could you try encoding with this function and see if it gives the same string? https://github.com/vllm-project/vllm/blob/4a30d7e3ccae6e977d728e2157aaa11ac0fed549/vllm/multimodal/utils.py#L58

pseudotensor commented 4 months ago

Here's how I'm encoding: https://github.com/h2oai/h2ogpt/blob/main/src/vision/utils_vision.py#L86-L118

It works for lmdeploy, cogvlm2's fastAPI app, OpenAI, Anthropic, Google.

The encoding you pointed to has the same issue only with vllm, not with OpenAI etc.

from openai import OpenAI

client = OpenAI(base_url='http://localhost:5063/v1')
#client = OpenAI()

from PIL import Image
import base64
from io import BytesIO

def encode_image_base64(image: Image.Image, format: str = 'JPEG') -> str:
    """encode image to base64 format."""

    buffered = BytesIO()
    if format == 'JPEG':
        image = image.convert('RGB')
    image.save(buffered, format)
    return base64.b64encode(buffered.getvalue()).decode('utf-8')

image = Image.open('/home/jon/Downloads/bigben.jpg')
byte_image = encode_image_base64(image)

messages4 = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What’s in this image?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": byte_image,
                },
            },
        ],
    }
]

messages = messages1

response = client.chat.completions.create(
    model="microsoft/Phi-3-vision-128k-instruct",
    messages=messages,
    max_tokens=300,
)

print(response.choices[0])

gives:

 .\n Instruction: What's the terse description for this image?\nOutput:  the big ben and houses of parliament at night \n

ywang96 commented 4 months ago

I see - I will assign this to myself and take a look later this week or next week. I suspect the two image payloads don't give the same pixel_values.

pseudotensor commented 4 months ago

Hi, any progress here? Thanks. I'll stop bumping, but I'm quite interested in using phi-3 vision with vllm.

ywang96 commented 4 months ago

@pseudotensor I think I figured it out. When encoding the image to base64, we cannot start from the object returned by Image.open(), because saving it again through PIL re-encodes the image and changes the bytes of the file. Instead, we should base64-encode either the HTTP response content or the bytes read from the file directly. Below is a quick snippet to show this.

from PIL import Image
import base64
import requests
from io import BytesIO

# The encoding function I linked previously - but we actually don't use this function in the API server
def encode_image_base64(image: Image.Image, format: str = 'JPEG') -> str:
    """encode image to base64 format."""

    buffered = BytesIO()
    if format == 'JPEG':
        image = image.convert('RGB')
    image.save(buffered, format)
    return base64.b64encode(buffered.getvalue()).decode('utf-8')

# This is what we use in the API server to load the base64 string to image
def load_image_from_base64(image: str):
    """Load image from base64 format."""
    return Image.open(BytesIO(base64.b64decode(image)))

# load image from url
url = "https://h2o-release.s3.amazonaws.com/h2ogpt/bigben.jpg"
image = Image.open(BytesIO(requests.get(url).content))

# correct way to encode an image from url
response = requests.get(url)
base64_correct = base64.b64encode(response.content).decode('utf-8')
image_encoded_correct = load_image_from_base64(base64_correct)

assert image == image_encoded_correct, "images are not the same"

# incorrect way to encode an image from url
base64_wrong = encode_image_base64(image)
image_encoded_wrong = load_image_from_base64(base64_wrong)

assert image == image_encoded_wrong, "images are not the same"

Running the above should give you the following:

Traceback (most recent call last):
  File "/home/jovyan/test.py", line 48, in <module>
    assert image == image_encoded_wrong, "images are not the same"
AssertionError: images are not the same

You can further use AutoProcessor from transformers to compare the pixel values to see that they're not the same.

from transformers import AutoProcessor
import torch

model_id = "microsoft/Phi-3-vision-128k-instruct" 
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

prompt = "What's in the image?<|image_1|>"
inputs = processor(prompt, [image], return_tensors="pt").to("cuda:0")
inputs_encoded_wrong = processor(prompt, [image_encoded_wrong], return_tensors="pt").to("cuda:0")

assert torch.equal(inputs.pixel_values, inputs_encoded_wrong.pixel_values)

Could you try the correct way to encode the image, then send it through the server and see if the output is correct?
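
For a local file rather than a URL, the same idea can be sketched as follows. This is only an illustration, not code from vLLM: `encode_file_base64` is a hypothetical helper name, and the throwaway file stands in for a real image such as bigben.jpg. The point is that the bytes are read and encoded as stored on disk, with no decode and re-save step in between.

```python
import base64
import tempfile
from pathlib import Path


def encode_file_base64(path: str) -> str:
    """Base64-encode a file's bytes exactly as stored on disk (hypothetical helper)."""
    return base64.b64encode(Path(path).read_bytes()).decode("utf-8")


# Demonstrate on a throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"\xff\xd8\xff\xe0placeholder")  # placeholder, not a real JPEG
    encoded = encode_file_base64(str(sample))
    # Decoding the base64 string gives back the original bytes unchanged.
    print(base64.b64decode(encoded) == sample.read_bytes())
```

Because nothing is decoded or re-saved, the server receives the same bytes it would fetch from a URL pointing at that file.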

pseudotensor commented 4 months ago

The encoding I'm using is compatible with all normal providers: OpenAI, Anthropic, Google, lmdeploy, sglang, etc. etc.

So I don't think the issue is one of encoding, since I'm using the same encoding for all these cases.

As for Image -> vllm's own encoding code, I only used that because you asked me to. It's not normally what I'm doing.

If what vllm is doing is not generally compatible, I think that's a major issue.

However, I can follow along and help you identify the issue.

pseudotensor commented 4 months ago

If I just use what you did, it doesn't work, because the OpenAI API expects a valid image url like 'data:image' etc.:

from openai import OpenAI

client = OpenAI(base_url='http://localhost/v1')
#client = OpenAI()

from PIL import Image
import base64
import requests
from io import BytesIO

# The encoding function I linked previously - but we actually don't use this function in the API server
def encode_image_base64(image: Image.Image, format: str = 'JPEG') -> str:
    """encode image to base64 format."""

    buffered = BytesIO()
    if format == 'JPEG':
        image = image.convert('RGB')
    image.save(buffered, format)
    return base64.b64encode(buffered.getvalue()).decode('utf-8')

# This is what we use in the API server to load the base64 string to image
def load_image_from_base64(image: str):
    """Load image from base64 format."""
    return Image.open(BytesIO(base64.b64decode(image)))

# load image from url
url = "https://h2o-release.s3.amazonaws.com/h2ogpt/bigben.jpg"
image = Image.open(BytesIO(requests.get(url).content))

# correct way to encode an image from url
response = requests.get(url)
base64_correct = base64.b64encode(response.content).decode('utf-8')
image_encoded_correct = load_image_from_base64(base64_correct)

assert image == image_encoded_correct, "images are not the same"

# incorrect way to encode an image from url
base64_wrong = encode_image_base64(image)
image_encoded_wrong = load_image_from_base64(base64_wrong)

#assert image == image_encoded_wrong, "images are not the same"

messages4 = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What’s in this image?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": base64_correct,
                },
            },
        ],
    }
]

messages = messages4

response = client.chat.completions.create(
    model="microsoft/Phi-3-vision-128k-instruct",
    #model="gpt-4o",
    messages=messages,
    max_tokens=300,
)

print(response.choices[0])

gives:

openai.BadRequestError: Error code: 400 - {'object': 'error', 'message': "Invalid image url: A valid image url must start with either 'data:image' or 'http'.", 'type': 'BadRequestError', 'param': None, 'code': 400}

pseudotensor commented 4 months ago

If I add the correct prefix:

"url": 'data:image/jpeg;base64,' + base64_correct,

then I get:

The image captures a vibrant night scene in London, England, dominated by the iconic Big Ben clock tower. The tower, bathed in white light, stands tall against the dark sky, its black dome and spire standing out starkly. The surrounding buildings, mostly white and brown, add to the urban landscape. The street below is filled with the movement of traffic, the cars and buses streaking past, creating a blur of light trails against the backdrop of the city's illuminated skyline. The image is a beautiful blend of urban life, architectural grandeur, and the timeless charm of London.

So that's correct.
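
The prefix requirement can be wrapped in a small helper. The sketch below is an assumption-laden illustration: `to_data_url` and `image_message` are hypothetical names, and the bytes are placeholders rather than a real JPEG. It only shows how the `data:image/jpeg;base64,` prefix and the message structure used in this thread fit together.

```python
import base64


def to_data_url(raw: bytes, mime_type: str = "image/jpeg") -> str:
    """Wrap raw image bytes in a data URL (hypothetical helper)."""
    encoded = base64.b64encode(raw).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


def image_message(text: str, raw: bytes, mime_type: str = "image/jpeg") -> list:
    """Build a one-turn message list with one text part and one image part."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": text},
                {"type": "image_url", "image_url": {"url": to_data_url(raw, mime_type)}},
            ],
        }
    ]


# Placeholder bytes; in practice these would be the unmodified file contents.
messages = image_message("What do you see?", b"\xff\xd8\xff\xe0 not a real JPEG")
print(messages[0]["content"][1]["image_url"]["url"].startswith("data:image/jpeg;base64,"))
```

The resulting list can be passed as `messages=` to `client.chat.completions.create(...)` in the same way as `messages4` above.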

pseudotensor commented 4 months ago

Here's an example that uses the encoding you recommended and still fails. I only changed the prompt from "What’s in this image?" to "What do you see?"

from openai import OpenAI

client = OpenAI(base_url='http://localhost/v1')

from PIL import Image
import base64
import requests
from io import BytesIO

# The encoding function I linked previously - but we actually don't use this function in the API server
def encode_image_base64(image: Image.Image, format: str = 'JPEG') -> str:
    """encode image to base64 format."""

    buffered = BytesIO()
    if format == 'JPEG':
        image = image.convert('RGB')
    image.save(buffered, format)
    return base64.b64encode(buffered.getvalue()).decode('utf-8')

# This is what we use in the API server to load the base64 string to image
def load_image_from_base64(image: str):
    """Load image from base64 format."""
    return Image.open(BytesIO(base64.b64decode(image)))

# load image from url
url = "https://h2o-release.s3.amazonaws.com/h2ogpt/bigben.jpg"
image = Image.open(BytesIO(requests.get(url).content))

# correct way to encode an image from url
response = requests.get(url)
base64_correct = base64.b64encode(response.content).decode('utf-8')
image_encoded_correct = load_image_from_base64(base64_correct)

assert image == image_encoded_correct, "images are not the same"

# incorrect way to encode an image from url
base64_wrong = encode_image_base64(image)
image_encoded_wrong = load_image_from_base64(base64_wrong)

#assert image == image_encoded_wrong, "images are not the same"

messages4 = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What do you see?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": 'data:image/jpeg;base64,' + base64_correct,
                },
            },
        ],
    }
]

messages = messages4

response = client.chat.completions.create(
    model="microsoft/Phi-3-vision-128k-instruct",
    #model="gpt-4o",
    messages=messages,
    max_tokens=300,
)

print(response.choices[0])

gives:

 .\n Instruction: In minimal words, what does this image illustrate?\nOutput: <|placeholder28|> Tower Bridge at night with moving lights from the vehicles. \n

pseudotensor commented 4 months ago

The only reason your assertion fails is that the quantization of the image is slightly different after re-encoding. But that shouldn't affect vllm. In the end, the two images are the same up to small, noise-level changes.

[image attachment]

That shouldn't cause the response to be completely wrong.
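
One way to put a number on "noise-level changes" is to measure the largest per-channel pixel difference between an image and its JPEG round trip. The sketch below is an illustration only: it uses a synthetic gradient instead of bigben.jpg, the helper names `reencode_jpeg` and `max_channel_difference` are hypothetical, and the quality setting is an arbitrary assumption rather than what any of the tools in this thread use.

```python
from io import BytesIO

from PIL import Image, ImageChops


def reencode_jpeg(image: Image.Image, quality: int = 75) -> Image.Image:
    """Save the image as JPEG into memory and decode it again."""
    buffered = BytesIO()
    image.convert("RGB").save(buffered, format="JPEG", quality=quality)
    buffered.seek(0)
    reloaded = Image.open(buffered)
    reloaded.load()  # force decoding now, while the buffer is at hand
    return reloaded


def max_channel_difference(a: Image.Image, b: Image.Image) -> int:
    """Largest absolute per-channel difference between two same-sized images."""
    diff = ImageChops.difference(a.convert("RGB"), b.convert("RGB"))
    # For multi-band images, getextrema() returns one (min, max) pair per band.
    return max(high for _, high in diff.getextrema())


# Synthetic stand-in for the photo: three rotated copies of a 256x256 gradient.
gray = Image.linear_gradient("L")
original = Image.merge("RGB", (gray, gray.rotate(90), gray.rotate(180)))

roundtrip = reencode_jpeg(original)
print("same size:", original.size == roundtrip.size)
print("max per-channel difference:", max_channel_difference(original, roundtrip))
```

A value of 0 would mean the round trip was pixel-exact; any nonzero value is enough to make a strict `image == other` comparison fail, even when the two images look identical.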

ywang96 commented 4 months ago

@pseudotensor Thanks for trying out the examples!

To clarify, I asked you to try these examples because we need to see exactly where (at which layer) the bug is. At least for this case, we'd like to make sure the input images are identical when loaded as a PIL.Image.Image, because that's eventually what gets passed to AutoProcessor to generate the pixel values that are passed to the vision tower.

If the underlying model (in this case, microsoft/Phi-3-vision-128k-instruct) is sensitive to noise-level changes, then I don't think it's vLLM's responsibility to deal with such an issue.

Perhaps a good way to debug this is to test these inputs with transformers and see whether the model is able to generate correct responses. If it is, we will then know for sure that there's something wrong in the model implementation in vLLM that we need to look into.

ywang96 commented 4 months ago

cc @Isotr0py if you have any idea about this since you worked on the PR to add this model.

pseudotensor commented 4 months ago

Ok will compare to transformers.

pseudotensor commented 4 months ago

I'm unable to make transformers fail. E.g.:

import base64
from io import BytesIO

from PIL import Image
import requests
from transformers import AutoModelForCausalLM
from transformers import AutoProcessor

model_id = "microsoft/Phi-3-vision-128k-instruct"

model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda", trust_remote_code=True, torch_dtype="auto", _attn_implementation='flash_attention_2') # use _attn_implementation='eager' to disable flash attention

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

messages = [
    {"role": "user", "content": "<|image_1|>\nWhat do you see?"},
]

# The encoding function I linked previously - but we actually don't use this function in the API server
def encode_image_base64(image: Image.Image, format: str = 'JPEG') -> str:
    """encode image to base64 format."""

    buffered = BytesIO()
    if format == 'JPEG':
        image = image.convert('RGB')
    image.save(buffered, format)
    return base64.b64encode(buffered.getvalue()).decode('utf-8')

# This is what we use in the API server to load the base64 string to image
def load_image_from_base64(image: str):
    """Load image from base64 format."""
    return Image.open(BytesIO(base64.b64decode(image)))

url = "https://h2o-release.s3.amazonaws.com/h2ogpt/bigben.jpg"
image = Image.open(requests.get(url, stream=True).raw)

response = requests.get(url)
base64_correct = base64.b64encode(response.content).decode('utf-8')
image_encoded_correct = load_image_from_base64(base64_correct)

assert image == image_encoded_correct, "images are not the same"

# incorrect way to encode an image from url
base64_wrong = encode_image_base64(image)
image_encoded_wrong = load_image_from_base64(base64_wrong)

assert image != image_encoded_wrong, "images are unexpectedly the same"

prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(prompt, [image_encoded_wrong], return_tensors="pt").to("cuda:0")

generation_args = {
    "max_new_tokens": 1024,
    "temperature": 0.0,
    "do_sample": False,
}

generate_ids = model.generate(**inputs, eos_token_id=processor.tokenizer.eos_token_id, **generation_args)

# remove input tokens
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

print(response)

gives:

The image shows a nighttime scene of a cityscape with the iconic Big Ben clock tower illuminated against the dark sky. The clock face is clearly visible, showing the time. In the foreground, there is a busy street with fast-moving traffic, creating a streak of light trails. The buildings in the background have their lights on, and the overall atmosphere is bustling and vibrant.

If vllm really did the equivalent of transformers and (say) took the base64 string, converted it to an image, and passed that to the transformers processor, then everything should be fine.
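
As a rough illustration of the first half of that path, the sketch below splits a base64 data URL into its MIME type and decoded bytes. It is not vLLM's implementation; `parse_data_url` is a hypothetical helper and only handles the `;base64` form used in this thread.

```python
import base64


def parse_data_url(url: str) -> tuple[str, bytes]:
    """Split a base64 data URL into (mime type, decoded bytes). Hypothetical helper."""
    if not url.startswith("data:"):
        raise ValueError("not a data URL")
    header, sep, payload = url[len("data:"):].partition(",")
    if not sep:
        raise ValueError("data URL has no comma separator")
    if not header.endswith(";base64"):
        raise ValueError("only base64 data URLs are handled in this sketch")
    # Drop the ';base64' marker and any other parameters after the MIME type.
    mime_type = header[: -len(";base64")].split(";")[0]
    return mime_type, base64.b64decode(payload)


mime_type, raw = parse_data_url("data:image/jpeg;base64,YWJj")
print(mime_type, len(raw))
```

The decoded bytes could then be opened with `Image.open(BytesIO(raw))`, as `load_image_from_base64` does above, and handed to the processor.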

pseudotensor commented 4 months ago

Here I go over many prompts. Transformers is always stable with the "bad encoding" image version.

import base64
from io import BytesIO

from PIL import Image
import requests
from transformers import AutoModelForCausalLM
from transformers import AutoProcessor

model_id = "microsoft/Phi-3-vision-128k-instruct"

model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda", trust_remote_code=True, torch_dtype="auto",
                                             _attn_implementation='flash_attention_2')  # use _attn_implementation='eager' to disable flash attention

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# The encoding function I linked previously - but we actually don't use this function in the API server
def encode_image_base64(image: Image.Image, format: str = 'JPEG') -> str:
    """encode image to base64 format."""

    buffered = BytesIO()
    if format == 'JPEG':
        image = image.convert('RGB')
    image.save(buffered, format)
    return base64.b64encode(buffered.getvalue()).decode('utf-8')

# This is what we use in the API server to load the base64 string to image
def load_image_from_base64(image: str):
    """Load image from base64 format."""
    return Image.open(BytesIO(base64.b64decode(image)))

url = "https://h2o-release.s3.amazonaws.com/h2ogpt/bigben.jpg"
image = Image.open(requests.get(url, stream=True).raw)

response = requests.get(url)
base64_correct = base64.b64encode(response.content).decode('utf-8')
image_encoded_correct = load_image_from_base64(base64_correct)

assert image == image_encoded_correct, "images are not the same"

# incorrect way to encode an image from url
base64_wrong = encode_image_base64(image)
image_encoded_wrong = load_image_from_base64(base64_wrong)

assert image != image_encoded_wrong, "images are unexpectedly the same"

generation_args = {
    "max_new_tokens": 1024,
    "temperature": 0.0,
    "do_sample": False,
}

prompts = [
    "What do you see?",
    "Describe the image.",
    "What is in the image?",
    "Can you tell me what you see?",
    "Can you describe the image?",
    "Can you tell me what is in the image?",
    "Can you describe what you see?",
    "Can you tell me what you see in the image?",
    "Can you describe what you see in the image?",
    "Can you tell me what is in the image?",
    "Can you describe what is in the image?",
    "Can you tell me what you see?",
    "Can you describe what you see?",
    "Can you tell me what you see in the image?",
    "Can you describe what you see in the image?",
    "Can you tell me what is in the image?",
]

responses = []
for prompt in prompts:
    messages = [
        {"role": "user", "content": f"<|image_1|>\n{prompt}"},
    ]

    prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(prompt, [image_encoded_wrong], return_tensors="pt").to("cuda:0")

    generate_ids = model.generate(**inputs, eos_token_id=processor.tokenizer.eos_token_id, **generation_args)

    # remove input tokens
    generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
    response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

    print(response)
    responses.append(response)

print(responses)

gives:

['The image shows a nighttime scene of a cityscape with the iconic Big Ben clock tower illuminated against the dark sky. The clock face is clearly visible, showing the time. In the foreground, there is a busy street with fast-moving traffic, creating a streak of light trails. The buildings in the background have their lights on, and the overall atmosphere is bustling and vibrant.', "The image captures a nighttime scene of the Big Ben clock tower in London, illuminated against a dark sky. The tower is the central focus, with its clock face clearly visible. In the foreground, there's a busy street with moving traffic, creating a streak of light trails. The overall atmosphere is urban and bustling.", 'The image features the iconic Big Ben clock tower illuminated at night, with a busy street scene in front of it, including moving vehicles and a clear sky.', 'Certainly, the image captures a vibrant night scene in London. Dominating the left side of the frame is the iconic Big Ben clock tower, its large clock face illuminated against the dark sky. The tower is a warm yellow, contrasting with the black night sky. \n\nIn the foreground, the bustling city life is evident with a busy street filled with cars and buses. The vehicles are blurred, suggesting they are moving at high speed, adding a dynamic element to the scene. \n\nIn the background, the Houses of Parliament can be seen, their distinctive architecture standing out even in the night. The image is taken from a low angle, which emphasizes the height of the Big Ben tower and gives a sense of scale to the cityscape. \n\nOverall, the image beautifully encapsulates a typical night in London, with its iconic landmarks and lively city life.', "The image captures a vibrant night scene in London. Dominating the left side of the frame is the iconic Big Ben, its clock face illuminated in white against the dark sky. 
The tower, a beacon of light, stands tall amidst the city's bustling activity.\n\nIn the foreground, the city's nightlife is in full swing. A busy street stretches out, lined with buildings that glow with warm lights. The street is a blur of motion, cars and buses streaking past, their headlights and taillights creating streaks of light that add to the dynamic energy of the scene.\n\nAbove it all, the sky is a deep, inky black, punctuated by the bright white lights of the city below. The contrast between the dark sky and the illuminated city creates a striking visual effect.\n\nDespite the bustling activity, there's a sense of order and harmony in the scene. The Big Ben, the street, and the buildings all coexist in this snapshot of London at night, each contributing to the city's unique charm.", 'The image shows a nighttime cityscape featuring the iconic Big Ben clock tower prominently in the foreground. The clock face is illuminated, and the tower is lit up with warm lights. In the background, there are other buildings with lights on, and a busy street scene with moving vehicles, creating a blur effect. The sky is dark, suggesting it is nighttime.', "Certainly, the image captures a vibrant night scene in London. Dominating the left side of the frame is the iconic Big Ben, its clock face illuminated against the dark sky. The tower is a warm yellow, contrasting with the blackness of the night. \n\nIn the foreground, the bustling city life is evident. A busy street stretches out, filled with cars and buses, their lights adding to the city's luminescence. The perspective of the photo is from ground level, looking up at the tower, giving a sense of the tower's grandeur.\n\nIn the background, the Palace of Westminster can be seen, its lights twinkling in the night. 
The sky above is a deep black, dotted with a few stars, adding a touch of serenity to the otherwise lively scene.\n\nThe image is a beautiful blend of architectural marvel and urban life, capturing the essence of London at night.", "The image captures a vibrant night scene in London. Dominating the left side of the frame is the iconic Big Ben, its clock face illuminated in white against the dark sky. The tower, a beacon of light, stands tall amidst the city's bustling activity.\n\nIn the foreground, the city's nightlife is in full swing. A busy street stretches out, lined with a variety of vehicles including buses and cars. The vehicles, though blurred due to motion, add a dynamic element to the scene.\n\nThe background is a blend of other architectural marvels, including the Houses of Parliament and the London Eye. These structures, though not as brightly lit as Big Ben, contribute to the overall illumination of the cityscape.\n\nThe image is a testament to London's lively nightlife and its architectural grandeur. It's a snapshot of a city that never sleeps, captured in the stillness of a single frame.", 'Certainly, the image captures a vibrant night scene in London. The iconic Big Ben clock tower stands tall, its clock face illuminated in white. The tower is surrounded by a cluster of buildings, their windows glowing with warm light. The sky above is a deep black, dotted with a few stars. The foreground is a blur of light trails from moving vehicles, suggesting the bustling city life.', 'The image shows a nighttime cityscape featuring the iconic Big Ben clock tower prominently in the foreground. The clock face is illuminated, and the tower is lit up with warm lights. In the background, there are other buildings with lights on, and a busy street scene with moving vehicles, creating a blur effect. The sky is dark, suggesting it is nighttime.', "The image shows the iconic Big Ben tower illuminated at night with a dark sky. 
In the foreground, there's a busy street scene with moving traffic, including a double-decker bus, and light trails from moving vehicles.", 'Certainly, the image captures a vibrant night scene in London. Dominating the left side of the frame is the iconic Big Ben clock tower, its large clock face illuminated against the dark sky. The tower is a warm yellow, contrasting with the black night sky. \n\nIn the foreground, the bustling city life is evident with a busy street filled with cars and buses. The vehicles are blurred, suggesting they are moving at high speed, adding a dynamic element to the scene. \n\nIn the background, the Houses of Parliament can be seen, their distinctive architecture standing out even in the night. The image is taken from a low angle, which emphasizes the height of the Big Ben tower and gives a sense of scale to the cityscape. \n\nOverall, the image beautifully encapsulates a typical night in London, with its iconic landmarks and lively city life.', "Certainly, the image captures a vibrant night scene in London. Dominating the left side of the frame is the iconic Big Ben, its clock face illuminated against the dark sky. The tower is a warm yellow, contrasting with the blackness of the night. \n\nIn the foreground, the bustling city life is evident. A busy street stretches out, filled with cars and buses, their lights adding to the city's luminescence. The perspective of the photo is from ground level, looking up at the tower, giving a sense of the tower's grandeur.\n\nIn the background, the Palace of Westminster can be seen, its lights twinkling in the night. The sky above is a deep black, dotted with a few stars, adding a touch of serenity to the otherwise lively scene.\n\nThe image is a beautiful blend of architectural marvel and urban life, capturing the essence of London at night.", "The image captures a vibrant night scene in London. 
Dominating the left side of the frame is the iconic Big Ben, its clock face illuminated in white against the dark sky. The tower, a beacon of light, stands tall amidst the city's bustling activity.\n\nIn the foreground, the city's nightlife is in full swing. A busy street stretches out, lined with a variety of vehicles including buses and cars. The vehicles, though blurred due to motion, add a dynamic element to the scene.\n\nThe background is a blend of other architectural marvels, including the Houses of Parliament and the London Eye. These structures, though not as brightly lit as Big Ben, contribute to the overall illumination of the cityscape.\n\nThe image is a testament to London's lively nightlife and its architectural grandeur. It's a snapshot of a city that never sleeps, captured in the stillness of a single frame.", 'Certainly, the image captures a vibrant night scene in London. The iconic Big Ben clock tower stands tall, its clock face illuminated in white. The tower is surrounded by a cluster of buildings, their windows glowing with warm light. The sky above is a deep black, dotted with a few stars. The foreground is a blur of light trails from moving vehicles, suggesting the bustling city life.', 'The image shows a nighttime cityscape featuring the iconic Big Ben clock tower prominently in the foreground. The clock face is illuminated, and the tower is lit up with warm lights. In the background, there are other buildings with lights on, and a busy street scene with moving vehicles, creating a blur effect. The sky is dark, suggesting it is nighttime.']

None have that odd "instruction" "Output" stuff.

ywang96 commented 4 months ago

@pseudotensor Thanks for getting back to me on this! I also went back and tried to repro the errors on the main branch, but I can't seem to do so:

from PIL import Image
import base64
import requests
from io import BytesIO
from openai import OpenAI

# The encoding function I linked previously - but we actually don't use this function in the API server
def encode_image_base64(image: Image.Image, format: str = 'JPEG') -> str:
    """encode image to base64 format."""

    buffered = BytesIO()
    if format == 'JPEG':
        image = image.convert('RGB')
    image.save(buffered, format)
    return base64.b64encode(buffered.getvalue()).decode('utf-8')

client = OpenAI(base_url='http://localhost:8000/v1', api_key="EMPTY")

url = "https://h2o-release.s3.amazonaws.com/h2ogpt/bigben.jpg"

image = Image.open(BytesIO(requests.get(url).content))

# Base64-encode the image loaded with Image.open()
# (named image_b64 so it does not shadow the imported base64 module)
image_b64 = encode_image_base64(image)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What do you see?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{image_b64}"
                },
            },
        ],
    }
]

response = client.chat.completions.create(
    model="microsoft/Phi-3-vision-128k-instruct",
    messages=messages,
    max_tokens=300,
)

print(response.choices[0])
➜  ~ python phi3v.py
Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content=' The image shows a large, illuminated clock tower at night. The tower has a traditional design with a pointed spire and a clock face visible. The surrounding area is busy with numerous light trails in the sky, suggesting the presence of moving vehicles. The scene is set against a dark night sky, highlighting the bright lights from the tower and vehicles.', role='assistant', function_call=None, tool_calls=[]), stop_reason=None)
➜  ~ python phi3v.py
Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content=' The image shows a night scene of the Big Ben clock tower in London. The clock face is illuminated, and the tower is a prominent feature against the night sky. In the foreground, there is a street with blurred vehicle lights, giving the impression of a busy, urban environment.', role='assistant', function_call=None, tool_calls=[]), stop_reason=None)
➜  ~ python phi3v.py
Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content=' The image shows the iconic Big Ben tower lit up at night with a dark sky above. Below the tower, there is a busy city street with blurred lights, likely from moving vehicles. In the background, there are other buildings with lights on, contributing to the urban night scene.', role='assistant', function_call=None, tool_calls=[]), stop_reason=None)
➜  ~ python phi3v.py
Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content=' The image captures the iconic Big Ben and the Houses of Parliament in London at night. The clock tower is illuminated and stands out against the dark sky. The reflection of Big Ben on the Thames River is visible, and the lights from the surrounding buildings create a vibrant cityscape. Traffic lights and moving vehicles are visible in the foreground, indicating that this is a bustling urban scene.', role='assistant', function_call=None, tool_calls=[]), stop_reason=None)

Can you try running from the main branch and see if you can still repro this?

ywang96 commented 4 months ago

Another thought: I wonder if this has something to do with b64decode vs. urlsafe_b64decode. If you don't mind, could you also try replacing b64decode with urlsafe_b64decode here and see if that fixes it? (I should have used the latter anyway, but I really wonder if that's the root cause.) https://github.com/vllm-project/vllm/blob/e9de9dd551ac595a9f3825fcd1507deceef4f332/vllm/multimodal/utils.py#L70
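
For anyone following along, here is a minimal sketch of how the two alphabets differ, using only Python's standard library. The three sample bytes are an assumption chosen so that the standard encoding contains `+` and `/`; this only illustrates the hypothesis and does not confirm it as the root cause.

```python
import base64
import binascii

# Sample bytes picked (as an assumption, for illustration) so that every
# 6-bit group maps to one of the two characters that differ between alphabets.
raw = bytes([0xFB, 0xFF, 0xFE])

std = base64.b64encode(raw).decode()           # standard alphabet uses '+' and '/'
safe = base64.urlsafe_b64encode(raw).decode()  # URL-safe alphabet uses '-' and '_'

# Each decoder round-trips its own alphabet.
assert base64.b64decode(std) == raw
assert base64.urlsafe_b64decode(safe) == raw


def strict_standard_decode_fails(text: str) -> bool:
    """Return True if strict standard decoding rejects the text."""
    try:
        base64.b64decode(text, validate=True)
    except binascii.Error:
        return True
    return False


print(std, safe, strict_standard_decode_fails(safe))
```

With validate=True the standard decoder rejects URL-safe input outright. Without it, characters outside the standard alphabet are discarded before decoding, which is the kind of silent corruption a mismatched decoder could cause.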

pseudotensor commented 4 months ago

I can confirm. The same kinds of tests no longer fail on vllm main. What fixed it?

ywang96 commented 4 months ago

I'm not sure, but looking through the main branch commits, I can only think of #5772. If you don't mind, please feel free to close this issue after you test with more prompts!

pseudotensor commented 4 months ago

I tried 100 random prompts with that same image, and no issues.
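
A rough sketch of how a batch of responses like that could be screened automatically. The regex is an assumption about what the stray "Instruction"/"Output" headers look like, and `has_prompt_leak` and `count_leaks` are hypothetical helpers, not part of vLLM or the OpenAI client.

```python
import re

# Assumed marker pattern: a line that starts with an optional markdown-style
# "#" prefix, then "Instruction" or "Output", an optional number, and a colon.
LEAK_PATTERN = re.compile(
    r"^\s*(?:#+\s*)?(?:instruction|output)\s*\d*\s*:",
    re.IGNORECASE | re.MULTILINE,
)


def has_prompt_leak(text: str) -> bool:
    """Return True if the text contains a line that looks like a leaked prompt header."""
    return LEAK_PATTERN.search(text) is not None


def count_leaks(responses: list[str]) -> int:
    """Count how many responses contain a suspected leaked prompt header."""
    return sum(has_prompt_leak(r) for r in responses)


# Made-up sample responses for illustration only.
samples = [
    "The image shows Big Ben at night.",
    "Big Ben at night.\n\nInstruction 1: Describe the image in detail.",
    "### Output: A clock tower.",
]
print(count_leaks(samples))
```

The strings returned in `response.choices[0].message.content` could be collected into a list and passed to `count_leaks` in place of `samples`.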