huggingface / transformers


Llava-Next with vision_feature_select_strategy == "full" raises an image size mismatch error #32395

Open insujang opened 1 month ago

insujang commented 1 month ago

System Info

transformers==4.42.3

Who can help?

@zucchini-nlp

Reproduction

Working example: using default vision_feature_select_strategy

from transformers.models.llava_next import LlavaNextForConditionalGeneration, LlavaNextProcessor
from PIL import Image
import requests
import torch

model = LlavaNextForConditionalGeneration.from_pretrained("llava-hf/llava-v1.6-vicuna-7b-hf").to(dtype=torch.bfloat16, device=torch.device("cuda"))
processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-vicuna-7b-hf")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image_cats = Image.open(requests.get(url, stream=True).raw)

prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"

input = processor(prompt, image_cats, padding=True, return_tensors="pt").to("cuda")
output = model.generate(**input, max_length=128)
processor.decode(output[0], skip_special_tokens=True)
'USER: \nWhat is shown in this image? ASSISTANT: The image shows two cats lying on a pink surface, which appears to be a couch or a bed. The cats are resting with their eyes closed, suggesting they are either sleeping or very relaxed. There are also two remote controls placed near the cats, which might indicate that the cats are in a living room or a space where people watch television. The cats have different patterns on their fur, with one being a tabby and the other a calico.'

Buggy example with full vision_feature_select_strategy

from transformers.models.llava_next import LlavaNextForConditionalGeneration, LlavaNextProcessor
from PIL import Image
import requests
import torch

model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-vicuna-7b-hf",
    vision_feature_select_strategy="full"
).to(dtype=torch.bfloat16, device=torch.device("cuda"))
processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-vicuna-7b-hf")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image_cats = Image.open(requests.get(url, stream=True).raw)

prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"

input = processor(prompt, image_cats, padding=True, return_tensors="pt").to("cuda")
output = model.generate(**input, max_length=128)
processor.decode(output[0], skip_special_tokens=True)
File /opt/conda/lib/python3.10/site-packages/transformers/models/llava_next/modeling_llava_next.py:662, in LlavaNextForConditionalGeneration.pack_image_features(self, image_features, image_sizes, image_newline)
height = width = self.config.vision_config.image_size // self.config.vision_config.patch_size
if height * width != base_image_feature.shape[0]:
    raise ValueError("The number of patches is not consistent with the image size.")

ValueError: The number of patches is not consistent with the image size.

Side note: there is another bug. If you pass vision_feature_select_strategy to the .generate() function, it is not forwarded to LlavaNextForConditionalGeneration.forward(), so the default strategy is silently used.

model = LlavaNextForConditionalGeneration.from_pretrained("llava-hf/llava-v1.6-vicuna-7b-hf").to(dtype=torch.bfloat16, device=torch.device("cuda"))
input = processor(prompt, image_cats, padding=True, return_tensors="pt").to("cuda")
output = model.generate(**input, vision_feature_select_strategy="full", max_length=128)

The example above "seems" to use full strategy, but it is actually using default strategy and returns no error, which is not intended.
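
One way to make generate() actually honor the strategy (a workaround sketch, based on the assumption that forward() falls back to the config value when the kwarg is not passed) is to set it on the config; note that this then triggers the main ValueError described above:

# Workaround sketch (untested): set the strategy on the config so that
# forward() picks it up even though generate() does not forward the kwarg.
# With "full" this then hits the main ValueError reported in this issue.
model.config.vision_feature_select_strategy = "full"
output = model.generate(**input, max_length=128)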

Back to the main bug example: the error is caused by the following lines: https://github.com/huggingface/transformers/blob/c1aa0edb48217f416f4bbe6e3a9db1500284513b/src/transformers/models/llava_next/modeling_llava_next.py#L789-L792
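
For readers not following the link, the selection logic at those lines is roughly the following (a paraphrase of the linked code, not a verbatim copy):

# Paraphrase of the linked selection logic in modeling_llava_next.py:
# "default" drops the first token (CLS), "full" keeps all tokens.
if vision_feature_select_strategy == "default":
    selected_image_feature = selected_image_feature[:, 1:]
elif vision_feature_select_strategy == "full":
    selected_image_feature = selected_image_feature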

Here, the selected_image_feature shape before applying the select strategy is [5, 577, 1024]. With the default select strategy, one feature is removed from the second dimension, giving a [5, 576, 1024] feature, which passes the check in self.pack_image_features(): https://github.com/huggingface/transformers/blob/c1aa0edb48217f416f4bbe6e3a9db1500284513b/src/transformers/models/llava_next/modeling_llava_next.py#L662-L663 since height = width = 24 and thus height * width = 576.
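
The 24 comes from the vision tower configuration; a quick sanity check, assuming the CLIP ViT-L/14 backbone at 336px resolution that llava-v1.6-vicuna-7b-hf uses:

# image_size=336, patch_size=14 for the CLIP vision tower
height = width = 336 // 14  # 24
print(height * width)  # 576 patch tokens; the 577th token in [5, 577, 1024] is CLS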

However, with the full select strategy the feature keeps its original [5, 577, 1024] shape, which raises the ValueError in self.pack_image_features() because height * width == 576 != 577 (the extra token is the CLS token).

Expected behavior

The model should run without error, potentially producing a different output from the one obtained with the default strategy.

zucchini-nlp commented 1 month ago

Hey @insujang! Right, this check passes only when the CLS token is removed from the image representation. We can check that the number of patches is height * width if the selection strategy is default; otherwise, check that it is height * width + 1. Would you like to open a PR for that?
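
A minimal sketch of what that check could look like in pack_image_features() (an illustration of the suggestion above, not the actual patch):

# Sketch of the suggested fix (illustrative only):
height = width = self.config.vision_config.image_size // self.config.vision_config.patch_size
expected_num_patches = height * width
if self.config.vision_feature_select_strategy == "full":
    expected_num_patches += 1  # CLS token is kept under the "full" strategy
if base_image_feature.shape[0] != expected_num_patches:
    raise ValueError("The number of patches is not consistent with the image size.")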

github-actions[bot] commented 2 days ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.