[ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[ ] My own task or dataset (give details below)
Reproduction
Before #33608 (tested on commit d00f1ca860f19f4c0962882e56044bb6ef7b5626), the code below ran without error:
import torch
from transformers import LlavaForConditionalGeneration, LlavaProcessor

model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")
processor = LlavaProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
processor.patch_size = 14
processor.vision_feature_select_strategy = "default"

device = torch.device("cuda")
model = model.eval()
model = model.to(device)

inputs = processor(
    # Two prompts: the first references two images, the second references one.
    text=["Sentence with two images 1. <image> 2. <image>", "Sentence with one image <image>"],
    # Three random 336x336 RGB images in total, one per <image> placeholder.
    images=torch.rand((3, 3, 336, 336), dtype=torch.float),
    return_tensors="pt",
    truncation=True,
    padding=True,
)
inputs = inputs.to(device)

with torch.no_grad():
    model(**inputs)
However, after #33608 (tested on commit 0f49deacbff3e57cde45222842c0db6375e4fa43), it fails with the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[1], line 21
19 inputs = inputs.to(device)
20 with torch.no_grad():
---> 21 model(**inputs)
File ~/miniconda3/envs/hf/lib/python3.10/site-packages/torch/nn/modules/module.py:1736, in Module._wrapped_call_impl(self, *args, **kwargs)
1734 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1735 else:
-> 1736 return self._call_impl(*args, **kwargs)
File ~/miniconda3/envs/hf/lib/python3.10/site-packages/torch/nn/modules/module.py:1747, in Module._call_impl(self, *args, **kwargs)
1742 # If we don't have any hooks, we want to skip the rest of the logic in
1743 # this function, and just call forward.
1744 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1745 or _global_backward_pre_hooks or _global_backward_hooks
1746 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747 return forward_call(*args, **kwargs)
1749 result = None
1750 called_always_called_hooks = set()
File ~/miniconda3/envs/hf/lib/python3.10/site-packages/transformers/models/llava/modeling_llava.py:524, in LlavaForConditionalGeneration.forward(self, input_ids, pixel_values, attention_mask, position_ids, past_key_values, inputs_embeds, vision_feature_layer, vision_feature_select_strategy, labels, use_cache, output_attentions, output_hidden_states, return_dict, cache_position, num_logits_to_keep)
522 n_image_features = image_features.shape[1]
523 if n_image_tokens != n_image_features:
--> 524 raise ValueError(
525 f"Image features and image tokens do not match: tokens: {n_image_tokens}, features {n_image_features}"
526 )
527 special_image_mask = (
528 (input_ids == self.config.image_token_index)
529 .unsqueeze(-1)
530 .expand_as(inputs_embeds)
531 .to(inputs_embeds.device)
532 )
533 image_features = image_features.to(inputs_embeds.device, inputs_embeds.dtype)
ValueError: Image features and image tokens do not match: tokens: 1152, features 576
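For reference, both numbers in the error can be reproduced by hand. The sketch below is plain arithmetic, not library code. It assumes the "default" strategy leaves one feature per 14x14 patch (CLS token dropped) and that the check counts placeholder tokens in the first sequence only; the latter is my reading of the reported values, not something the traceback shows. The helper name is hypothetical.

```python
# Plain arithmetic sketch; `features_per_image` is a hypothetical helper,
# not part of transformers.
IMAGE_SIZE = 336   # height/width of the random images in the reproduction
PATCH_SIZE = 14    # processor.patch_size in the reproduction


def features_per_image(image_size: int, patch_size: int) -> int:
    """Patches per image, assuming the CLS token is dropped ("default")."""
    return (image_size // patch_size) ** 2


per_image = features_per_image(IMAGE_SIZE, PATCH_SIZE)

images_per_prompt = [2, 1]  # "<image>" placeholders in each prompt
tokens_per_prompt = [n * per_image for n in images_per_prompt]

print(per_image)          # 576: matches "features 576"
print(tokens_per_prompt)  # [1152, 576]: first entry matches "tokens: 1152"
print(sum(tokens_per_prompt), sum(images_per_prompt) * per_image)  # 1728 1728
```

Counted over the whole batch, tokens and features agree (1728 each). The mismatch only appears when one sequence's token count is compared with one image's feature count.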
Expected behavior
Before #33608, multi-image inputs and batches with a varying number of images per sequence worked as expected with LLaVA. The check on image features and image tokens added in #33608 does not seem to account for (1) input sequences that contain multiple images or (2) batches in which the number of images differs between sequences.
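One possible way to make such a check tolerate both cases is to compare totals across the whole batch rather than per sequence. A minimal sketch in plain Python follows. The function name, the placeholder id 32000, and the list-based inputs are assumptions for illustration only; this is not the code in modeling_llava.py.

```python
from typing import List

IMAGE_TOKEN_ID = 32000  # assumed placeholder id, for illustration only


def check_image_tokens(
    input_ids: List[List[int]], n_images: int, features_per_image: int
) -> None:
    """Raise if the batch-wide placeholder count differs from the feature count."""
    # Count placeholders over every sequence, not just the first one.
    n_image_tokens = sum(row.count(IMAGE_TOKEN_ID) for row in input_ids)
    # Every image contributes the same number of features.
    n_image_features = n_images * features_per_image
    if n_image_tokens != n_image_features:
        raise ValueError(
            f"Image features and image tokens do not match: "
            f"tokens: {n_image_tokens}, features {n_image_features}"
        )


# Toy batch shaped like the reproduction: 2 images in the first prompt,
# 1 image (plus padding, id 0) in the second.
batch = [
    [1] + [IMAGE_TOKEN_ID] * (2 * 576),
    [1] + [IMAGE_TOKEN_ID] * 576 + [0] * 576,
]
check_image_tokens(batch, n_images=3, features_per_image=576)  # passes silently
```

With this formulation a batch is rejected only when the total number of placeholders and the total number of image features actually disagree, for example when three images are referenced but only two are supplied.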
System Info
transformers version: 4.46.0.dev0
Who can help?
@amyeroberts, @qubvel