huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

LLaVa with multiple image input throws error: Image features and image tokens do not match #34284

Open · Shruthi42 opened this issue 3 days ago

Shruthi42 commented 3 days ago

System Info

transformers version: 4.46.0.dev0

Who can help?

@amyeroberts, @qubvel


Reproduction

Before #33608 (tested at commit d00f1ca860f19f4c0962882e56044bb6ef7b5626), the code below ran without error:

import torch
from transformers import LlavaForConditionalGeneration, LlavaProcessor

model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")
processor = LlavaProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
processor.patch_size = 14
processor.vision_feature_select_strategy = "default"

device = torch.device("cuda")
model = model.eval()
model = model.to(device)
# Batch of two prompts: two <image> placeholders in the first, one in the
# second, with three 336x336 images in total.
inputs = processor(
    text=["Sentence with two images 1. <image> 2. <image>", "Sentence with one image <image>"],
    images=torch.rand((3, 3, 336, 336), dtype=torch.float),
    return_tensors="pt",
    truncation=True,
    padding=True,
)
inputs = inputs.to(device)
with torch.no_grad():
    model(**inputs)

However, after #33608 (tested at commit 0f49deacbff3e57cde45222842c0db6375e4fa43), it fails with the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[1], line 21
     19 inputs = inputs.to(device)
     20 with torch.no_grad():
---> 21     model(**inputs)

File ~/miniconda3/envs/hf/lib/python3.10/site-packages/torch/nn/modules/module.py:1736, in Module._wrapped_call_impl(self, *args, **kwargs)
   1734     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1735 else:
-> 1736     return self._call_impl(*args, **kwargs)

File ~/miniconda3/envs/hf/lib/python3.10/site-packages/torch/nn/modules/module.py:1747, in Module._call_impl(self, *args, **kwargs)
   1742 # If we don't have any hooks, we want to skip the rest of the logic in
   1743 # this function, and just call forward.
   1744 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1745         or _global_backward_pre_hooks or _global_backward_hooks
   1746         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747     return forward_call(*args, **kwargs)
   1749 result = None
   1750 called_always_called_hooks = set()

File ~/miniconda3/envs/hf/lib/python3.10/site-packages/transformers/models/llava/modeling_llava.py:524, in LlavaForConditionalGeneration.forward(self, input_ids, pixel_values, attention_mask, position_ids, past_key_values, inputs_embeds, vision_feature_layer, vision_feature_select_strategy, labels, use_cache, output_attentions, output_hidden_states, return_dict, cache_position, num_logits_to_keep)
    522 n_image_features = image_features.shape[1]
    523 if n_image_tokens != n_image_features:
--> 524     raise ValueError(
    525         f"Image features and image tokens do not match: tokens: {n_image_tokens}, features {n_image_features}"
    526     )
    527 special_image_mask = (
    528     (input_ids == self.config.image_token_index)
    529     .unsqueeze(-1)
    530     .expand_as(inputs_embeds)
    531     .to(inputs_embeds.device)
    532 )
    533 image_features = image_features.to(inputs_embeds.device, inputs_embeds.dtype)

ValueError: Image features and image tokens do not match: tokens: 1152, features 576
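
For reference, the numbers appear to line up with the patch arithmetic of the setup above: at patch_size 14, each 336x336 image yields (336 / 14)^2 = 576 features (the "default" strategy drops the CLS token), so the first sequence's two <image> placeholders expand to 1152 image tokens, which the check then compares against the 576 features of a single image:

patches_per_image = (336 // 14) ** 2             # 24 * 24 = 576 features per image
n_tokens_first_sequence = 2 * patches_per_image  # two <image> placeholders in sequence 1
print(n_tokens_first_sequence, patches_per_image)  # -> 1152 576, as in the error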

Expected behavior

Before #33608, multi-image and variable-image input to LLaVa worked as expected. The check on image features and image tokens added in #33608 doesn't appear to account for (1) input sequences containing multiple images, or (2) batches with a variable number of images per sequence; a sketch of a batch-wide comparison that would cover both cases follows below.
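
For illustration, here is a minimal sketch (my assumption about one possible direction, not the library's actual fix) that tolerates both cases by comparing the total number of <image> tokens in the batch against the total number of image features, reusing the names from modeling_llava.py in the traceback above:

# Hypothetical sketch: compare batch-wide totals rather than the first
# sequence's token count against a single image's feature count.
n_image_tokens = (input_ids == self.config.image_token_index).sum().item()
n_image_features = image_features.shape[0] * image_features.shape[1]
if n_image_tokens != n_image_features:
    raise ValueError(
        f"Image features and image tokens do not match: tokens: {n_image_tokens}, features {n_image_features}"
    )
# For the reproduction above: tokens = 1152 + 576 = 1728 and
# features = 3 * 576 = 1728, so the batch would pass.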

LysandreJik commented 3 days ago

cc @zucchini-nlp :)