huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Patches for different modalities #34585

Open sm745052 opened 4 days ago

sm745052 commented 4 days ago

I was trying to find where the patches are created in the implementation. My understanding from the paper is that when there are multiple images, the complete images should be used instead of creating patches for each image. However, I could not find the implementation of this.

I was looking here https://github.com/huggingface/transformers/blob/33868a057c02f0368ba63bd1edb746be38fe3d90/src/transformers/models/llava_onevision/image_processing_llava_onevision.py#L680
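
As a quick check, here is a minimal sketch (not an official test) that calls the image processor directly with dummy images to see how many patches it produces per image. It assumes only the default `LlavaOnevisionImageProcessor` constructor (384x384 size, default grid pinpoints), so no checkpoint download is needed:

```python
# Minimal sketch: call the image processor directly with dummy images to see
# how many patches it produces per image. Uses only constructor defaults.
import numpy as np
from PIL import Image
from transformers import LlavaOnevisionImageProcessor

processor = LlavaOnevisionImageProcessor()

# Two dummy RGB images with different aspect ratios.
img_a = Image.fromarray(np.random.randint(0, 255, (500, 900, 3), dtype=np.uint8))
img_b = Image.fromarray(np.random.randint(0, 255, (700, 400, 3), dtype=np.uint8))

single = processor(images=[img_a], return_tensors="pt")
multi = processor(images=[img_a, img_b], return_tensors="pt")

# pixel_values is padded to (num_images, max_num_patches, channels, height, width).
# If num_patches stays > 1 per image in the two-image call, the anyres patching is
# applied to every image, regardless of how many images are passed.
print("single image:", single["pixel_values"].shape, single["image_sizes"])
print("two images:  ", multi["pixel_values"].shape, multi["image_sizes"])
```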

I am referring to this excerpt:

C.1 Token Strategy for Mixed-Modality Data
We provide a detailed explanation of our token strategy for handling mixed-modality data within
LLaVA-OneVision’s architecture, which is illustrated in Figure 3.
For single-image data, we employ the AnyResMax-9 strategy, as previously outlined in blog [64].
Using SO400M [158] as the Vision Encoder, each input image (or grid) is processed into 729 visual
tokens. Consequently, the maximum number of visual tokens for a single image is 729 × (1 + 9),
where 1 × 729 represents the base tokens and 9 × 729 accounts for the grid tokens.
For multi-image data, we utilize a simple padding strategy. Each image is first resized to fit within a
384x384 frame by zero-padding, as required by SO400M, while maintaining the aspect ratio. After
processing through the vision encoder, the zero-padding is removed from the tokens. Our training
data includes up to 12 images per instance, resulting in a maximum of 12 × 729 multi-image tokens.
For video data, we adopt a strategy similar to LLaVA-NeXT-Video [169]. Each frame is processed
through the vision encoder and then subjected to 2 × 2 bilinear interpolation, resulting in 196 tokens
per frame. We sample up to 32 frames per video, leading to a maximum of 32 × 196 video tokens.
As shown in Figure 3, the maximum number of tokens across different modalities is approximately
equal. This design strategy aims to balance the data from various modalities, ensuring more equitable
representation that is transferable from the perspective of the language model. For instance, a high-resolution image can be interpreted as a composition of multiple images, and multiple images can be
understood as a shorter video.
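
For reference, the per-modality maxima quoted above work out to visual-token budgets of the same order of magnitude:

```python
# Maximum visual-token budgets per modality, as stated in the excerpt above.
single_image = 729 * (1 + 9)   # base image + up to 9 anyres grids -> 7290 tokens
multi_image = 12 * 729         # up to 12 images, 729 tokens each  -> 8748 tokens
video = 32 * 196               # up to 32 frames, 196 tokens each  -> 6272 tokens
print(single_image, multi_image, video)
```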

Any help would be greatly appreciated.

zucchini-nlp commented 4 days ago

Hey @sm745052 !

Indeed, the paper says that multiple images should not be divided into patches, but the LLaVA-OV model was shipped following the inference pipeline from their demo notebook here. In the notebook, the images are divided into patches in multi-image cases.

I think they might have made a mistake in the notebook, in which case we shipped the wrong inference behavior, or the paper meant using only the base image when tuning the model with many images. Can you also open an issue in the LLaVA-VL repo to clarify this? We can make the appropriate changes if they confirm that the inference notebook is not correct :)
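
For contrast, the multi-image preprocessing described in the paper's appendix (aspect-preserving resize into a single 384x384 frame plus zero-padding, with the padded tokens dropped after the vision encoder) would look roughly like the standalone sketch below. This is only an illustration of the excerpt, not the HF or LLaVA-VL code, and details such as centering the image in the frame are assumptions:

```python
# Rough illustration of the multi-image strategy from the paper (appendix C.1):
# resize each image to fit inside one 384x384 frame while keeping the aspect
# ratio, then zero-pad. This is NOT the shipped HF preprocessing, which instead
# splits each image into anyres patches.
import numpy as np
from PIL import Image

def pad_to_frame(image: Image.Image, frame: int = 384) -> Image.Image:
    scale = frame / max(image.width, image.height)
    new_w, new_h = round(image.width * scale), round(image.height * scale)
    resized = image.resize((new_w, new_h), Image.BICUBIC)
    canvas = Image.new("RGB", (frame, frame), (0, 0, 0))  # zero-padding
    canvas.paste(resized, ((frame - new_w) // 2, (frame - new_h) // 2))
    return canvas

img = Image.fromarray(np.random.randint(0, 255, (500, 900, 3), dtype=np.uint8))
print(pad_to_frame(img).size)  # (384, 384) -> one SO400M pass, i.e. 729 tokens
```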