huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Patches for different modalities #34585

Open sm745052 opened 4 days ago

sm745052 commented 4 days ago

I was trying to find where the patches are created in the implementation. My understanding from the paper is that when there are multiple images, the complete images should be used instead of creating patches for each image. However, I could not find the implementation of this.

I was looking here https://github.com/huggingface/transformers/blob/33868a057c02f0368ba63bd1edb746be38fe3d90/src/transformers/models/llava_onevision/image_processing_llava_onevision.py#L680
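
As a quick check, here is a minimal sketch (not an official test) that calls the image processor directly with dummy images to see how many patches it produces per image. It assumes only the default `LlavaOnevisionImageProcessor` constructor (384x384 size, default grid pinpoints), so no checkpoint download is needed:

```python
# Minimal sketch: call the image processor directly with dummy images to see
# how many patches it produces per image. Uses only constructor defaults.
import numpy as np
from PIL import Image
from transformers import LlavaOnevisionImageProcessor

processor = LlavaOnevisionImageProcessor()

# Two dummy RGB images with different aspect ratios.
img_a = Image.fromarray(np.random.randint(0, 255, (500, 900, 3), dtype=np.uint8))
img_b = Image.fromarray(np.random.randint(0, 255, (700, 400, 3), dtype=np.uint8))

single = processor(images=[img_a], return_tensors="pt")
multi = processor(images=[img_a, img_b], return_tensors="pt")

# pixel_values is padded to (num_images, max_num_patches, channels, height, width).
# If num_patches stays > 1 per image in the two-image call, the anyres patching is
# applied to every image, regardless of how many images are passed.
print("single image:", single["pixel_values"].shape, single["image_sizes"])
print("two images:  ", multi["pixel_values"].shape, multi["image_sizes"])
```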

I am referring to this excerpt:

C.1 Token Strategy for Mixed-Modality Data
We provide a detailed explanation of our token strategy for handling mixed-modality data within
LLaVA-OneVision’s architecture, which is illustrated in Figure 3.
For single-image data, we employ the AnyResMax-9 strategy, as previously outlined in blog [64].
Using SO400M [158] as the Vision Encoder, each input image (or grid) is processed into 729 visual
tokens. Consequently, the maximum number of visual tokens for a single image is 729 × (1 + 9),
where 1 × 729 represents the base tokens and 9 × 729 accounts for the grid tokens.
For multi-image data, we utilize a simple padding strategy. Each image is first resized to fit within a
384x384 frame by zero-padding, as required by SO400M, while maintaining the aspect ratio. After
processing through the vision encoder, the zero-padding is removed from the tokens. Our training
data includes up to 12 images per instance, resulting in a maximum of 12 × 729 multi-image tokens.
For video data, we adopt a strategy similar to LLaVA-NeXT-Video [169]. Each frame is processed
through the vision encoder and then subjected to 2 × 2 bilinear interpolation, resulting in 196 tokens
per frame. We sample up to 32 frames per video, leading to a maximum of 32 × 196 video tokens.
As shown in Figure 3, the maximum number of tokens across different modalities is approximately
equal. This design strategy aims to balance the data from various modalities, ensuring more equitable
representation that is transferable from the perspective of the language model. For instance, a high-resolution image can be interpreted as a composition of multiple images, and multiple images can be
understood as a shorter video.
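
For reference, the per-modality maxima quoted above work out to visual-token budgets of the same order of magnitude:

```python
# Maximum visual-token budgets per modality, as stated in the excerpt above.
single_image = 729 * (1 + 9)   # base image + up to 9 anyres grids -> 7290 tokens
multi_image = 12 * 729         # up to 12 images, 729 tokens each  -> 8748 tokens
video = 32 * 196               # up to 32 frames, 196 tokens each  -> 6272 tokens
print(single_image, multi_image, video)
```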

Any help would be greatly appreciated.

zucchini-nlp commented 4 days ago

Hey @sm745052 !

Indeed, the paper says that multiple images should not be divided into patches, but the LLaVA-OV model was shipped following the inference pipeline from their demo notebook here. In the notebook, the images are divided into patches in multi-image cases.

I think they might have made a mistake in the notebook, in which case we shipped the wrong inference behavior, or the paper meant using only the base image when tuning the model with many images. Can you also open an issue in the LLaVA-VL repo to clarify this? We can make the appropriate changes if they confirm that the inference notebook is not correct :)
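
For contrast, the multi-image preprocessing described in the paper's appendix (aspect-preserving resize into a single 384x384 frame plus zero-padding, with the padded tokens dropped after the vision encoder) would look roughly like the standalone sketch below. This is only an illustration of the excerpt, not the HF or LLaVA-VL code, and details such as centering the image in the frame are assumptions:

```python
# Rough illustration of the multi-image strategy from the paper (appendix C.1):
# resize each image to fit inside one 384x384 frame while keeping the aspect
# ratio, then zero-pad. This is NOT the shipped HF preprocessing, which instead
# splits each image into anyres patches.
import numpy as np
from PIL import Image

def pad_to_frame(image: Image.Image, frame: int = 384) -> Image.Image:
    scale = frame / max(image.width, image.height)
    new_w, new_h = round(image.width * scale), round(image.height * scale)
    resized = image.resize((new_w, new_h), Image.BICUBIC)
    canvas = Image.new("RGB", (frame, frame), (0, 0, 0))  # zero-padding
    canvas.paste(resized, ((frame - new_w) // 2, (frame - new_h) // 2))
    return canvas

img = Image.fromarray(np.random.randint(0, 255, (500, 900, 3), dtype=np.uint8))
print(pad_to_frame(img).size)  # (384, 384) -> one SO400M pass, i.e. 729 tokens
```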