Presented.
@lbuess I had a few questions about the paper. Let's collect them here, along with the paper suggestion GitHub issue.
Questions:
- Is the text model initialized from a pretrained model or randomly initialized?
- Does the predictive performance of the model depend on the order of the tokens? (What about text first and image later? What about image tokens in the middle of text tokens?)
- Why exactly does the Interleaved-MoF MLLM method (the third method in Figure 7) perform well? Is it fair to say that the amount of "fusion" differs between Additive-MoF MLLM and Interleaved-MoF MLLM, and if so, which one would you say has more fusion?
- Can we interleave, say, 2 tokens from DINO followed by 2 tokens from CLIP, and so on?
The paper does not discuss model initialization in its main text. However, a review of the GitHub repository shows that the method, which is based on the LLaVA model, uses Vicuna weights to initialize the LLM.
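For reference, here's a minimal sketch of the difference between the two options using the Hugging Face `transformers` API (the Vicuna checkpoint name is just an example; the actual LLaVA code wires this up differently):

```python
# Minimal sketch (not the paper's code) of pretrained vs. random initialization
# for the language model of a LLaVA-style MLLM.
from transformers import AutoConfig, AutoModelForCausalLM

# Pretrained initialization: load the released Vicuna weights (what LLaVA does).
llm = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")

# Random initialization: same architecture, but weights come from the config's
# default initializer instead of a checkpoint.
config = AutoConfig.from_pretrained("lmsys/vicuna-7b-v1.5")
llm_scratch = AutoModelForCausalLM.from_config(config)
```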
For multimodal models like LLaVA, which process text and image data, the order of inputs can affect performance, depending on the model's training procedure. Based on my understanding of LLaVA's training process, the prompt during training roughly follows the pattern `USER: <image>\n<question> ASSISTANT: <answer>`, i.e. the image tokens are inserted before the question text. Consequently, to optimize performance, it would be better to also maintain this order during inference.
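As a toy illustration (not the actual LLaVA code), the input sequence under that convention is just the projected image tokens concatenated in front of the embedded text tokens; the shapes below are illustrative:

```python
import torch

def build_multimodal_input(image_tokens: torch.Tensor,
                           text_tokens: torch.Tensor) -> torch.Tensor:
    """Concatenate projected image features and embedded text tokens.

    image_tokens: (num_image_tokens, hidden_dim) visual features after the
                  vision-to-language projection.
    text_tokens:  (num_text_tokens, hidden_dim) embedded question tokens.
    """
    # Image first, text second -- the same order used during training.
    return torch.cat([image_tokens, text_tokens], dim=0)

# Illustrative shapes for a LLaVA-1.5-like setup (576 visual tokens, hidden size 4096).
img = torch.randn(576, 4096)
txt = torch.randn(32, 4096)
seq = build_multimodal_input(img, txt)  # shape (608, 4096)
```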
The interleaved method likely performs well because it allows the model to selectively attend to the relevant features, whereas the additive method dilutes the individual SSL or CLIP features by combining them beforehand. Determining which approach achieves more fusion is difficult: in the additive method, fusion occurs before the features reach the LLM, while in the interleaved method, fusion happens inside the LLM's attention mechanism.
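To make the distinction concrete, here is a rough PyTorch sketch of the two variants as I understand them from Figure 7; the mixing coefficient, shapes, and function names are mine, not the authors':

```python
import torch

def additive_mof(clip_feats, dino_feats, alpha=0.5):
    """Additive MoF (sketch): mix the two feature sets before the LLM.

    clip_feats, dino_feats: (num_tokens, hidden_dim), already projected into
    the LLM embedding space. alpha is the mixing ratio; 0.5 is just an example.
    """
    return alpha * clip_feats + (1.0 - alpha) * dino_feats

def interleaved_mof(clip_feats, dino_feats):
    """Interleaved MoF (sketch): alternate CLIP and DINO tokens so the LLM
    sees both streams, preserving the spatial order within each stream."""
    num_tokens, dim = clip_feats.shape
    out = torch.empty(2 * num_tokens, dim, dtype=clip_feats.dtype)
    out[0::2] = clip_feats
    out[1::2] = dino_feats
    return out

clip_feats = torch.randn(576, 4096)
dino_feats = torch.randn(576, 4096)
print(additive_mof(clip_feats, dino_feats).shape)     # (576, 4096)
print(interleaved_mof(clip_feats, dino_feats).shape)  # (1152, 4096)
```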
In theory, you could alternate tokens from DINO and CLIP in the way you proposed because the model uses positional encodings to track where each token belongs. However, for simplicity I think it’s easier to keep tokens from the same part of the image together.
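Building on the sketch above, a 2-and-2 interleaving pattern would look something like this (again just an illustration; `interleave_groups` is a made-up helper):

```python
import torch

def interleave_groups(clip_feats, dino_feats, group=2):
    """Interleave tokens in groups of `group` (e.g. 2 CLIP, then 2 DINO, ...).

    Assumes the token count is divisible by `group`; the positional encodings
    added afterwards tell the LLM where each token sits in the sequence.
    """
    n, d = clip_feats.shape
    c = clip_feats.reshape(n // group, group, d)
    s = dino_feats.reshape(n // group, group, d)
    # Stack the two streams group-wise, then flatten back to one sequence.
    return torch.stack([c, s], dim=1).reshape(-1, d)

out = interleave_groups(torch.randn(576, 4096), torch.randn(576, 4096))
print(out.shape)  # torch.Size([1152, 4096])
```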
Thanks for the answer!
Is vision good enough for language? Recent advancements in multimodal models primarily stem from the powerful reasoning abilities of large language models (LLMs). However, the visual component typically depends only on the instance-level contrastive language-image pre-training (CLIP). Our research reveals that the visual capabilities in recent multimodal LLMs (MLLMs) still exhibit systematic shortcomings. To understand the roots of these errors, we explore the gap between the visual embedding space of CLIP and vision-only self-supervised learning. We identify ''CLIP-blind pairs'' - images that CLIP perceives as similar despite their clear visual differences. With these pairs, we construct the Multimodal Visual Patterns (MMVP) benchmark. MMVP exposes areas where state-of-the-art systems, including GPT-4V, struggle with straightforward questions across nine basic visual patterns, often providing incorrect answers and hallucinated explanations. We further evaluate various CLIP-based vision-and-language models and found a notable correlation between visual patterns that challenge CLIP models and those problematic for multimodal LLMs. As an initial effort to address these issues, we propose a Mixture of Features (MoF) approach, demonstrating that integrating vision self-supervised learning features with MLLMs can significantly enhance their visual grounding capabilities. Together, our research suggests visual representation learning remains an open challenge, and accurate visual grounding is crucial for future successful multimodal systems.
Paper Link: https://arxiv.org/abs/2401.06209
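For anyone curious how the benchmark is built, here is a rough sketch of the CLIP-blind-pair criterion described in the abstract; the thresholds and function name are illustrative, not necessarily the paper's exact setup:

```python
# Sketch: two images form a "CLIP-blind pair" if CLIP sees them as nearly
# identical while a vision-only encoder (e.g. DINOv2) sees them as different.
import torch
import torch.nn.functional as F

def is_clip_blind_pair(clip_a: torch.Tensor, clip_b: torch.Tensor,
                       dino_a: torch.Tensor, dino_b: torch.Tensor,
                       clip_thresh: float = 0.95,
                       dino_thresh: float = 0.6) -> bool:
    clip_sim = F.cosine_similarity(clip_a, clip_b, dim=0)
    dino_sim = F.cosine_similarity(dino_a, dino_b, dim=0)
    return bool(clip_sim > clip_thresh) and bool(dino_sim < dino_thresh)
```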