Pryest opened this issue 3 days ago
Hello, indeed, the original Llama lacks the ability to distinguish between similar captions, which is a fundamental obstacle preventing LLMs from effectively aiding CLIP training.
In the paper's introduction, we explained: “To validate our hypothesis, we designed a caption-to-caption retrieval experiment, as shown in Table 1 and Figure 2. Each image in the MS-COCO dataset has five human-annotated captions. We selected the first two captions as positive samples and performed retrieval across the entire validation set. Using the caption retrieval accuracy (CRA), we evaluated the text model’s ability to differentiate between captions, helping us determine which language model is better suited for CLIP. We found that Llama-3 8B achieved only 18.4% top-1 accuracy, while the standard CLIP-ViT-L reached 66.0% top-1 accuracy. As illustrated in Figure 2, the top-1 caption retrieved by the original Llama-3 can be entirely unrelated to the query caption, clearly obstructing effective CLIP learning. Therefore, directly using an LLM to guide CLIP’s visual encoder training is highly constrained.”
We conducted this experiment on the MS-COCO validation set to verify this hypothesis, assessing whether the original LLM could distinguish the captions belonging to the same image from all other captions in the validation set. The result confirmed that the original Llama was essentially unable to do so, whereas after our fine-tuning it could make these distinctions effectively.
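To make the setup concrete, here is a minimal sketch of how such a caption-to-caption retrieval accuracy (CRA) measurement could be implemented. This is not the paper's released code: `encode_fn` stands in for whatever text encoder is being evaluated (e.g. a CLIP text tower or an LLM embedding), and the query/gallery split (caption 1 queries against all images' caption 2) is one plausible reading of the description above.

```python
# Hypothetical sketch of caption-to-caption retrieval accuracy (CRA).
# Assumes caption_pairs = [(caption_1, caption_2), ...], one pair per image,
# and encode_fn: list[str] -> np.ndarray of shape (N, D).
import numpy as np

def caption_retrieval_accuracy(caption_pairs, encode_fn):
    """Query with each image's first caption against a gallery built from
    every image's second caption; count top-1 hits on the matching image."""
    queries = [a for a, _ in caption_pairs]
    gallery = [b for _, b in caption_pairs]

    q = encode_fn(queries)  # (N, D) query embeddings
    g = encode_fn(gallery)  # (N, D) gallery embeddings

    # L2-normalize so the dot product equals cosine similarity.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    g = g / np.linalg.norm(g, axis=1, keepdims=True)

    sims = q @ g.T                 # (N, N) similarity matrix
    top1 = sims.argmax(axis=1)     # nearest gallery caption for each query

    # Correct when the retrieved caption describes the same image as the query.
    return float((top1 == np.arange(len(caption_pairs))).mean())

# Example usage (hypothetical data loading):
# pairs = [(img["captions"][0], img["captions"][1]) for img in coco_val]
# cra = caption_retrieval_accuracy(pairs, my_text_encoder.encode)
```

Under this setup, a low CRA means the encoder's nearest-neighbor caption often comes from an unrelated image, which is exactly the failure mode reported for the original Llama-3 8B (18.4% top-1) versus CLIP-ViT-L (66.0% top-1).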
Hello, it is surprising that the original Llama retrieves almost unrelated captions. May I ask how you conducted the experiment? I failed to find more details. 😂