invictus717 opened this issue 17 hours ago
Meanwhile, have you tried scaling up the vision encoder? The main results are all obtained with ViT-L.
Thanks in advance!
Thanks for the comments.
Does LLM2CLIP fail to bring significant improvements only on ImageNet-1k, or on all of these zero-shot benchmarks?
Have you measured the average caption length used by your method versus vanilla EVA-02-CLIP? In my opinion, longer text captions do not always bring improvements.
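For reference, here is a minimal sketch of how one could compare the average caption length of the two training corpora, assuming both caption sets are dumped to plain-text files with one caption per line (the file names below are placeholders):

```python
# Rough comparison of average caption length, in whitespace-separated words.
# The file names are placeholders, not paths from the LLM2CLIP release.

def average_caption_length(path):
    total_words, total_captions = 0, 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            caption = line.strip()
            if not caption:
                continue
            total_words += len(caption.split())
            total_captions += 1
    return total_words / max(total_captions, 1)

for name, path in [("vanilla EVA-02-CLIP captions", "cc3m_original_captions.txt"),
                   ("LLM2CLIP dense captions", "cc3m_dense_captions.txt")]:
    print(f"{name}: {average_caption_length(path):.1f} words on average")
```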
It's reasonable that VLM performance improves on the SQA and VizWiz benchmarks, but it's strange that performance drops on fundamental benchmarks such as MME.
"We utilized the ShareCaptioner-modified CC-3M dataset (Zheng et al., 2024; Chen et al.,2023), which provides both original captions and augmented dense captions for each image, for contrastive learning."
May I ask how much performance improvement comes from the re-captioned, high-quality CC-3M dataset? Is the goal of this re-captioning to increase caption length? If so, LLM2CLIP can pretrain the ViT with longer, more detailed captions and thereby significantly improve retrieval performance. Is there an ablation study on the average caption length for LLM2CLIP?
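To make the question concrete, below is a rough sketch of the kind of pipeline I imagine for using both caption types in contrastive learning; the field names and the random-sampling strategy are my own assumptions, not necessarily what the paper does:

```python
import random
from torch.utils.data import Dataset

class DualCaptionDataset(Dataset):
    """Each record is assumed to hold a preloaded image plus both an original
    CC-3M caption and a ShareCaptioner-style dense caption; the field names
    here are hypothetical."""

    def __init__(self, records, transform, dense_prob=0.5):
        self.records = records          # list of dicts
        self.transform = transform      # image preprocessing
        self.dense_prob = dense_prob    # chance of picking the dense caption

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        rec = self.records[idx]
        image = self.transform(rec["image"])
        # Randomly alternate between the short original caption and the
        # longer dense caption as the positive text for contrastive learning.
        caption = (rec["dense_caption"]
                   if random.random() < self.dense_prob
                   else rec["original_caption"])
        return image, caption
```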
Thank you so much for the prompt response!
We haven't specifically tested it, and the improvement on ImageNet is indeed not very noticeable. With OpenAI's CLIP, we can achieve about a one-point improvement, which is modest compared to the gains on retrieval tasks. My guess is that we used a large amount of dense captions, which may cause the model to favor more complex text. However, we have found in experiments that ImageNet performance is strongly correlated with data volume, possibly because of the word distribution used during alignment. We only used 15 million data points for the alignment in LLM fine-tuning. In the next version, we'll increase the training data for LLM2CLIP by tens of times, so we plan to re-evaluate it then.
The improvement that long or dense captions bring to CLIP is quite limited. Works like LongCLIP (https://arxiv.org/abs/2403.15378) and DCI (https://arxiv.org/abs/2312.08578) specifically address this issue. The problem is that the original CLIP text encoder lacks the ability to understand such information or to handle captions of this length. However, LLM2CLIP, even when trained on a purely short-text dataset, still demonstrates outstanding and leading performance, as shown in Table 5 of the paper.
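For context, the original CLIP text encoder has a 77-token context window, so long dense captions are simply truncated before the text encoder ever sees them. A quick way to check this with the Hugging Face tokenizer (the checkpoint is the public OpenAI ViT-L/14 release; the caption is just an illustrative example):

```python
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

# An illustrative dense caption, long enough to exceed the 77-token limit.
dense_caption = (
    "A golden retriever lies on a worn wooden porch in the late afternoon sun, "
    "its head resting on a faded red blanket; behind it a screen door stands "
    "half open, a child's bicycle with a bent front wheel leans against the "
    "peeling white railing, two clay pots of red geraniums sit on the top step, "
    "and a gravel path winds past a rusty watering can toward an overgrown "
    "vegetable garden bordered by a low stone wall."
)

full = tokenizer(dense_caption)["input_ids"]
truncated = tokenizer(dense_caption, truncation=True, max_length=77)["input_ids"]
print(len(full), "tokens in the full caption")        # well over 77
print(len(truncated), "tokens the CLIP text encoder actually sees")  # capped at 77
```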
Are there any zero-shot classification results? In addition, are there more VLM evaluation results? The current experimental results do not seem convincing enough.
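To be concrete, the zero-shot classification evaluation I have in mind is the standard prompt-based protocol; a minimal sketch using the Hugging Face CLIP interface (the checkpoint, image path, and class list are placeholders, and any CLIP-style model exposing the same interface could be dropped in):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Placeholder checkpoint; swap in the model under evaluation.
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

class_names = ["golden retriever", "tabby cat", "airliner"]  # placeholder classes
prompts = [f"a photo of a {name}" for name in class_names]

image = Image.open("example.jpg")  # placeholder image path

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the scaled image-text similarities; the predicted
# class is the prompt with the highest similarity.
probs = outputs.logits_per_image.softmax(dim=-1)
print(class_names[probs.argmax(dim=-1).item()], probs.tolist())
```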