invictus717 opened this issue 17 hours ago
Meanwhile, have you tried scaling up the vision encoder? The main results are all obtained with ViT-L.
Thanks in advance!
Thanks for the comments.
Does LLM2CLIP fail to bring significant improvements only on ImageNet-1k, or on all of these zero-shot benchmarks?
Have you measured the average caption length used by your method versus vanilla EVA-02-CLIP? In my opinion, longer text captions do not always bring improvements.
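For reference, here is a minimal sketch of how one could compare the average caption length of the two training corpora, assuming both caption sets are dumped to plain-text files with one caption per line (the file names below are placeholders):

```python
# Rough comparison of average caption length, in whitespace-separated words.
# The file names are placeholders, not paths from the LLM2CLIP release.

def average_caption_length(path):
    total_words, total_captions = 0, 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            caption = line.strip()
            if not caption:
                continue
            total_words += len(caption.split())
            total_captions += 1
    return total_words / max(total_captions, 1)

for name, path in [("vanilla EVA-02-CLIP captions", "cc3m_original_captions.txt"),
                   ("LLM2CLIP dense captions", "cc3m_dense_captions.txt")]:
    print(f"{name}: {average_caption_length(path):.1f} words on average")
```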
It's reasonable that VLM performance improves on the SQA and VizWiz benchmarks, but it's strange that performance drops on fundamental benchmarks such as MME.
"We utilized the ShareCaptioner-modified CC-3M dataset (Zheng et al., 2024; Chen et al.,2023), which provides both original captions and augmented dense captions for each image, for contrastive learning."
May I ask how much performance improvement comes from the re-captioned, high-quality CC-3M dataset? Is the goal of this re-captioning to increase caption length? If so, LLM2CLIP can pretrain the ViT with longer, more detailed captions and thereby significantly improve retrieval performance. Is there an ablation study on the average caption length for LLM2CLIP?
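To make the question concrete, below is a rough sketch of the kind of pipeline I imagine for using both caption types in contrastive learning; the field names and the random-sampling strategy are my own assumptions, not necessarily what the paper does:

```python
import random
from torch.utils.data import Dataset

class DualCaptionDataset(Dataset):
    """Each record is assumed to hold a preloaded image plus both an original
    CC-3M caption and a ShareCaptioner-style dense caption; the field names
    here are hypothetical."""

    def __init__(self, records, transform, dense_prob=0.5):
        self.records = records          # list of dicts
        self.transform = transform      # image preprocessing
        self.dense_prob = dense_prob    # chance of picking the dense caption

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        rec = self.records[idx]
        image = self.transform(rec["image"])
        # Randomly alternate between the short original caption and the
        # longer dense caption as the positive text for contrastive learning.
        caption = (rec["dense_caption"]
                   if random.random() < self.dense_prob
                   else rec["original_caption"])
        return image, caption
```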
Thank you so much for the prompt response!
We haven't specifically tested it, and the improvement on ImageNet is indeed not very noticeable. With OpenAI's CLIP, we can achieve about a one-point improvement, which is modest compared to the gains on retrieval tasks. My guess is that we used a large amount of dense captions, which may cause the model to favor more complex text. However, we have found in experiments that ImageNet performance is strongly correlated with data volume, possibly because of the word distribution used during alignment. We only used 15 million data points for the alignment in LLM fine-tuning. In the next version, we'll increase the training data for LLM2CLIP by tens of times, so we plan to re-evaluate it then.
The improvement that long or dense captions bring to CLIP is quite limited. Works like LongCLIP (https://arxiv.org/abs/2403.15378) and DCI (https://arxiv.org/abs/2312.08578) specifically address this issue. The problem is that the original CLIP text encoder lacks the ability to understand such information or to handle captions of this length. However, LLM2CLIP, even when trained on a purely short-text dataset, still demonstrates outstanding and leading performance, as shown in Table 5 of the paper.
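For context, the original CLIP text encoder has a 77-token context window, so long dense captions are simply truncated before the text encoder ever sees them. A quick way to check this with the Hugging Face tokenizer (the checkpoint is the public OpenAI ViT-L/14 release; the caption is just an illustrative example):

```python
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

# An illustrative dense caption, long enough to exceed the 77-token limit.
dense_caption = (
    "A golden retriever lies on a worn wooden porch in the late afternoon sun, "
    "its head resting on a faded red blanket; behind it a screen door stands "
    "half open, a child's bicycle with a bent front wheel leans against the "
    "peeling white railing, two clay pots of red geraniums sit on the top step, "
    "and a gravel path winds past a rusty watering can toward an overgrown "
    "vegetable garden bordered by a low stone wall."
)

full = tokenizer(dense_caption)["input_ids"]
truncated = tokenizer(dense_caption, truncation=True, max_length=77)["input_ids"]
print(len(full), "tokens in the full caption")        # well over 77
print(len(truncated), "tokens the CLIP text encoder actually sees")  # capped at 77
```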
Are there any zero-shot classification results? In addition, are there more VLM evaluation results? The current experimental results do not seem convincing enough.
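To be concrete, the zero-shot classification evaluation I have in mind is the standard prompt-based protocol; a minimal sketch using the Hugging Face CLIP interface (the checkpoint, image path, and class list are placeholders, and any CLIP-style model exposing the same interface could be dropped in):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Placeholder checkpoint; swap in the model under evaluation.
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

class_names = ["golden retriever", "tabby cat", "airliner"]  # placeholder classes
prompts = [f"a photo of a {name}" for name in class_names]

image = Image.open("example.jpg")  # placeholder image path

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the scaled image-text similarities; the predicted
# class is the prompt with the highest similarity.
probs = outputs.logits_per_image.softmax(dim=-1)
print(class_names[probs.argmax(dim=-1).item()], probs.tolist())
```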