Welcome to the official repository for LLM2CLIP! This project leverages large language models (LLMs) as powerful textual teachers for CLIP's visual encoder, enabling more nuanced and comprehensive multimodal learning.
Paper: Preprint, currently under review. Accepted at the NeurIPS 2024 Workshop on Self-Supervised Learning: Theory and Practice.
Current versions of CLIP are held back mainly by their text encoder: it handles only short captions and offers limited language understanding, which constrains the supervision available to the visual encoder.
LLM2CLIP brings the power of large language models to CLIP, and the gains even transfer across languages: fine-tuned purely on an English corpus, LLM2CLIP outperforms CLIP models trained natively on Chinese data.
While LLMs have strong inherent text encoding capabilities, the output space is often not highly separable, which limits their effectiveness for contrastive learning.
To overcome this, we designed a Caption-to-Caption Contrastive Learning strategy: the LLM is fine-tuned to better distinguish captions of the same image from captions of different images, which makes its output space far more separable. We then froze the LLM and fine-tuned CLIP's visual encoder against it on limited data, resulting in significant performance gains.
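As a rough illustration (not the repository's actual training code), the caption-to-caption objective can be written as a symmetric InfoNCE loss over paired captions of the same image. The two-captions-per-image batch layout, the pooling of LLM hidden states into one embedding per caption, and the temperature value are assumptions made for this sketch.

```python
# Illustrative sketch of a caption-to-caption contrastive objective
# (not the repository's exact implementation). Assumes each image
# contributes two captions per batch; captions of the same image are positives.
import torch
import torch.nn.functional as F

def caption_contrastive_loss(emb_a, emb_b, temperature=0.05):
    """emb_a, emb_b: [B, D] pooled LLM embeddings of two captions per image."""
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    logits = emb_a @ emb_b.t() / temperature               # [B, B] similarity matrix
    targets = torch.arange(emb_a.size(0), device=emb_a.device)
    # Symmetric InfoNCE: matching caption pairs sit on the diagonal.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```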
Through this strategy, we better exploit the LLM's ability to comprehend and process long, dense captions, improving the overall quality of the learned representations.
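The second stage described above, with the LLM frozen and only CLIP's visual side updated, might look roughly like the following sketch. The module names (`llm_text_encoder`, `visual_encoder`, `adapter`) are placeholders, not this repository's actual API.

```python
# Rough sketch of the second training stage (placeholder module names):
# the LLM text encoder stays frozen, while CLIP's visual encoder and a
# small projection head receive gradients.
import torch
import torch.nn.functional as F

def train_step(visual_encoder, adapter, llm_text_encoder,
               images, captions, optimizer, tau=0.07):
    with torch.no_grad():                                  # LLM stays frozen
        text_feat = llm_text_encoder(captions)             # [B, D]
    img_feat = adapter(visual_encoder(images))             # [B, D], trainable path
    img_feat = F.normalize(img_feat, dim=-1)
    text_feat = F.normalize(text_feat, dim=-1)
    logits = img_feat @ text_feat.t() / tau
    targets = torch.arange(images.size(0), device=images.device)
    loss = (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```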
Stay tuned for updates on pretrained models and datasets, which will be made available in the HuggingFace Model Zoo.
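Until then, loading a released checkpoint will presumably follow the standard Hugging Face `transformers` pattern; the repository id below is a placeholder, and the final loading interface may differ.

```python
# Hypothetical loading example; "<org>/llm2clip-checkpoint" is a placeholder
# until the actual Hugging Face model ids are announced.
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained("<org>/llm2clip-checkpoint", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("<org>/llm2clip-checkpoint", trust_remote_code=True)
```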
Create the environment:
```bash
conda create -n llm2clip python=3.8
conda activate llm2clip
pip install -r requirements.txt
```
Data Preparation:
(Coming Soon)
🔥 Training:
```bash
sh run.sh
```
Our code is built on top of EVA-CLIP. We would like to thank the EVA team for their foundational work.
If you use our work, please cite:
```bibtex
@misc{huang2024llm2clippowerfullanguagemodel,
      title={LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation},
      author={Weiquan Huang and Aoqi Wu and Yifan Yang and Xufang Luo and Yuqing Yang and Liang Hu and Qi Dai and Xiyang Dai and Dongdong Chen and Chong Luo and Lili Qiu},
      year={2024},
      eprint={2411.04997},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.04997},
}
```