This is the official repository for the multi-modal large language models LaVIT and Video-LaVIT. The LaVIT project aims to leverage the exceptional capabilities of LLMs to handle visual content. The proposed pre-training strategy supports visual understanding and generation within one unified framework.
Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization, ICLR 2024 [arXiv] [BibTeX]
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [arXiv] [Project] [BibTeX]
2024.04.21
🚀🚀🚀 We have released the pre-trained weights for Video-LaVIT on HuggingFace and provide the inference code.
2024.02.05
🌟🌟🌟 We have proposed Video-LaVIT: an effective multimodal pre-training approach that empowers LLMs to comprehend and generate video content in a unified framework.
2024.01.15
👏👏👏 LaVIT has been accepted by ICLR 2024!
2023.10.17
🚀🚀🚀 We have released the pre-trained weights for LaVIT on HuggingFace and provide the inference code for both multi-modal understanding and generation.
LaVIT and Video-LaVIT are general-purpose multi-modal foundation models that inherit the successful learning paradigm of LLMs: predicting the next visual or textual token in an auto-regressive manner. The core design of the LaVIT series includes a visual tokenizer and a detokenizer. The visual tokenizer translates non-linguistic visual content (e.g., images, video) into a sequence of discrete tokens, like a foreign language that the LLM can read. The detokenizer maps the discrete tokens generated by the LLM back into continuous visual signals.
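To make the tokenizer/detokenizer idea concrete, below is a minimal, self-contained sketch of how discrete visual tokenization can work in principle. The `VisualTokenizer` and `VisualDetokenizer` classes, the codebook size, and the fake patch features are all toy placeholders invented for illustration; they are not the actual modules or APIs shipped in this repository (see the inference code for the real entry points).

```python
import torch
import torch.nn as nn

class VisualTokenizer(nn.Module):
    """Toy stand-in: quantizes continuous patch features into discrete token ids."""
    def __init__(self, codebook_size=16384, embed_dim=32):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, embed_dim)

    def encode(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # Nearest-codebook-entry lookup: each patch feature becomes one discrete id,
        # so the image is rewritten as a sequence of "words" the LLM can read.
        dists = torch.cdist(patch_feats, self.codebook.weight)
        return dists.argmin(dim=-1)          # shape: (num_patches,)

class VisualDetokenizer(nn.Module):
    """Toy stand-in: maps discrete token ids back toward continuous visual signals."""
    def __init__(self, codebook: nn.Embedding):
        super().__init__()
        self.codebook = codebook

    def decode(self, token_ids: torch.Tensor) -> torch.Tensor:
        # A real detokenizer would decode these embeddings into pixels;
        # here we only recover the continuous codebook embeddings.
        return self.codebook(token_ids)

tokenizer = VisualTokenizer()
detokenizer = VisualDetokenizer(tokenizer.codebook)

patch_feats = torch.randn(196, 32)           # fake 14x14 grid of patch features
token_ids = tokenizer.encode(patch_feats)    # image -> discrete visual tokens
recovered = detokenizer.decode(token_ids)    # discrete tokens -> continuous signal
print(token_ids.shape, recovered.shape)      # torch.Size([196]) torch.Size([196, 32])
```

In the actual models, these discrete visual tokens are interleaved with text tokens and fed to the LLM, which is trained with the same next-token prediction objective for both modalities.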