GroundingGPT is an end-to-end multimodal grounding model that accurately comprehends inputs and possesses robust grounding capabilities across multiple modalities, including images, audio, and video. To address the issue of limited data, we construct a diverse, high-quality multimodal training dataset. This dataset encompasses a rich collection of multimodal data enriched with spatial and temporal information, serving as a valuable resource to foster further advancements in this field. Extensive experimental evaluations validate the effectiveness of GroundingGPT in understanding and grounding tasks across various modalities.
More details are available on our project page.
The overall structure of GroundingGPT. Blue boxes represent video input, while yellow boxes represent image input.
To get started, clone the repository and set up the environment:

```shell
git clone https://github.com/lzw-lzw/GroundingGPT.git
cd GroundingGPT
conda create -n groundinggpt python=3.10 -y
conda activate groundinggpt
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
```
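After installation, a quick sanity check can confirm that the GPU stack is in place. This is a minimal sketch; it assumes PyTorch is installed as a dependency of `requirements.txt`, which is not listed explicitly here:

```python
# Environment sanity check (assumes torch is pulled in by requirements.txt).
import torch
import flash_attn  # installed above with --no-build-isolation

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("FlashAttention:", flash_attn.__version__)
```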
Prepare the checkpoints and data in the following locations:

- GroundingGPT model weights: `./ckpt`
- ImageBind checkpoint: `./ckpt/imagebind`
- Training dataset: `./dataset`
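To catch misplaced files early, a small check such as the following can be run from the repository root (a minimal sketch; it only verifies that the directories referenced above exist):

```python
# Verify the checkpoint/dataset layout referenced above.
from pathlib import Path

for path in ("ckpt", "ckpt/imagebind", "dataset"):
    status = "ok" if Path(path).is_dir() else "missing"
    print(f"{path}: {status}")
```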
Use `GroundingGPT/lego/serve/cli.py` to run inference from the command line:

```shell
python3 lego/serve/cli.py
```
Use `GroundingGPT/lego/serve/gradio_web_server.py` to launch a Gradio web demo:

```shell
python3 lego/serve/gradio_web_server.py
```
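Once started, the demo is served on Gradio's default local address (http://127.0.0.1:7860) unless `gradio_web_server.py` configures a different host or port.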
If you find GroundingGPT useful for your research and applications, please cite it using this BibTeX:
```bibtex
@inproceedings{li2024groundinggpt,
  title={Groundinggpt: Language enhanced multi-modal grounding model},
  author={Li, Zhaowei and Xu, Qi and Zhang, Dong and Song, Hang and Cai, Yiqing and Qi, Qi and Zhou, Ran and Pan, Junting and Li, Zefeng and Tu, Vu and others},
  booktitle={Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages={6657--6678},
  year={2024}
}
```