tencent-ailab / Leopard

The repository for the paper titled "Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks"

LEOPARD: A Vision Language Model for Text-Rich Multi-Image Tasks

This is the repository for Leopard, a multimodal large language model (MLLM) designed specifically for complex vision-language tasks that involve multiple text-rich images. In real-world settings that rely on such images, for example presentation slides, scanned documents, and webpage snapshots, understanding the inter-relationships and logical flow across multiple images is crucial.

The code, data, and model checkpoints will be released in one month. Stay tuned!

Auto-Instruct Illustration

Key Features:

