lxtGH / OMG-Seg

OMG-LLaVA and OMG-Seg codebase [CVPR-24 and NeurIPS-24]

OMG Model Research

Our goal is to solve multiple fundamental visual perception, visual reasoning, and multi-modal large language tasks with a single model that minimizes handcrafted designs and maximizes functionality and performance in one shot.

Short Introduction of OMG-LLaVA, arxiv, Project Page, Introduction by Fahd Mirza


We present OMG-LLaVA, a new and elegant framework combining powerful pixel-level vision understanding with reasoning abilities. It can accept various visual and text prompts for flexible user interaction. Specifically, we use a universal segmentation method as the visual encoder, integrating image information, perception priors, and visual prompts into visual tokens provided to the LLM. The LLM is responsible for understanding the user's text instructions and providing text responses and pixel-level segmentation results based on the visual information.

OMG-LLaVA achieves image-level, object-level, and pixel-level reasoning and understanding in a single model, matching or surpassing the performance of specialized methods on multiple benchmarks. Rather than using an LLM to connect separate specialist models, our work aims at end-to-end training of one encoder, one decoder, and one LLM.
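The token flow described above can be pictured with a small toy sketch. It is not the OMG-LLaVA implementation: `VisualTokens`, `build_llm_input`, the two-dimensional embeddings, and the choice to place visual tokens before the text tokens are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import List

Token = List[float]  # toy embedding vector (hypothetical 2-D size)


@dataclass
class VisualTokens:
    """Hypothetical container for the three visual inputs named in the text."""
    image: List[Token]                                   # image information
    priors: List[Token]                                  # perception priors
    prompts: List[Token] = field(default_factory=list)   # optional visual prompts

    def as_sequence(self) -> List[Token]:
        # Concatenate all visual inputs into one token sequence.
        return self.image + self.priors + self.prompts


def build_llm_input(visual: VisualTokens, text: List[Token]) -> List[Token]:
    # Assumption: visual tokens come first, followed by the text instruction.
    return visual.as_sequence() + text


visual = VisualTokens(
    image=[[0.1, 0.2], [0.3, 0.4]],
    priors=[[0.5, 0.6]],
    prompts=[[0.7, 0.8]],
)
text = [[1.0, 0.0], [0.0, 1.0]]
sequence = build_llm_input(visual, text)
print(len(sequence))  # 4 visual tokens + 2 text tokens
```

In the real model, the LLM consumes such a sequence and produces both a text response and pixel-level segmentation results; that step is omitted here. See OMG_LLaVA_README.md for the actual entry points.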

Short Introduction of OMG-Seg, arxiv, Project Page, Report By viso.ai


We address various segmentation tasks, each traditionally tackled by distinct or partially unified models. We propose OMG-Seg, One Model that is Good enough to efficiently and effectively handle all the Segmentation tasks, including image semantic, instance, and panoptic segmentation, as well as their video counterparts, open-vocabulary settings, prompt-driven interactive segmentation like SAM, and video object segmentation. To our knowledge, this is the first model to handle all these tasks in one model and achieve good enough performance.

We show that OMG-Seg, a transformer-based encoder-decoder architecture with task-specific queries and outputs, can support over ten distinct segmentation tasks and yet significantly reduce computational and parameter overhead across various tasks and datasets. We rigorously evaluate the inter-task influences and correlations during co-training. Both the code and models will be publicly available.
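As a rough illustration of "task-specific queries and outputs" sharing one decoder, consider the toy sketch below. It does not reflect the OMG-Seg code: `shared_decoder`, `TASK_QUERIES`, `TASK_OUTPUTS`, and `run_task` are hypothetical names, and the "decoder" is a trivial stand-in rather than a transformer.

```python
from typing import Callable, Dict, List, Tuple


def shared_decoder(features: List[float], queries: List[str]) -> List[dict]:
    """Toy stand-in for a shared decoder: one prediction slot per query."""
    score = sum(features) / len(features)
    return [{"query": q, "score": score} for q in queries]


# Hypothetical per-task query sets; only the queries differ between tasks.
TASK_QUERIES: Dict[str, List[str]] = {
    "panoptic": ["obj_0", "obj_1", "obj_2"],
    "interactive": ["point_prompt_0"],
}

# Hypothetical per-task output formatting applied to the shared predictions.
TASK_OUTPUTS: Dict[str, Callable[[List[dict]], List[Tuple[str, str]]]] = {
    "panoptic": lambda preds: [(p["query"], "mask+class") for p in preds],
    "interactive": lambda preds: [(p["query"], "mask") for p in preds],
}


def run_task(task: str, features: List[float]) -> List[Tuple[str, str]]:
    # Same decoder for every task; queries and output format are task-specific.
    preds = shared_decoder(features, TASK_QUERIES[task])
    return TASK_OUTPUTS[task](preds)


features = [0.2, 0.4]
print(run_task("panoptic", features))
print(run_task("interactive", features))
```

The point of the design is that adding a task means adding a query set and an output format, not a new model, which is how one set of weights can serve many segmentation tasks.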

A short introduction to OMG-Seg and other SAM-like works, presented at VALSE (in Chinese), can be found here.

News !!

Key Features of OMG-LLaVA

$\color{#2F6EBA}{Bridge\ Image-level\, Object-level\, Pixel-level\, Reasoning\ and\ Understanding\ }$

$\color{#2F6EBA}{The\ First\ OpenSourced\ Universal\ Understanding\ and\ Reasoning\ Codebase}$

Key Features of OMG-Seg

$\color{#2F6EBA}{Universal\ Image\, Video\, Open-Vocabulary\, Segmentation\ Model}$

$\color{#2F6EBA}{Good\ Enough\ Performance}$

$\color{#2F6EBA}{The\ First\ OpenSourced\ Universal\ Segmentation\ Codebase}$

$\color{#2F6EBA}{Easy\ to\ Follow\ for\ Academic\ Labs}$

To-Do Lists

How to use this Codebase

For OMG-Seg, please see the OMG_Seg_README.md

For OMG-LLaVA, please see the OMG_LLaVA_README.md

Citation

If you find our codebases and work useful for your research, please consider citing us:


@inproceedings{OMGLLaVA,
  title={OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding},
  author={Zhang, Tao and Li, Xiangtai and Fei, Hao and Yuan, Haobo and Wu, Shengqiong and Ji, Shunping and Loy, Chen Change and Yan, Shuicheng},
  booktitle={NeurIPS},
  year={2024}
}

@inproceedings{OMGSeg,
  title={OMG-Seg: Is one model good enough for all segmentation?},
  author={Li, Xiangtai and Yuan, Haobo and Li, Wei and Ding, Henghui and Wu, Size and Zhang, Wenwei and Li, Yining and Chen, Kai and Loy, Chen Change},
  booktitle={CVPR},
  year={2024}
}

License

OMG-Seg follows the S-Lab LICENSE.

OMG-LLaVA follows the Apache-2.0 license, out of respect for the LLaVA and XTuner codebases.