# X-Decoder: Generalized Decoding for Pixel, Image, and Language
[Project Page] [Paper] [HuggingFace All-in-One Demo] [HuggingFace Instruct Demo] [Video]
by Xueyan Zou*, Zi-Yi Dou*, Jianwei Yang*, Zhe Gan, Linjie Li, Chunyuan Li, Xiyang Dai, Harkirat Behl, Jianfeng Wang, Lu Yuan, Nanyun Peng, Lijuan Wang, Yong Jae Lee^, Jianfeng Gao^ in CVPR 2023.
## :hot_pepper: Getting Started
We release the following for both SEEM and X-Decoder :exclamation:
- [x] Demo Code
- [x] Model Checkpoint
- [x] Comprehensive User Guide
- [x] Training Code
- [x] Evaluation Code
:point_right: One-Line SEEM Demo with Linux:
git clone git@github.com:UX-Decoder/Segment-Everything-Everywhere-All-At-Once.git && sh assets/scripts/run_demo.sh
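Before launching, it can help to confirm that your Python environment sees a CUDA-capable PyTorch build. A minimal check (this assumes PyTorch is already installed and is not part of the repo's own scripts):

```python
# Minimal environment check before launching the demo
# (assumes PyTorch is installed; not part of the repo's scripts).
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```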
:round_pushpin: [New] Getting Started:
:round_pushpin: [New] Latest Checkpoints and Numbers:

| Method | Checkpoint | Backbone | COCO PQ ↑ | COCO mAP ↑ | COCO mIoU ↑ | Ref-COCOg cIoU ↑ | Ref-COCOg mIoU ↑ | Ref-COCOg AP50 ↑ | VOC NoC85 ↓ | VOC NoC90 ↓ | SBD NoC85 ↓ | SBD NoC90 ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| X-Decoder | ckpt | Focal-T | 50.8 | 39.5 | 62.4 | 57.6 | 63.2 | 71.6 | - | - | - | - |
| X-Decoder-oq201 | ckpt | Focal-L | 56.5 | 46.7 | 67.2 | 62.8 | 67.5 | 76.3 | - | - | - | - |
| SEEM_v0 | ckpt | Focal-T | 50.6 | 39.4 | 60.9 | 58.5 | 63.5 | 71.6 | 3.54 | 4.59 | * | * |
| SEEM_v0 | - | Davit-d3 | 56.2 | 46.8 | 65.3 | 63.2 | 68.3 | 76.6 | 2.99 | 3.89 | 5.93 | 9.23 |
| SEEM_v0 | ckpt | Focal-L | 56.2 | 46.4 | 65.5 | 62.8 | 67.7 | 76.2 | 3.04 | 3.85 | * | * |
| SEEM_v1 | ckpt | Focal-T | 50.8 | 39.4 | 60.7 | 58.5 | 63.7 | 72.0 | 3.19 | 4.13 | * | * |
| SEEM_v1 | ckpt | SAM-ViT-B | 52.0 | 43.5 | 60.2 | 54.1 | 62.2 | 69.3 | 2.53 | 3.23 | * | * |
| SEEM_v1 | ckpt | SAM-ViT-L | 49.0 | 41.6 | 58.2 | 53.8 | 62.2 | 69.5 | 2.40 | 2.96 | * | * |
SEEM_v0: supports training and inference with a single interactive object
SEEM_v1: supports training and inference with multiple interactive objects
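To get a feel for a released checkpoint before wiring it into the full pipeline, here is a minimal inspection sketch. The URL and filename below are placeholders (substitute a `ckpt` link from the table above), and the state-dict layout is an assumption, not a guarantee:

```python
# Minimal sketch: download and inspect a released checkpoint.
# The URL/filename are placeholders -- use a "ckpt" link from the
# table above; the state-dict layout is an assumption.
import torch
from urllib.request import urlretrieve

url = "https://example.com/seem_focalt_v1.pt"  # hypothetical placeholder
path = "seem_focalt_v1.pt"
urlretrieve(url, path)

state = torch.load(path, map_location="cpu")
# Releases are often either a flat state dict or nested under a key
# such as "model"; print the top-level keys to find out which.
print(list(state.keys())[:10])
```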
## :fire: News
- [2023.10.04] We are excited to release :white_check_mark: training/evaluation/demo code, :white_check_mark: new checkpoints, and :white_check_mark: comprehensive readmes for both X-Decoder and SEEM!
- [2023.09.24] We provide new demo commands/code for inference (DEMO.md)!
- [2023.07.19] :roller_coaster: We are excited to release the X-Decoder training code (INSTALL.md, DATASET.md, TRAIN.md, EVALUATION.md)!
- [2023.07.10] We release Semantic-SAM, a universal image segmentation model that enables segmenting and recognizing anything at any desired granularity. Code and checkpoints are available!
- [2023.04.14] We are releasing SEEM, a new universal interactive interface for image segmentation! You can use it for any segmentation task, well beyond what X-Decoder can do!
- [2023.03.20] Inspired by X-Decoder, we developed OpenSeeD ([Paper][Code]) to enable open-vocabulary segmentation and detection with a single model. Check it out!
- [2023.03.14] We release X-GPT, a conversational version of X-Decoder built on GPT-3 via LangChain!
- [2023.03.01] The Segmentation in the Wild Challenge has been launched and is ready for result submission!
- [2023.02.28] We released the SGinW benchmark for our challenge. Welcome to build your own models on the benchmark!
- [2023.02.27] Our X-Decoder has been accepted to CVPR 2023!
- [2023.02.07] We combine X-Decoder (strong image understanding), GPT-3 (strong language understanding), and Stable Diffusion (strong image generation) into an instructional image editing demo. Check it out!
- [2022.12.21] We release the inference code of X-Decoder.
- [2022.12.21] We release the Focal-T pretrained checkpoint.
- [2022.12.21] We release the open-vocabulary segmentation benchmark.
## :paintbrush: DEMO
:blueberries: [X-GPT] :strawberry: [Instruct X-Decoder]
![demo](https://user-images.githubusercontent.com/11957155/225728214-0523bd30-31f7-472d-be7e-12a049c25cbd.gif)
## :notes: Introduction
![github_figure](https://user-images.githubusercontent.com/11957155/210801832-c9143c42-ef65-4501-95a5-0d54749dcc52.gif)
X-Decoder is a generalized decoding model that generates pixel-level segmentation and token-level text seamlessly!
It achieves:
- State-of-the-art results on open-vocabulary segmentation and referring segmentation on eight datasets;
- Better or competitive finetuned performance compared to generalist and specialist models on segmentation and VL tasks;
- Friendly for efficient finetuning and flexible for novel task composition.
It supports:
- One suite of parameters pretrained for Semantic/Instance/Panoptic Segmentation, Referring Segmentation, Image Captioning, and Image-Text Retrieval;
- One model architecture finetuned for Semantic/Instance/Panoptic Segmentation, Referring Segmentation, Image Captioning, Image-Text Retrieval and Visual Question Answering (with an extra cls head);
- Zero-shot task composition for Region Retrieval, Referring Captioning, and Image Editing (a minimal region-retrieval sketch follows below).
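As one illustration of task composition, region retrieval can be framed as matching per-mask embeddings from the decoder against a text embedding in the shared semantic space. A minimal sketch with dummy tensors (the function and tensor names here are hypothetical, not the repo's actual API):

```python
# Minimal sketch of zero-shot region retrieval: pick the mask whose
# embedding best matches a text query in a shared embedding space.
# All names are hypothetical; real decoder outputs would replace the
# dummy tensors below.
import torch
import torch.nn.functional as F

def retrieve_region(mask_embeddings: torch.Tensor,  # (num_masks, dim)
                    text_embedding: torch.Tensor    # (dim,)
                    ) -> int:
    """Return the index of the mask whose embedding best matches the text."""
    masks = F.normalize(mask_embeddings, dim=-1)
    text = F.normalize(text_embedding, dim=-1)
    scores = masks @ text  # cosine similarity per mask
    return int(scores.argmax())

# Usage with dummy tensors standing in for real decoder outputs:
best = retrieve_region(torch.randn(100, 512), torch.randn(512))
print("best-matching mask index:", best)
```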
## Acknowledgement
- We appreciate the constructive discussion with Haotian Zhang
- We build our work on top of Mask2Former
- We build our demos on HuggingFace :hugs: with sponsored GPUs
- We appreciate the discussion with Xiaoyu Xiang during rebuttal
## Citation
@article{zou2022xdecoder,
  author    = {Zou*, Xueyan and Dou*, Zi-Yi and Yang*, Jianwei and Gan, Zhe and Li, Linjie and Li, Chunyuan and Dai, Xiyang and Wang, Jianfeng and Yuan, Lu and Peng, Nanyun and Wang, Lijuan and Lee*, Yong Jae and Gao*, Jianfeng},
  title     = {Generalized Decoding for Pixel, Image and Language},
  publisher = {arXiv},
  year      = {2022},
}