
ViTMatte🐒

Boosting Image Matting with Pretrained Plain Vision Transformers

[Jingfeng Yao](https://github.com/JingfengYao)1, [Xinggang Wang](https://scholar.google.com/citations?user=qNCTLV0AAAAJ&hl=zh-CN)1 📧, [Shusheng Yang](https://github.com/vealocia)1, [Baoyuan Wang](https://sites.google.com/site/zjuwby/)2 1 School of EIC, HUST, 2 Xiaobing.AI (📧) corresponding author. [![arxiv paper](https://img.shields.io/badge/arxiv-paper-orange)](https://arxiv.org/abs/2305.15272) [![Colab Demo](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1Dc2qoJueNZQyrTU19sIcrPyRDmvuMTF3?usp=sharing) [![Static Badge](https://img.shields.io/badge/bilibili-tutorial-pink)](https://www.bilibili.com/video/BV191421q7PX/?spm_id_from=333.337.search-card.all.click&vd_source=d77720fde1697e9f7510096fea727a91) [![license](https://img.shields.io/badge/license-MIT-blue)](LICENSE) [![authors](https://img.shields.io/badge/by-hustvl-green)](https://github.com/hustvl) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vitmatte-boosting-image-matting-with/image-matting-on-composition-1k-1)](https://paperswithcode.com/sota/image-matting-on-composition-1k-1?p=vitmatte-boosting-image-matting-with) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vitmatte-boosting-image-matting-with/image-matting-on-distinctions-646)](https://paperswithcode.com/sota/image-matting-on-distinctions-646?p=vitmatte-boosting-image-matting-with)


News

Introduction

Plain vision Transformers can also do image matting with the simple ViTMatte framework!


Recently, plain vision Transformers (ViTs) have shown impressive performance on various computer vision tasks, thanks to their strong modeling capacity and large-scale pretraining. However, they have not yet conquered the problem of image matting. We hypothesize that image matting can also be boosted by ViTs and present ViTMatte, an efficient and robust ViT-based matting system. Our method utilizes (i) a hybrid attention mechanism combined with a convolution neck to help ViTs achieve an excellent performance-computation trade-off in matting tasks, and (ii) a detail capture module consisting only of simple, lightweight convolutions to complement the detailed information required by matting. To the best of our knowledge, ViTMatte is the first work to unleash the potential of ViTs on image matting with concise adaptation. It inherits many superior properties of ViTs, including various pretraining strategies, a concise architecture design, and flexible inference strategies. We evaluate ViTMatte on Composition-1k and Distinctions-646, the most commonly used benchmarks for image matting, where our method achieves state-of-the-art performance and outperforms prior matting works by a large margin.
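
As a rough illustration of the detail capture idea described above, here is a minimal PyTorch sketch: a plain ViT backbone (not shown) provides a stride-16 feature map, while a lightweight convolution stream over the 4-channel image + trimap input supplies the fine detail that is fused back in during upsampling. All module names, channel widths, and layer counts below are illustrative assumptions and do not reproduce the repository's actual (detectron2-based) model code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DetailCaptureSketch(nn.Module):
    """Lightweight convolution stream that extracts high-resolution detail
    features from the 4-channel (RGB + trimap) input and fuses them with the
    stride-16 feature map of a plain ViT backbone."""

    def __init__(self, vit_dim=384):
        super().__init__()
        # Strided convs give detail features at 1/2, 1/4, and 1/8 resolution.
        self.convs = nn.ModuleList([
            nn.Conv2d(4, 32, 3, stride=2, padding=1),
            nn.Conv2d(32, 64, 3, stride=2, padding=1),
            nn.Conv2d(64, 128, 3, stride=2, padding=1),
        ])
        # Fusion convs: mix the upsampled coarse feature with the matching detail map.
        self.fuse = nn.ModuleList([
            nn.Conv2d(vit_dim + 128, 128, 3, padding=1),
            nn.Conv2d(128 + 64, 64, 3, padding=1),
            nn.Conv2d(64 + 32, 32, 3, padding=1),
        ])
        self.head = nn.Conv2d(32, 1, 3, padding=1)  # predicts the alpha matte

    def forward(self, x, vit_feat):
        # x: (B, 4, H, W) image + trimap; vit_feat: (B, vit_dim, H/16, W/16).
        details, d = [], x
        for conv in self.convs:
            d = torch.relu(conv(d))
            details.append(d)                      # 1/2, 1/4, 1/8 resolution
        f = vit_feat
        for fuse, d in zip(self.fuse, reversed(details)):
            f = F.interpolate(f, scale_factor=2, mode="bilinear", align_corners=False)
            f = torch.relu(fuse(torch.cat([f, d], dim=1)))
        f = F.interpolate(f, scale_factor=2, mode="bilinear", align_corners=False)
        return torch.sigmoid(self.head(f))         # (B, 1, H, W) alpha in [0, 1]
```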

Get Started

Demo

You can matte the demo image with its corresponding trimap by running:

python run_one_image.py \
    --model vitmatte-s \
    --checkpoint-dir path/to/checkpoint

The results will be saved in ./demo. You can also try your own image and trimap with the same script.

You can also try ViTMatte in the Colab Demo, a simple demo that shows what ViTMatte can do.
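
If you prefer to call the model from Python rather than the command line, the sketch below shows one way to prepare the RGB image and trimap as tensors for trimap-based matting. The file paths and the load_vitmatte helper are hypothetical placeholders; run_one_image.py remains the reference for the repository's actual loading and inference code.

```python
import numpy as np
import torch
from PIL import Image


def prepare_inputs(image_path, trimap_path, device="cpu"):
    # RGB image scaled to [0, 1], shape (1, 3, H, W).
    image = torch.from_numpy(
        np.asarray(Image.open(image_path).convert("RGB"), dtype=np.float32) / 255.0
    ).permute(2, 0, 1).unsqueeze(0)
    # Single-channel trimap scaled to [0, 1], shape (1, 1, H, W):
    # 0 = background, 1 = foreground, intermediate = unknown region to estimate.
    trimap = torch.from_numpy(
        np.asarray(Image.open(trimap_path).convert("L"), dtype=np.float32) / 255.0
    ).unsqueeze(0).unsqueeze(0)
    return {"image": image.to(device), "trimap": trimap.to(device)}


# Replace the paths with your own image/trimap pair.
inputs = prepare_inputs("demo/your_image.png", "demo/your_trimap.png")
# model = load_vitmatte("vitmatte-s", "path/to/checkpoint")  # hypothetical helper
# with torch.no_grad():
#     alpha = model(inputs)  # predicted alpha matte, shape (1, 1, H, W)
```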

Results

Quantitative Results on Composition-1k

| Model      | SAD   | MSE | Grad | Conn  | Checkpoints |
|------------|-------|-----|------|-------|-------------|
| ViTMatte-S | 21.46 | 3.3 | 7.24 | 16.21 | GoogleDrive |
| ViTMatte-B | 20.33 | 3.0 | 6.74 | 14.78 | GoogleDrive |

Quantitative Results on Distinctions-646

| Model      | SAD   | MSE | Grad | Conn  | Checkpoints |
|------------|-------|-----|------|-------|-------------|
| ViTMatte-S | 21.22 | 2.1 | 8.78 | 17.55 | GoogleDrive |
| ViTMatte-B | 17.05 | 1.5 | 7.03 | 12.95 | GoogleDrive |
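
For reference, SAD and MSE in these tables follow the usual matting convention of being computed over the unknown region of the trimap, with SAD reported in thousands and MSE scaled by 10³; Grad and Conn are the gradient and connectivity errors of Rhemann et al. The sketch below restates that convention for the two simpler metrics and is not the repository's evaluation script.

```python
import numpy as np


def sad_mse(pred_alpha, gt_alpha, trimap):
    """pred_alpha, gt_alpha: float arrays in [0, 1] with the same shape;
    trimap: uint8 array where 128 marks the unknown region
    (0 = background, 255 = foreground)."""
    unknown = trimap == 128
    diff = (pred_alpha - gt_alpha)[unknown]
    sad = np.abs(diff).sum() / 1000.0   # sum of absolute differences, in thousands
    mse = (diff ** 2).mean() * 1000.0   # mean squared error, scaled by 1e3
    return sad, mse
```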

Citation

@article{yao2024vitmatte,
  title={ViTMatte: Boosting image matting with pre-trained plain vision transformers},
  author={Yao, Jingfeng and Wang, Xinggang and Yang, Shusheng and Wang, Baoyuan},
  journal={Information Fusion},
  volume={103},
  pages={102091},
  year={2024},
  publisher={Elsevier}
}