This work is an enhanced version of our NeurIPS paper MAFT.
Pre-trained vision-language models, e.g., CLIP, have been increasingly used to address the challenging Open-Vocabulary Segmentation (OVS) task, benefiting from their well-aligned vision-text embedding space. Typical solutions either freeze CLIP during training to unilaterally maintain its zero-shot capability, or fine-tune the CLIP vision encoder to achieve perceptual sensitivity to local regions. However, few of them incorporate vision-text collaborative optimization. Motivated by this, we propose Content-Dependent Transfer, which adaptively enhances each text embedding by interacting with the input image and offers a parameter-efficient way to optimize the text representation. In addition, we introduce a Representation Compensation strategy that reviews the original CLIP-V representation as compensation to maintain CLIP's zero-shot capability. In this way, the vision and text representations of CLIP are optimized collaboratively, enhancing the alignment of the vision-text feature space. To the best of our knowledge, we are the first to establish a collaborative vision-text optimization mechanism within the OVS field. Extensive experiments demonstrate that our method achieves superior performance on popular OVS benchmarks. In open-vocabulary semantic segmentation, our method outperforms the previous state-of-the-art approaches by +0.5, +2.3, +3.4, +0.4 and +1.1 mIoU on A-847, A-150, PC-459, PC-59 and PAS-20, respectively. Furthermore, in the panoptic setting on ADE20K, we achieve 27.1 PQ, 73.5 SQ, and 32.9 RQ.
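For intuition, the sketch below is a minimal, simplified PyTorch rendering of the two ideas: a Content-Dependent Transfer that refines text embeddings via cross-attention to image features, and a Representation Compensation that blends fine-tuned and frozen CLIP-V features. The module names, tensor shapes, single-layer attention design, and the weighted-sum blending are illustrative assumptions, not the exact implementation in this repo.

```python
# Conceptual sketch only; shapes and module design are illustrative assumptions.
import torch
import torch.nn as nn


class ContentDependentTransfer(nn.Module):
    """Adapts each text embedding by cross-attending to image features."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_emb: torch.Tensor, img_feat: torch.Tensor) -> torch.Tensor:
        # text_emb: (B, num_classes, C) CLIP text embeddings
        # img_feat: (B, HW, C) flattened CLIP image features
        bias, _ = self.cross_attn(query=text_emb, key=img_feat, value=img_feat)
        # residual update keeps the original text embedding dominant
        return self.norm(text_emb + bias)


def representation_compensation(finetuned_feat: torch.Tensor,
                                frozen_feat: torch.Tensor,
                                alpha: float = 0.5) -> torch.Tensor:
    """Blends fine-tuned CLIP-V features with the frozen (original) ones to
    retain zero-shot capability; a weighted sum is one simple realization."""
    return alpha * finetuned_feat + (1.0 - alpha) * frozen_feat


if __name__ == "__main__":
    B, K, HW, C = 2, 150, 32 * 32, 512
    text_emb = torch.randn(B, K, C)
    img_feat = torch.randn(B, HW, C)
    cdt = ContentDependentTransfer(dim=C)
    print(cdt(text_emb, img_feat).shape)  # torch.Size([2, 150, 512])
```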
git clone https://github.com/jiaosiyu1999/MAFT_Plus.git
cd MAFT_Plus
bash install.sh
cd maft/modeling/pixel_decoder/ops
sh make.sh
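After installation, a quick sanity check such as the one below can confirm that PyTorch, detectron2, and the compiled deformable-attention op are importable. The MSDeformAttn import path is an assumption based on the Mask2Former-style layout under maft/modeling/pixel_decoder/ops.

```python
# Sanity check; the MSDeformAttn module path is assumed, not confirmed by this README.
import torch
import detectron2

print("torch:", torch.__version__, "| cuda available:", torch.cuda.is_available())
print("detectron2:", detectron2.__version__)

try:
    # compiled by make.sh; path follows the Mask2Former-style ops layout (assumption)
    from maft.modeling.pixel_decoder.ops.modules import MSDeformAttn
    print("MSDeformAttn op imported successfully")
except ImportError as e:
    print("Deformable attention op not built correctly:", e)
```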
See MAFT for reference (Preparing Datasets for MAFT). The data should be organized as follows:
datasets/
  ade/
    ADEChallengeData2016/
      images/
      annotations_detectron2/
    ADE20K_2021_17_01/
      images/
      annotations_detectron2/
  coco/
    train2017/
    val2017/
    stuffthingmaps_detectron2/
  VOCdevkit/
    VOC2012/
      images_detectron2/
      annotations_ovs/
    VOC2010/
      images/
      annotations_detectron2_ovs/
        pc59_val/
        pc459_val/
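A small optional helper (not part of the repo) can verify this layout before training; the paths below simply mirror the tree above.

```python
# Check that every expected dataset folder from the layout above exists.
from pathlib import Path

EXPECTED = [
    "ade/ADEChallengeData2016/images",
    "ade/ADEChallengeData2016/annotations_detectron2",
    "ade/ADE20K_2021_17_01/images",
    "ade/ADE20K_2021_17_01/annotations_detectron2",
    "coco/train2017",
    "coco/val2017",
    "coco/stuffthingmaps_detectron2",
    "VOCdevkit/VOC2012/images_detectron2",
    "VOCdevkit/VOC2012/annotations_ovs",
    "VOCdevkit/VOC2010/images",
    "VOCdevkit/VOC2010/annotations_detectron2_ovs/pc59_val",
    "VOCdevkit/VOC2010/annotations_detectron2_ovs/pc459_val",
]

root = Path("datasets")
missing = [p for p in EXPECTED if not (root / p).is_dir()]
print("all dataset folders found" if not missing else f"missing: {missing}")
```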
Model | A-847 | A-150 | PC-459 | PC-59 | PAS-20 | Weights |
---|---|---|---|---|---|---|
MAFTP-Base | 13.8 | 34.6 | 16.2 | 57.5 | 95.4 | maftp_b.pth |
MAFTP-Large | 15.1 | 36.1 | 21.6 | 59.4 | 96.5 | maftp_l.pth |
Model | PQ | SQ | RQ | Weights |
---|---|---|---|---|
MAFTP-Large | 27.1 | 73.5 | 32.9 | maftp_l_pano.pth |
Evaluate a trained model on the validation sets of all datasets:
python train_net.py --eval-only --config-file <CONFIG_FILE> --num-gpus <NUM_GPU> OUTPUT_DIR <OUTPUT_PATH> MODEL.WEIGHTS <TRAINED_MODEL_PATH>
For example, evaluate our pre-trained maftp_l.pth model:
# 1. Download MAFTP-Large.
# 2. put it at `out/semantic/MAFT_Plus/maftp_l.pth`.
# 3. evaluation
python train_net.py --config-file configs/semantic/eval.yaml --num-gpus 8 --eval-only \
  MODEL.WEIGHTS out/semantic/MAFT_Plus/maftp_l.pth
End-to-end training requires 8×A100 GPUs and approximately 14 hours:
# MAFT-Plus-Large (maftp-l)
python train_net.py --config-file configs/semantic/train_semantic_large.yaml --num-gpus 8
# MAFT-Plus-Base (maftp-b)
python train_net.py --config-file configs/semantic/train_semantic_base.yaml --num-gpus 8
We provide demo/demo.py, which runs inference with the built-in configs. Run it with:
python demo/demo.py \
  --input input1.jpg input2.jpg \
  [--other-options]
  --opts MODEL.WEIGHTS /path/to/checkpoint_file
For example, run the demo with our pre-trained maftp_l.pth model:
# 1. Download MAFTP-Large.
# 2. put it at `out/semantic/MAFT_Plus/maftp_l.pth`.
# 3. run demo:
python demo/demo.py --input im.png
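For programmatic use, inference can also be run through detectron2's DefaultPredictor. The sketch below is only an outline under assumptions: the add_maskformer2_config / add_fcclip_config helpers are guessed names (check demo/demo.py for the registration this repo actually uses), and the "sem_seg" output key assumes the semantic configuration.

```python
# Sketch only: the maft config helpers are assumed names; see demo/demo.py for
# the exact setup used by this repo.
import cv2
import torch
from detectron2.config import get_cfg
from detectron2.projects.deeplab import add_deeplab_config
from detectron2.engine import DefaultPredictor

from maft import add_maskformer2_config, add_fcclip_config  # assumed exports

cfg = get_cfg()
add_deeplab_config(cfg)
add_maskformer2_config(cfg)  # assumption
add_fcclip_config(cfg)       # assumption
cfg.merge_from_file("configs/semantic/eval.yaml")
cfg.MODEL.WEIGHTS = "out/semantic/MAFT_Plus/maftp_l.pth"
cfg.freeze()

predictor = DefaultPredictor(cfg)
image = cv2.imread("im.png")                # BGR image, as detectron2 expects
outputs = predictor(image)
sem_seg = outputs["sem_seg"].argmax(dim=0)  # per-pixel class indices (assumed key)
print(sem_seg.shape, torch.unique(sem_seg))
```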
If this codebase is useful to you, please consider citing:
@inproceedings{jiao2024collaborative,
title={Collaborative Vision-Text Representation Optimizing for Open-Vocabulary Segmentation},
author={Jiao, Siyu and Zhu, Hongguang and Huang, Jiannan and Zhao, Yao and Wei, Yunchao and Shi, Humphrey},
booktitle={European Conference on Computer Vision},
year={2024},
}