CLIPer: Hierarchically Improving Spatial Representation of CLIP for Open-Vocabulary Semantic Segmentation

This repo is the official PyTorch implementation of CLIPer.

:fire: News

Introduction

For further details, please check out our paper.

Installation

Run the commands below to create the environment:

conda create -n CLIPer python=3.9
conda activate CLIPer
pip install -r requirement.txt
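
To sanity-check the environment (assuming requirement.txt installs PyTorch; this one-liner is our own check, not part of the repo):

python -c "import torch; print(torch.__version__, 'CUDA available:', torch.cuda.is_available())"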

Data Preparation

Please structure the datasets as follows:

datasets
├── ADEchallengeData2016
│   ├── images
│   │   ├── training
│   │   │   ├── ADE_train_00000001.jpg
│   │   ├── validation
│   │   │   ├── ADE_val_00000001.jpg
│   ├── annotations
│   │   ├── training
│   │   │   ├── ADE_train_00000001.png
│   │   ├── validation
│   │   │   ├── ADE_val_00000001.png
├── coco2014
│   ├── train2014
│   │   ├── COCO_train2014_000000000009.jpg
│   ├── val2014
│   │   ├── COCO_val2014_000000000042.jpg
│   ├── coco_seg_anno
│   │   ├── 000000000009.png
├── coco2017
│   ├── train2017
│   │   ├── 000000000009.jpg
│   ├── val2017
│   │   ├── 000000000139.jpg
│   ├── stuff_anno164
│   │   ├── train2017
│   │   │   ├── 000000000009.png
│   │   ├── val2017
│   │   │   ├── 000000000139.png
├── VOCdevkit
│   ├── VOC2010
│   │   ├── JPEGImages
│   │   │   ├── 2007_000027.jpg
│   │   ├── SegmentationClassContext
│   │   ├── 2008_000002.png
│   ├── VOC2012
│   │   ├── JPEGImages
│   │   │   ├── 2007_000027.jpg
│   │   ├── SegmentationClassAug
│   │   ├── 2007_000032.png
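
Before running evaluation, you can verify the layout with a small helper script. This is a minimal sketch of our own (not part of the repo); it only checks the folders shown in the tree above:

from pathlib import Path

ROOT = Path("datasets")  # adjust if your datasets live elsewhere
EXPECTED = [
    "ADEchallengeData2016/images/validation",
    "ADEchallengeData2016/annotations/validation",
    "coco2014/val2014",
    "coco2014/coco_seg_anno",
    "coco2017/val2017",
    "coco2017/stuff_anno164/val2017",
    "VOCdevkit/VOC2010/JPEGImages",
    "VOCdevkit/VOC2010/SegmentationClassContext",
    "VOCdevkit/VOC2012/JPEGImages",
    "VOCdevkit/VOC2012/SegmentationClassAug",
]
for rel in EXPECTED:
    status = "ok" if (ROOT / rel).is_dir() else "MISSING"
    print(f"{status:8s} {rel}")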

Evaluation

To evaluate CLIPer, enter the scripts folder and run one of the following:

# select the config file for the dataset you want to evaluate
# evaluate voc dataset with background
sh sh_ovs.sh ../scripts/config/vit-l-14/ovs_voc21.yaml
# evaluate voc dataset without background
sh sh_ovs.sh ../scripts/config/vit-l-14/ovs_voc20.yaml
# ...
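
Each YAML file under scripts/config pairs a backbone (e.g., vit-l-14) with a dataset setting. If you want to inspect a config before launching, here is a minimal sketch assuming the files are plain YAML (the printed keys depend on the repo's actual config schema):

import yaml

# Illustrative only: the path is taken from the evaluation commands above.
with open("../scripts/config/vit-l-14/ovs_voc21.yaml") as f:
    cfg = yaml.safe_load(f)
print(cfg)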

Results

Running the code in this repo, you should obtain results similar to those in the table below (numbers reported in the paper are shown in parentheses):

| Encoder  | VOC         | Context     | Object      | VOC20       | Context59   | Stuff       | ADE         |
|----------|-------------|-------------|-------------|-------------|-------------|-------------|-------------|
| ViT-B/16 | 65.9 (65.9) | 37.6 (37.6) | 39.3 (39.0) | 85.4 (85.2) | 41.7 (41.7) | 27.5 (27.5) | 21.4 (21.4) |
| ViT-L/14 | 70.2 (69.8) | 38.2 (38.0) | 43.5 (43.3) | 90.0 (90.0) | 43.6 (43.6) | 29.2 (28.7) | 24.4 (24.4) |
| ViT-H/14 | 71.0        | 39.7        | 43.2        | 88.9        | 44.3        | 30.7        | 27.5        |
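
Assuming these scores are mIoU (%), as is standard for these benchmarks, here is a minimal NumPy sketch of how mIoU is computed from flat prediction and ground-truth label arrays (our illustration, not the repo's evaluation code):

import numpy as np

def mean_iou(pred, gt, num_classes, ignore_index=255):
    # Drop pixels marked as ignore (e.g., boundary pixels in VOC).
    mask = gt != ignore_index
    pred, gt = pred[mask], gt[mask]
    # Confusion matrix: rows = ground truth, columns = prediction.
    cm = np.bincount(gt * num_classes + pred,
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    intersection = np.diag(cm)
    union = cm.sum(0) + cm.sum(1) - intersection
    iou = intersection / np.maximum(union, 1)  # guard against empty classes
    return iou.mean()

pred = np.array([0, 1, 1, 2])
gt = np.array([0, 1, 2, 2])
print(mean_iou(pred, gt, num_classes=3))  # ~0.667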

Visualization

Citation

@misc{Sun_2024_CLIPer,
      title={CLIPer: Hierarchically Improving Spatial Representation of CLIP for Open-Vocabulary Semantic Segmentation}, 
      author={Lin Sun and Jiale Cao and Jin Xie and Xiaoheng Jiang and Yanwei Pang},
      year={2024},
      eprint={2411.13836},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.13836}, 
}