CLIP is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation (CVPR 2023)

:tada: :tada: :tada: News

2023/12/09 Our new paper TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary Multi-Label Classification of CLIP Without Training has been accepted by AAAI 2024. It generates image-level labels from a frozen CLIP model and, when combined with CLIP-ES, enables annotation-free semantic segmentation without any training.
2023/2/28 Our paper has been accepted by CVPR 2023.

Requirements

# create conda env
conda create -n clip-es python=3.9
conda activate clip-es

# install packages
pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 -f https://download.pytorch.org/whl/torch_stable.html
pip install opencv-python ftfy regex tqdm ttach tensorboard lxml cython

# install pydensecrf from source
git clone https://github.com/lucasb-eyer/pydensecrf
cd pydensecrf
python setup.py install

Preparing Datasets

PASCAL VOC2012

Download the PASCAL VOC2012 images here and the train_aug ground truth here. The directory /your_home_dir/datasets/VOC2012 should be organized as follows:

---VOC2012/
       --Annotations
       --ImageSets
       --JPEGImages
       --SegmentationClass
       --SegmentationClassAug

MS COCO2014

Download the MS COCO images from the official website and the semantic segmentation annotations for MS COCO here. We suggest organizing /your_home_dir/datasets/COCO2014 as follows:

---COCO2014/
       --Annotations
       --JPEGImages
           -train2014
           -val2014
       --SegmentationClass
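
Before running the scripts, it can help to confirm that both dataset folders match the layouts above. The following is a minimal sketch, not part of the repository: the function name `missing_dirs` and the `EXPECTED_LAYOUT` table are hypothetical, and the sub-directory names are simply copied from the two trees shown in this section.

```python
from pathlib import Path

# Expected sub-directories, copied from the layouts described above.
EXPECTED_LAYOUT = {
    "VOC2012": [
        "Annotations",
        "ImageSets",
        "JPEGImages",
        "SegmentationClass",
        "SegmentationClassAug",
    ],
    "COCO2014": [
        "Annotations",
        "JPEGImages/train2014",
        "JPEGImages/val2014",
        "SegmentationClass",
    ],
}


def missing_dirs(datasets_root, dataset):
    """Return the expected sub-directories of `dataset` that do not exist."""
    base = Path(datasets_root) / dataset
    return [sub for sub in EXPECTED_LAYOUT[dataset] if not (base / sub).is_dir()]


if __name__ == "__main__":
    root = "/your_home_dir/datasets"  # placeholder path used throughout this README
    for name in EXPECTED_LAYOUT:
        missing = missing_dirs(root, name)
        print(name, "OK" if not missing else f"missing: {missing}")
```

An empty list from `missing_dirs` means every expected folder is present; it does not check that the folders actually contain the right files.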

Preparing pre-trained model

Download the pre-trained CLIP model [ViT-B/16] here and put it in /your_home_dir/pretrained_models/clip.

Usage

Step 1. Generate CAMs for the train (train_aug) set.

# For VOC12
CUDA_VISIBLE_DEVICES=0 python generate_cams_voc12.py --img_root /your_home_dir/datasets/VOC2012/JPEGImages --split_file ./voc12/train_aug.txt --model /your_home_dir/pretrained_models/clip/ViT-B-16.pt --num_workers 1 --cam_out_dir ./output/voc12/cams

# For COCO14
CUDA_VISIBLE_DEVICES=0 python generate_cams_coco14.py --img_root /your_home_dir/datasets/COCO2014/JPEGImages/train2014 --split_file ./coco14/train.txt --model /your_home_dir/pretrained_models/clip/ViT-B-16.pt --num_workers 1 --cam_out_dir ./output/coco14/cams

Step 2. Evaluate generated CAMs and use CRF to postprocess

# (optional) evaluate generated CAMs
## for VOC12
python eval_cam.py --cam_out_dir ./output/voc12/cams --cam_type attn_highres --gt_root /your_home_dir/datasets/VOC2012/SegmentationClassAug --split_file ./voc12/train.txt
## for COCO14
python eval_cam.py --cam_out_dir ./output/coco14/cams --cam_type attn_highres --gt_root /your_home_dir/datasets/COCO2014/SegmentationClass --split_file ./coco14/train.txt

# use CRF post-processing to generate pseudo masks
# (realizes the confidence-guided loss by setting low-confidence pixels to 255)
## for VOC12 
python eval_cam_with_crf.py --cam_out_dir ./output/voc12/cams --gt_root /your_home_dir/datasets/VOC2012/SegmentationClassAug --image_root /your_home_dir/datasets/VOC2012/JPEGImages --split_file ./voc12/train_aug.txt --pseudo_mask_save_path ./output/voc12/pseudo_masks
## for COCO14
python eval_cam_with_crf.py --cam_out_dir ./output/coco14/cams --gt_root /your_home_dir/datasets/COCO2014/SegmentationClass --image_root /your_home_dir/datasets/COCO2014/JPEGImages/train2014 --split_file ./coco14/train.txt --pseudo_mask_save_path ./output/coco2014/pseudo_masks

# eval CRF processed pseudo masks
## for VOC12 
python eval_cam_with_crf.py --cam_out_dir ./output/voc12/cams --gt_root /your_home_dir/datasets/VOC2012/SegmentationClassAug --image_root /your_home_dir/datasets/VOC2012/JPEGImages --split_file ./voc12/train_aug.txt --eval_only
## for COCO14
python eval_cam_with_crf.py --cam_out_dir ./output/coco14/cams --gt_root /your_home_dir/datasets/COCO2014/SegmentationClass --image_root /your_home_dir/datasets/COCO2014/JPEGImages/train2014 --split_file ./coco14/train.txt --eval_only

The generated pseudo masks of VOC12 and COCO14 can be found at Google Drive.
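
The idea of marking low-confidence pixels with 255, mentioned in the CRF comments above, can be illustrated with a short NumPy sketch. This is an assumption-laden illustration rather than the code in eval_cam_with_crf.py: the function name `confidence_guided_mask`, the `(C, H, W)` input layout, and the `threshold` value of 0.95 are all hypothetical.

```python
import numpy as np

IGNORE_LABEL = 255  # value named in this README for low-confidence pixels


def confidence_guided_mask(probs, class_ids, threshold=0.95):
    """Build a pseudo mask from per-class probabilities.

    probs:     float array of shape (C, H, W), one probability map per candidate class
    class_ids: length-C sequence mapping each channel to its label id
    threshold: hypothetical confidence cut-off, not taken from the repository
    """
    probs = np.asarray(probs, dtype=np.float32)
    class_ids = np.asarray(class_ids, dtype=np.uint8)
    best = probs.argmax(axis=0)      # most likely channel per pixel
    confidence = probs.max(axis=0)   # probability of that channel
    mask = class_ids[best]           # map channel index to label id (new array)
    mask[confidence < threshold] = IGNORE_LABEL
    return mask


if __name__ == "__main__":
    # Two candidate classes (background = 0, one object class = 15) on a 1x2 image.
    probs = np.array([[[0.99, 0.60]], [[0.01, 0.40]]])
    print(confidence_guided_mask(probs, class_ids=[0, 15]))
```

A segmentation loss configured to ignore label 255 then skips these pixels, which is how the mask can stand in for a confidence-guided loss.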

Step 3. Train Segmentation Model

To train DeepLab-v2, we refer to deeplab-pytorch. The ImageNet pre-trained model can be found in AdvCAM.

Results

The quality (mIoU, %) of the generated pseudo masks on the PASCAL VOC2012 train set.

Method	CAMs	+CRF
CLIP-ES	70.8	75.0
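
Assuming the scores in this section are mean intersection-over-union (mIoU) in percent, the usual metric for these benchmarks, the computation can be sketched as follows. This is a generic illustration, not the logic of eval_cam.py: the function name `mean_iou` is hypothetical, and the sketch assumes predictions only contain class ids in `[0, num_classes)` while ground-truth pixels labelled 255 are ignored.

```python
import numpy as np


def mean_iou(pred, gt, num_classes, ignore_label=255):
    """Mean IoU over classes that appear in the prediction or the ground truth.

    Assumes every value in `pred` lies in [0, num_classes).
    """
    pred = np.asarray(pred).ravel()
    gt = np.asarray(gt).ravel()
    keep = gt != ignore_label  # drop ignored ground-truth pixels
    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    index = gt[keep].astype(np.int64) * num_classes + pred[keep].astype(np.int64)
    conf = np.bincount(index, minlength=num_classes ** 2)
    conf = conf.reshape(num_classes, num_classes)
    intersection = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - intersection
    present = union > 0  # skip classes absent from both arrays
    return float((intersection[present] / union[present]).mean())
```

For a dataset-level score, the confusion matrix is normally accumulated over all images before the IoU is taken; with this sketch that corresponds to passing the concatenated predictions and labels in a single call.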

Segmentation results (mIoU, %) on the PASCAL VOC2012 val and test sets.

Method	Network	Pretrained	val	test
CLIP-ES	DeepLabV2	ImageNet	71.1	71.4
CLIP-ES	DeepLabV2	COCO	73.8	73.9

Segmentation results (mIoU, %) on the MS COCO2014 val set.

Method	Network	Pretrained	val
CLIP-ES	DeepLabV2	ImageNet	45.4

Acknowledgement

We borrowed code from CLIP and pytorch_grad_cam. Thanks for their wonderful work.

Citation

If you find this project helpful for your research, please consider citing the following BibTeX entry.

@InProceedings{Lin_2023_CVPR,
    author    = {Lin, Yuqi and Chen, Minghao and Wang, Wenxiao and Wu, Boxi and Li, Ke and Lin, Binbin and Liu, Haifeng and He, Xiaofei},
    title     = {CLIP Is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {15305-15314}
}
