raoyongming / DenseCLIP

[CVPR 2022] DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting

Query on Inference Setting #8

Closed · sauradip closed this 2 years ago

sauradip commented 2 years ago

Hi,

Thanks for making the code public !

I had a general query about the inference setting chosen for this paper: why does it not target the zero-shot setting and instead focus on the fully supervised setting? Is there a particular reason? Since the power of CLIP lies in zero-shot task transfer, I was wondering why no experiments were done for this, and why the problem was instead posed as a multi-modal, fully supervised dense prediction task.

Thanks in advance

raoyongming commented 2 years ago

Hi,

Thanks for your interest in our work. In this work, we want to study how to apply large-scale pre-trained models to various dense prediction tasks. Our method has many applications, such as replacing conventional ImageNet pre-trained or unsupervised pre-trained backbones with CLIP models. Due to the large gap between instance-level image-text pre-training and dense prediction tasks, we found that CLIP models can only obtain relatively low performance on zero-shot segmentation or detection tasks (e.g., 15.3 mIoU on ADE20K according to [2]), which is not strong enough for many application scenarios. Therefore, we focus on the fully supervised dense prediction setting, where we can use more supervision to fully exploit the power of CLIP pre-training. We also show that our method can be applied to any visual backbone (see Section 4.3).
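
For concreteness, here is a minimal sketch (not the code from this repository) of the kind of naive per-pixel zero-shot pipeline being discussed. It assumes a dense feature map has already been extracted from the CLIP image encoder and that one text embedding per class prompt is available; the function name and shapes are illustrative only.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_segment(visual_feat, text_feat, out_size):
    """Label every pixel with the class whose text embedding it matches best.

    visual_feat: (B, C, h, w) dense features from a frozen CLIP image encoder
    text_feat:   (K, C) embeddings of K class prompts from the CLIP text encoder
    out_size:    (H, W) resolution of the output label map
    """
    v = F.normalize(visual_feat, dim=1)             # L2-normalize per pixel
    t = F.normalize(text_feat, dim=1)               # L2-normalize per class
    logits = torch.einsum("bchw,kc->bkhw", v, t)    # pixel-text cosine similarity
    logits = F.interpolate(logits, size=out_size,
                           mode="bilinear", align_corners=False)
    return logits.argmax(dim=1)                     # (B, H, W) predicted labels
```

The gap described above shows up in pipelines of roughly this form: the matching is done per pixel with features that were pre-trained only for image-level alignment, so the zero-shot accuracy is limited.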

I think both the zero-shot transfer ability and the rich knowledge learned from large-scale text-image pre-training are key advantages of CLIP. Some recent papers, such as [1] and [2], explore the former, while we study the latter.

[1] DenseCLIP: Extract Free Dense Labels from CLIP, https://arxiv.org/abs/2112.01071
[2] A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model, https://arxiv.org/abs/2112.14757

sauradip commented 2 years ago

Thanks for the clarification! But I am still curious: the [2] paper you cited, and some follow-ups after it, argue that a two-stage approach (first stage proposes/generates masks, second stage aligns pixels to those masks for the zero-shot semantic setup) is the way forward for zero-shot transfer on dense tasks. In your implementation, it looks like you also use a two-stage approach for the decoding part (the text-pixel embedding is fused with the visual embedding and passed into the decoder). Then why do you think the performance in your case still drops in the zero-shot scenario?
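
To make the fusion step I am referring to concrete, here is a minimal sketch (not the repository's actual implementation) of computing a pixel-text score map and concatenating it with the visual features before the decode head; all names and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def language_guided_fusion(visual_feat, text_feat):
    """Fuse dense visual features with class text embeddings.

    visual_feat: (B, C, H, W) dense feature map from the image encoder
    text_feat:   (K, C) one embedding per class prompt from the text encoder
    Returns the pixel-text score map (B, K, H, W) and the fused tensor
    (B, C+K, H, W) that would be fed to the decode head.
    """
    v = F.normalize(visual_feat, dim=1)
    t = F.normalize(text_feat, dim=1)
    # Cosine similarity between every pixel feature and every class embedding.
    score_map = torch.einsum("bchw,kc->bkhw", v, t)
    # Append the score map as extra channels so the decoder sees the language guidance.
    fused = torch.cat([visual_feat, score_map], dim=1)
    return score_map, fused
```

Concatenating the score map as extra channels keeps the decoder itself unchanged, so the language guidance enters only through those additional channels.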

raoyongming commented 2 years ago

I think one key reason is that the feature maps extracted by the pre-trained CLIP visual encoder lack locality (i.e., a feature at a given location may not precisely represent the semantic information of the corresponding patch/region). Therefore, we need to fine-tune the encoder with pixel-wise supervision to recover the locality.
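
To make "fine-tune the encoder with pixel-wise supervision" concrete, here is a minimal, self-contained sketch of one such training step; the modules, class count, and learning rates are placeholders rather than the configuration used in this repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder modules standing in for the pre-trained CLIP image encoder and a
# decode head; the real modules, dataset, and number of classes are assumptions.
encoder = nn.Conv2d(3, 64, kernel_size=3, stride=4, padding=1)  # "backbone" stand-in
decode_head = nn.Conv2d(64, 19, kernel_size=1)                  # per-pixel classifier, K = 19

# A smaller learning rate for the pre-trained encoder is a common choice, so the
# pixel-wise supervision adapts it without washing out the CLIP pre-training.
optimizer = torch.optim.AdamW([
    {"params": encoder.parameters(), "lr": 1e-5},
    {"params": decode_head.parameters(), "lr": 1e-4},
])

images = torch.randn(2, 3, 128, 128)              # dummy batch of images
masks = torch.randint(0, 19, (2, 128, 128))       # dummy per-pixel labels

feats = encoder(images)                           # (B, C, h, w) dense features
logits = decode_head(feats)                       # (B, K, h, w) per-pixel class logits
logits = F.interpolate(logits, size=masks.shape[-2:],
                       mode="bilinear", align_corners=False)
loss = F.cross_entropy(logits, masks)             # pixel-wise supervision
loss.backward()
optimizer.step()
```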