[26] DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting

Abstract

DenseCLIP recasts CLIP's original image-text matching problem as a pixel-text matching problem and uses the resulting pixel-text score maps to guide the training of dense prediction models.
Prompting the language model with contextual information from the image lets it make better use of the pre-trained knowledge.
Because the method is model-agnostic, it can be applied to arbitrary dense prediction systems and to a variety of pre-trained visual backbones, including both CLIP models and ImageNet pre-trained models.

Introduction

This paper studies how to fine-tune a pre-trained CLIP model well for dense prediction tasks.

Q. Can the impressive ability of CLIP be transferred to more complex vision tasks like dense prediction?

Compared with conventional ImageNet pre-trained models, CLIP has a gap between its upstream task (contrastive pre-training) and the downstream task (per-pixel prediction).
- The former involves instance-level representations of both images and text, whereas the latter relies only on visual information at the pixel level.
- To bridge this gap, the paper proposes DenseCLIP, a new language-guided dense prediction framework.
- As shown in Fig. 1, fine-tuning the CLIP model on downstream datasets lets it use the pre-trained CLIP knowledge implicitly or explicitly for a range of dense prediction tasks.

Motivation

1. Applying vision-language pre-trained models to dense prediction has not been actively studied. The language priors contained in the text encoder matter a great deal, so using only the image encoder as a pre-trained 2D backbone is not enough.
2. The gap between the upstream task (contrastive pre-training) and the downstream task (per-pixel prediction) makes knowledge transfer to dense prediction much harder: the upstream task considers instance-level representations of both images and text, while the downstream task has to make pixel-level predictions from visual information alone.

Method

One important finding is that, apart from the global image feature, a language-compatible feature map can be extracted from the last layer of the CLIP image encoder.
Take the ResNet encoder as an example. It has four stages, and unlike the original ResNet, CLIP adds an attention pooling layer on top. Global average pooling over the stage-4 feature map yields a 1×C global feature, which is concatenated with the stage-4 feature map and fed into a multi-head self-attention (MHSA) layer.
In standard CLIP training, only the global feature is used as the output of the image encoder and the remaining outputs z are ignored. The paper, however, observes two properties of z:

1) z still retains sufficient spatial information, so it can serve as a feature map.

2) MHSA is symmetric with respect to each input, so z should behave similarly to the global feature and align well with the language features.

-> Based on these observations, z can be used as a language-compatible feature map. (For architectures such as ViT, z is the output with the class token excluded.)
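
The attention pooling step described above can be sketched in NumPy as follows. This is a simplified stand-in, not CLIP's actual layer: it uses a single head, omits positional embeddings and the output projection, and the weight matrices `w_q`, `w_k`, `w_v` and all sizes are hypothetical. The point is only to show that the layer produces a global feature and a per-location map z with the same spatial layout as the input.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(x4, w_q, w_k, w_v):
    """Single-head stand-in for attention pooling (illustrative only).

    x4: (H*W, C) flattened stage-4 feature map.
    Returns (z_bar, z): the global feature (1, C) and the per-location
    outputs (H*W, C) that standard CLIP training ignores.
    """
    x_bar = x4.mean(axis=0, keepdims=True)         # global average pooling -> (1, C)
    tokens = np.concatenate([x_bar, x4], axis=0)   # (1 + H*W, C)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1])) # (1 + H*W, 1 + H*W)
    out = attn @ v
    return out[:1], out[1:]                        # split global token from the rest

rng = np.random.default_rng(0)
H, W, C = 4, 4, 8                                  # toy sizes (assumed)
x4 = rng.normal(size=(H * W, C))
w_q, w_k, w_v = (rng.normal(size=(C, C)) * 0.1 for _ in range(3))
z_bar, z = attention_pool(x4, w_q, w_k, w_v)
feature_map = z.reshape(H, W, C)                   # one output per spatial location
```

Every token passes through the same projections, which is the symmetry argument in observation 2.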

How are the text features obtained?

The template “a photo of a [CLS]” is fed to the CLIP text encoder to extract the text features t.
The language-compatible feature map z is then used to compute the pixel-text score maps. (A hat (^) denotes the l2-normalized version.) The score maps characterize the result of pixel-text matching and are used in two ways:
1. The score maps can be viewed as segmentation results at a lower resolution, so they can be trained with an auxiliary segmentation loss.
2. The score maps are concatenated to the last feature map to incorporate the language priors explicitly.

With only simple modifications (e.g., matching the input dimension), this approach can be applied directly to a variety of segmentation and detection models.
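
As a rough illustration of the score map computation, the sketch below takes the score map to be the product of the l2-normalized feature map and the transposed l2-normalized text features, s = ẑ t̂ᵀ, which is my reading of the paper's definition. The function names and the toy sizes are assumptions made for this example.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def pixel_text_score_map(z, t):
    """z: (H*W, C) feature map, t: (K, C) text features -> (H*W, K) scores."""
    return l2_normalize(z) @ l2_normalize(t).T

rng = np.random.default_rng(0)
H, W, C, K = 4, 4, 8, 3                      # toy sizes (assumed)
z = rng.normal(size=(H * W, C))              # language-compatible feature map
t = rng.normal(size=(K, C))                  # one text feature per class
s = pixel_text_score_map(z, t)               # cosine similarity per pixel and class
fused = np.concatenate([z, s], axis=-1)      # use 2: concat scores to the feature map
```

Concatenation changes the channel count from C to C + K, which is why the downstream decoder needs its input dimension adjusted.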

Context-Aware Prompting

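As in CoOp, a learnable textual context is trained instead of relying on a hand-written template.

In the most basic form, the input to the text encoder is the learnable textual context P followed by e_k, the word embedding of the k-th class.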
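
A minimal sketch of how such a prompt could be assembled, assuming each class name is a single token embedding and the same context P is shared across classes. `build_prompts` and the sizes N, C, K are hypothetical names and values chosen for illustration.

```python
import numpy as np

def build_prompts(p, e):
    """p: (N, C) shared learnable context, e: (K, C) class-name embeddings.

    Returns (K, N + 1, C): one token sequence [P, e_k] per class.
    """
    num_classes = e.shape[0]
    ctx = np.broadcast_to(p, (num_classes,) + p.shape)   # (K, N, C), same P per class
    return np.concatenate([ctx, e[:, None, :]], axis=1)  # append e_k as the last token

rng = np.random.default_rng(0)
N, C, K = 8, 16, 3                           # toy sizes (assumed)
p = rng.normal(size=(N, C)) * 0.02           # would be trained by gradient descent
e = rng.normal(size=(K, C))
prompts = build_prompts(p, e)                # what the text encoder would consume
```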

Vision-to-language prompting

Including a description of the visual context can make the text more accurate (e.g., "a photo of a cat in the grass" is more precise than "a photo of a cat").
The paper therefore studies how to use visual contexts to refine the text features.
It models the interaction between vision and language with the cross-attention mechanism of a Transformer decoder and proposes two forms of context-aware prompting.

The two variants below differ in what is given as the query of the Transformer decoder.

Pre-model prompting: the image context is used directly to generate the desired text input. (See Fig. 4.)

q denotes the learnable queries and v the extracted visual context.

That is, the learnable textual context P in Eq. 3 is replaced by v before being fed to the text encoder.
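
The sketch below illustrates pre-model prompting under strong simplifications: the Transformer decoder is reduced to one projection-free cross-attention step, and `cross_attention`, `pre_model_prompts`, and all sizes are hypothetical. It shows that the prompt now differs from image to image.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    """Projection-free cross-attention: each query attends over the context."""
    attn = softmax(queries @ context.T / np.sqrt(queries.shape[-1]))
    return attn @ context

def pre_model_prompts(q, z_bar, z, e):
    """q: (N, C) learnable queries, [z_bar, z]: (1 + H*W, C) visual context,
    e: (K, C) class embeddings. Returns (K, N + 1, C) image-specific prompts."""
    v_pre = cross_attention(q, np.concatenate([z_bar, z], axis=0))  # (N, C)
    ctx = np.broadcast_to(v_pre, (e.shape[0],) + v_pre.shape)
    return np.concatenate([ctx, e[:, None, :]], axis=1)             # v replaces P

rng = np.random.default_rng(0)
N, C, K, HW = 4, 8, 3, 16                    # toy sizes (assumed)
q = rng.normal(size=(N, C))
e = rng.normal(size=(K, C))
z_a, z_b = rng.normal(size=(HW, C)), rng.normal(size=(HW, C))      # two images
prompts_a = pre_model_prompts(q, z_a.mean(0, keepdims=True), z_a, e)
prompts_b = pre_model_prompts(q, z_b.mean(0, keepdims=True), z_b, e)
```

Because `prompts_a` and `prompts_b` differ, the text encoder would have to be run once per image.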
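
Post-model prompting: the class embeddings (text features) are refined after the text encoder. (See Fig. 4.)
- Text features are generated as in CoOp and used directly as the queries of the Transformer decoder.
- Trained this way, the text features can find the most relevant visual clues.
- The text features are then updated through a residual connection, where γ is a learnable parameter that controls the scale of the residual. γ is initialized to a very small value so that the language priors in the text features are preserved as much as possible.
Both variants share the same goal, but post-model prompting is preferred for the following reasons: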
1. Post-model prompting is more efficient.
  - In pre-model prompting the text encoder input depends on the image, so every inference requires an additional forward pass through the text encoder.
  - In post-model prompting the text features extracted by the text encoder can be stored after training and reused, which reduces the overhead at inference.
2. Empirically, post-model prompting also achieves better performance.
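
A minimal sketch of the post-model update, again with the Transformer decoder reduced to one projection-free cross-attention step. The helper names, the toy sizes, and the initial value 1e-4 for γ are assumptions for illustration; the notes only say that γ starts very small.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    """Projection-free cross-attention: each query attends over the context."""
    attn = softmax(queries @ context.T / np.sqrt(queries.shape[-1]))
    return attn @ context

def post_model_prompting(t, z_bar, z, gamma):
    """t: (K, C) text features used as decoder queries.
    Returns the refined text features t + gamma * v_post."""
    v_post = cross_attention(t, np.concatenate([z_bar, z], axis=0))  # (K, C)
    return t + gamma * v_post                                       # residual update

rng = np.random.default_rng(0)
C, K, HW = 8, 3, 16                          # toy sizes (assumed)
t = rng.normal(size=(K, C))                  # computed once by the text encoder, then cached
gamma = np.full(C, 1e-4)                     # small initial residual scale (assumed value)
z = rng.normal(size=(HW, C))
t_refined = post_model_prompting(t, z.mean(0, keepdims=True), z, gamma)
```

Only the lightweight decoder step depends on the image, so `t` can be reused for every input.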

Instantiations

Semantic segmentation
- For segmentation, an auxiliary objective is introduced to make better use of the pixel-text score maps.
- This auxiliary segmentation loss helps the feature map recover its locality faster.
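
One plausible form of this auxiliary objective is a per-pixel cross-entropy on the temperature-scaled score map. The sketch below assumes that form, assumes labels downsampled to the score map resolution, and uses a temperature of 0.07 as an assumed default; `aux_seg_loss` is a hypothetical name.

```python
import numpy as np

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def aux_seg_loss(s, y, tau=0.07):
    """s: (H*W, K) pixel-text scores, y: (H*W,) integer labels in [0, K).

    Cross-entropy of softmax(s / tau) against y, averaged over pixels.
    """
    logp = log_softmax(s / tau)
    return -logp[np.arange(y.shape[0]), y].mean()

K = 3
y = np.array([0, 1, 2, 0])                               # toy labels for 4 pixels
s_uniform = np.zeros((4, K))                             # no preference for any class
s_good = np.where(np.eye(K)[y] == 1, 1.0, -1.0)          # high score on the true class
loss_uniform = aux_seg_loss(s_uniform, y)
loss_good = aux_seg_loss(s_good, y)
```

The small temperature sharpens the cosine scores, which lie in [-1, 1], so a well-aligned score map gives a loss close to zero.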
Object detection & Instance segmentation
- No ground-truth segmentation labels are available in this setting.
- To design an auxiliary loss analogous to the segmentation case, the bounding boxes and their labels are used to build a binary target.
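
The notes do not spell out how the binary target is built, so the sketch below assumes the simplest reading: every score-map cell inside a box is marked positive for that box's class, and the loss is a binary cross-entropy on the sigmoid of the temperature-scaled scores. The function names, the box format, and the temperature 0.07 are assumptions.

```python
import numpy as np

def boxes_to_binary_target(boxes, labels, h, w, num_classes):
    """boxes: (x0, y0, x1, y1) in score-map cell coordinates, x1/y1 exclusive.

    Returns an (h, w, K) array with 1 where a box of class k covers the cell.
    """
    target = np.zeros((h, w, num_classes), dtype=np.float64)
    for (x0, y0, x1, y1), k in zip(boxes, labels):
        target[y0:y1, x0:x1, k] = 1.0
    return target

def aux_det_loss(s, target, tau=0.07):
    """Binary cross-entropy on sigmoid(s / tau); s and target are (h, w, K)."""
    logits = s / tau
    # Numerically stable BCE-with-logits formulation.
    loss = np.maximum(logits, 0) - logits * target + np.log1p(np.exp(-np.abs(logits)))
    return loss.mean()

boxes = [(0, 0, 2, 2), (1, 1, 4, 3)]          # two toy boxes that overlap at cell (1, 1)
labels = [0, 2]
target = boxes_to_binary_target(boxes, labels, h=4, w=4, num_classes=3)
loss = aux_det_loss(np.zeros((4, 4, 3)), target)
```

A binary (multi-label) target is needed here because boxes of different classes can overlap, so a cell may be positive for more than one class.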
Applications to any backbone models
- In practice, the CLIP image encoder can be replaced with any other backbone (e.g., ImageNet pre-trained models and self-supervised models).
- Even without a strong relation between the visual backbone and the text encoder, the backbone can learn faster and better with language guidance.
- In other words, the language priors from the pre-trained text encoder can improve any pre-trained image backbone, which shows that DenseCLIP is a generic framework for dense prediction.

Experiments

Semantic segmentation

Ablations
- All ablations use a fixed ResNet-50 backbone.
- Both pre-model and post-model prompting improve performance, but post-model prompting performs better and is more computationally efficient.
Effects of language-guided pre-training and fine-tuning.

To show the potential of the language-guided paradigm, different pre-training and fine-tuning strategies are compared on the ADE20K dataset.
- With vanilla fine-tuning, the CLIP pre-trained model already outperforms the ImageNet1K pre-trained model.
- With language-guided fine-tuning and context-aware prompting, DenseCLIP goes further and clearly outperforms the ImageNet21K pre-trained model.
→ Language priors can largely facilitate vision models in downstream dense prediction tasks.

Object Detection and Instance Segmentation

DenseCLIP for Any Visual Backbone

Is DenseCLIP only suitable for CLIP image encoders?

The experiments show that DenseCLIP can also perform well with other backbones.

Although there are no strong correlations between the feature maps of the new backbone and the text features output by the CLIP text encoder, we hypothesize that if we preserve the language priors by freezing the text encoder as before, the text encoder will guide the backbone to better adapt to downstream tasks.

The comparison covers ResNet + Semantic FPN and SwinT + UperNet.

We demonstrate that our DenseCLIP can consistently improve all the baseline models notably.

Because the text encoder is needed only during training, this is a low-cost solution.
