[30] Learning Token-Based Representation for Image Retrieval

Abstract

To generate compact global representations while maintaining regional matching capability, we propose a unified framework to jointly learn local feature representation and aggregation.
In our framework, we first extract deep local features using CNNs. Then, we design a tokenizer module to aggregate them into a few visual tokens, each corresponding to a specific visual pattern. This helps to remove background noise, and capture more discriminative regions in the image.
Next, a refinement block is introduced to enhance the visual tokens with self-attention and cross-attention.
Finally, different visual tokens are concatenated to generate a compact global representation. The whole framework is trained end-to-end with image-level labels.

Here a CNN backbone is used to obtain the deep local representation F. These local features have a limited receptive field with respect to the input image, so Local Feature Self-Attention is applied to produce context-aware local features. A spatial attention mechanism then divides them into L groups, and the local features in each group are aggregated to form a visual token t.

A refinement block then updates the set of visual tokens T. Finally, all visual tokens are concatenated and reduced in dimension to produce the final global descriptor. The model is trained with the ArcFace margin loss.
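The last two steps (concatenate tokens, reduce dimension, train with an ArcFace margin) can be sketched in NumPy as below. This is a minimal illustration, not the paper's implementation: the function names, the `scale` and `margin` values, the tensor sizes, and the use of a single linear projection for the dimension reduction are all assumptions made for the example.

```python
import numpy as np

def global_descriptor(tokens, proj):
    """Concatenate visual tokens and reduce the dimension.

    tokens: (B, L, C) visual tokens per image (hypothetical sizes).
    proj:   (L * C, D) linear reduction, assumed here to be one FC layer.
    """
    concat = tokens.reshape(tokens.shape[0], -1)   # (B, L * C)
    return concat @ proj                           # (B, D)

def arcface_loss(desc, class_weights, labels, scale=30.0, margin=0.15):
    """ArcFace: add an angular margin to the target-class angle, then cross-entropy.

    scale and margin are illustrative values, not taken from the notes.
    """
    d = desc / np.linalg.norm(desc, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    cos = np.clip(d @ w.T, -1.0, 1.0)              # (B, num_classes)
    theta = np.arccos(cos)
    rows = np.arange(len(labels))
    theta[rows, labels] += margin                  # margin only on the true class
    logits = scale * np.cos(theta)
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[rows, labels].mean()

rng = np.random.default_rng(0)
tokens = rng.standard_normal((4, 3, 8))            # B=4 images, L=3 tokens, C=8
proj = rng.standard_normal((24, 16))               # reduce 3*8 -> 16
class_weights = rng.standard_normal((5, 16))       # 5 hypothetical classes
labels = np.array([0, 2, 1, 4])

desc = global_descriptor(tokens, proj)
loss = arcface_loss(desc, class_weights, labels)
```

The margin makes the target class harder to score well on, so the loss with a margin is normally at least as large as the loss with `margin=0.0` on the same inputs.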

Tokenizer

For data with noisy backgrounds or occlusions, patch-level matching between images is important. Here as well, a 1x1 conv layer is applied to the local features F to obtain attention maps.
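A rough NumPy sketch of the tokenizer follows. A 1x1 conv is a per-location linear map, so it is written as a matrix product. Two things are assumptions rather than statements from these notes: the attention is normalized with a softmax across the L patterns (chosen to match the soft-cluster-assignment reading in the GMM section), and each token is the attention-weighted average of the local features. The names `tokenize` and `pattern_weights` are hypothetical.

```python
import numpy as np

def tokenize(local_feats, pattern_weights):
    """Aggregate local features into L visual tokens.

    local_feats:     (H, W, C) local feature map.
    pattern_weights: (L, C) weights of a 1x1 conv with L output channels.
    """
    height, width, channels = local_feats.shape
    flat = local_feats.reshape(height * width, channels)   # (N, C)
    logits = flat @ pattern_weights.T                      # (N, L), the 1x1 conv
    logits -= logits.max(axis=1, keepdims=True)            # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)                # softmax over patterns (assumed)
    mass = attn.sum(axis=0)                                # (L,) total weight per pattern
    tokens = (attn.T @ flat) / mass[:, None]               # (L, C) weighted averages
    return tokens, attn.reshape(height, width, -1)

rng = np.random.default_rng(0)
feats = rng.standard_normal((7, 7, 16))    # hypothetical 7x7 map with 16 channels
weights = rng.standard_normal((4, 16))     # L = 4 visual patterns
tokens, attn = tokenize(feats, weights)
```

Each spatial location distributes a total weight of 1 across the L patterns, so background locations can contribute mostly to patterns that carry little discriminative content.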

Relation to GMM

The tokenizer is similar to a Gaussian Mixture Model (GMM). A GMM is a probabilistic model that assumes every data point is generated from a mixture of a finite number of Gaussian distributions with unknown mean vectors and variances.

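In Eq. (3), σ is set to 1 and p(z = j) is set to $\phi(\|w_i\|^2)$. Then $a^{(i)}_{h,w}$ can be interpreted as the soft cluster assignment of the local feature of F_c at location (h, w) to the i-th visual pattern, which has the same meaning as $p(z = j \mid f_i)$.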
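Eq. (3) is not reproduced in these notes. Assuming it is the standard posterior of an isotropic GMM with shared variance σ² = 1, the connection to attention can be worked out as follows. The symbol $c_j$ for the mean of component j and the sum index k are introduced here for illustration only.

```latex
\begin{aligned}
p(z=j \mid f_i)
 &= \frac{p(z=j)\,\exp\!\left(-\tfrac{1}{2}\lVert f_i - c_j\rVert^2\right)}
         {\sum_{k=1}^{L} p(z=k)\,\exp\!\left(-\tfrac{1}{2}\lVert f_i - c_k\rVert^2\right)} \\
 &= \frac{p(z=j)\,\exp\!\left(-\tfrac{1}{2}\lVert c_j\rVert^2\right)\exp\!\left(c_j^\top f_i\right)}
         {\sum_{k=1}^{L} p(z=k)\,\exp\!\left(-\tfrac{1}{2}\lVert c_k\rVert^2\right)\exp\!\left(c_k^\top f_i\right)}
\end{aligned}
```

The second line expands the squared distance and cancels the common factor $\exp(-\tfrac{1}{2}\lVert f_i\rVert^2)$. If the prior is chosen so that $p(z=j) \propto \exp(\tfrac{1}{2}\lVert c_j\rVert^2)$, the remaining prefactors cancel too and the posterior reduces to a softmax over the dot products $c_j^\top f_i$. That is the same form as an attention map produced by a 1x1 conv whose weights play the role of the component means, which is one way to read why the prior is made a function of the squared weight norm.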

Refinement Block

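Relation modeling. During tokenization, the different attention maps are used separately. To model the relationships among the visual tokens, self-attention is applied to produce relation-aware visual tokens. The visual tokens are mapped to Q, K, and V, and multi-head self-attention (MHSA) computes the token-to-token similarity S. The output is computed per head, aggregated across heads, and passed through a learnable projection.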
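The relation-modeling step can be sketched as plain multi-head self-attention over the L tokens. This is a generic MHSA written in NumPy under assumed sizes. The scaling by the square root of the head dimension is the usual convention, and the notes do not say whether residual connections or normalization are used, so they are left out. All weight names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(tokens, w_q, w_k, w_v, w_o, num_heads):
    """tokens: (L, C); w_q, w_k, w_v, w_o: (C, C); C must be divisible by num_heads."""
    num_tokens, channels = tokens.shape
    head_dim = channels // num_heads

    def split(x):
        # (L, C) -> (heads, L, head_dim)
        return x.reshape(num_tokens, num_heads, head_dim).transpose(1, 0, 2)

    q, k, v = split(tokens @ w_q), split(tokens @ w_k), split(tokens @ w_v)
    sim = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(head_dim))   # S: (heads, L, L)
    per_head = sim @ v                                            # (heads, L, head_dim)
    merged = per_head.transpose(1, 0, 2).reshape(num_tokens, channels)
    return merged @ w_o, sim                                      # learnable projection

rng = np.random.default_rng(0)
tokens = rng.standard_normal((4, 16))                  # L = 4 tokens, C = 16
w_q, w_k, w_v, w_o = (rng.standard_normal((16, 16)) for _ in range(4))
relation_tokens, sim = multi_head_self_attention(tokens, w_q, w_k, w_v, w_o, num_heads=2)
```

Each row of `sim` sums to 1, so every output token is a convex combination of the value vectors of all tokens within a head.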
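Visual token enhancement. Features are extracted from the local feature map through cross-attention. F_c is flattened into a sequence, and separate FC layers map the visual tokens to Q and the flattened local features to K and V; the similarity between the visual tokens and the original local features is then computed.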
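A single-head NumPy sketch of this cross-attention step is shown below. It assumes the usual layout in which queries come from the visual tokens and keys and values come from the flattened local features. The function and weight names are hypothetical, and multi-head splitting is omitted to keep the example short.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(tokens, local_feats, w_q, w_k, w_v):
    """tokens: (L, C); local_feats: (H, W, C); w_q, w_k, w_v: (C, D)."""
    height, width, channels = local_feats.shape
    seq = local_feats.reshape(height * width, channels)   # flatten F_c to (N, C)
    q = tokens @ w_q                                      # (L, D) from visual tokens
    k = seq @ w_k                                         # (N, D) from local features
    v = seq @ w_v                                         # (N, D) from local features
    sim = softmax(q @ k.T / np.sqrt(q.shape[1]))          # (L, N) token-to-location
    return sim @ v, sim

rng = np.random.default_rng(0)
tokens = rng.standard_normal((4, 16))
feats = rng.standard_normal((7, 7, 16))
w_q, w_k, w_v = (rng.standard_normal((16, 16)) for _ in range(3))
enhanced, sim = cross_attention(tokens, feats, w_q, w_k, w_v)
```

Each enhanced token is a weighted sum over all 49 locations, so it can pull back details from the original feature map that were averaged away during tokenization.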

sy00n / DL_paper_review