[27] Token-Label Alignment for Vision Transformers

Abstract

Data mixing strategies (e.g., CutMix) have been shown to be effective for vision transformers (ViTs), but we identify a token fluctuation phenomenon that has suppressed their potential.
- We empirically observe that the contributions of input tokens fluctuate during forward propagation, which might induce a different mixing ratio in the output tokens.
To address this, we propose a token-label alignment (TL-Align) method to trace the correspondence between transformed tokens and the original tokens to maintain a label for each token.
- We reuse the computed attention at each layer for efficient token-label alignment, introducing only negligible additional training costs.

Introduction

We find that self-attention in ViTs causes a fluctuation of the original spatial structure.
- Unlike CNNs, whose translation equivariance guarantees global label consistency, self-attention in ViTs weakens global consistency and causes misalignment between tokens and labels.
- Because of this misalignment, the mixing ratio of the output tokens changes, so the training target computed by the original data mixing strategies can be inaccurate, which makes training less effective.
To address this, the paper proposes a token-label alignment (TL-Align) method for ViTs that yields a more accurate target during training.
- First, a label is assigned to each input token according to the token's source.
- The correspondence between input tokens and transformed tokens is traced, and the labels are aligned accordingly.
- Channel MLPs and layer normalization process each token independently, so only self-attention and residual connections are assumed to change the presence of the input tokens.
- The attention already computed for producing the transformed tokens is reused to linearly mix the labels of the input tokens.
- For class-token-based classification, the aligned label of the output class token is used directly as the training target.

Method

In ViTs, input-dependent weights provide flexibility, but they also cause a mismatch between the processed tokens and the initial tokens. To address this, the token-label alignment method traces the correspondence between input tokens and transformed tokens to obtain aligned labels.

Concretely, a ViT first splits the mixed input into patches and flattens them into tokens. It then applies a projection and adds positional embeddings.
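
The tokenization step can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the paper's implementation: it assumes a single-channel image stored as a list of rows, and the names `patchify`, `embed`, `weight`, and `pos` are hypothetical.

```python
def patchify(image, patch):
    """Split an H x W image (list of rows) into flattened patch x patch tokens.

    Tokens are returned in row-major order over the patch grid.
    """
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tokens.append([image[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return tokens


def embed(tokens, weight, pos):
    """Project each flattened patch with `weight` and add its positional embedding.

    weight: (patch * patch) x D matrix (hypothetical projection weights)
    pos:    one D-dimensional positional embedding per token
    """
    dim = len(weight[0])
    out = []
    for i, tok in enumerate(tokens):
        proj = [sum(t * w_row[d] for t, w_row in zip(tok, weight))
                for d in range(dim)]
        out.append([p + q for p, q in zip(proj, pos[i])])
    return out


# Toy usage: a 4 x 4 image split into four 2 x 2 patches, embedded into D = 2.
image = [[4 * r + c for c in range(4)] for r in range(4)]
tokens = patchify(image, patch=2)
weight = [[1, 0], [0, 1], [1, 0], [0, 1]]
pos = [[i, i] for i in range(len(tokens))]
embedded = embed(tokens, weight, pos)
```

Because each token corresponds to exactly one patch at this stage, the source of every token (X1 or X2) is still known, which is what makes per-token label initialization possible.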

Label Initialization
1. Each token is first assigned a label embedding over the classes. The label embedding represents a probability distribution over the C classes (˜yi gives the probability that the i-th token belongs to each class, and its entries sum to 1).
  - Class token
    - When CutMix mixes X1 and X2, the class token is initialized from the mixing ratio. That is, when images of class j and class k are mixed, ˜ycls,j is initialized to λ and ˜ycls,k to 1 − λ.
  - Patch token
    - If the patch comes from X1, it is initialized with ˜yi,j = 1; if it comes from X2, with ˜yi,k = 1.
    - If a patch contains content from both images, the mixing ratio within that patch is used as its label.
  - For MixUp, we can simply set all label embeddings {˜yi} with ˜y,j = λ and ˜y,k = 1 − λ.
2. Spatial Mixing: TL-Align is performed in a layer-wise manner.
  - Q, K, and V are used to compute the attention matrix A(Q, K). A is what mixes the tokens spatially, so it describes how tokens and labels stay consistent with each other.
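
The label initialization described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names and the `from_x2` input (the fraction of each patch's pixels taken from X2) are hypothetical, and λ is computed as the share of X1 over equally sized patches.

```python
def label_embedding(ratio_k, class_j, class_k, num_classes):
    """Distribution over classes with mass (1 - ratio_k) on j and ratio_k on k."""
    y = [0.0] * num_classes
    y[class_j] += 1.0 - ratio_k
    y[class_k] += ratio_k
    return y


def init_labels_cutmix(from_x2, class_j, class_k, num_classes):
    """Initialize labels for [class token] + patch tokens of a CutMix image.

    from_x2[i] is the fraction of patch i taken from X2:
    0.0 (pure X1), 1.0 (pure X2), or in between for a mixed patch.
    """
    patch_labels = [label_embedding(f, class_j, class_k, num_classes)
                    for f in from_x2]
    # Assumption: patches have equal area, so lambda is the mean X1 share.
    lam = 1.0 - sum(from_x2) / len(from_x2)
    cls_label = label_embedding(1.0 - lam, class_j, class_k, num_classes)
    return [cls_label] + patch_labels


def init_labels_mixup(num_tokens, lam, class_j, class_k, num_classes):
    """MixUp blends every pixel, so every token gets the same label."""
    return [label_embedding(1.0 - lam, class_j, class_k, num_classes)
            for _ in range(num_tokens)]


# Toy usage: four patches, two from X1, one from X2, one half-and-half.
labels = init_labels_cutmix([0.0, 0.0, 1.0, 0.5],
                            class_j=0, class_k=2, num_classes=3)
```

Every row of `labels` sums to 1, which the later alignment steps rely on.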

Label Alignment
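- The matrix A is used to align the labels.
- MSA: when attention is computed per head, the attention matrices are averaged over the heads and the average is used for label alignment.
- Transformer block: tokens are processed by spatial and channel mixing (layer norm, MLP).
- Hierarchical Vision Transformers, patch merging: instead of concatenating along the channel dimension as is done for the patches, the label embeddings of the merged patches are added together and then normalized to align the labels.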
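
A minimal sketch of these alignment steps is given below. It assumes each attention row is already softmax-normalized, uses hypothetical function names, and leaves out the residual connection because these notes do not spell out how the skip path is weighted.

```python
def average_heads(attn_heads):
    """Average per-head attention (heads x N x N) into a single N x N matrix."""
    num_heads = len(attn_heads)
    n = len(attn_heads[0])
    return [[sum(head[i][j] for head in attn_heads) / num_heads
             for j in range(n)]
            for i in range(n)]


def align_labels(attn, labels):
    """Mix labels with the same weights that mix the tokens.

    New label of token i = sum_j attn[i][j] * labels[j].
    Rows of attn sum to 1, so each output label still sums to 1.
    """
    num_classes = len(labels[0])
    return [[sum(a * y[k] for a, y in zip(row, labels))
             for k in range(num_classes)]
            for row in attn]


def merge_labels(group):
    """Patch merging: add the labels of the merged tokens, then renormalize."""
    num_classes = len(group[0])
    summed = [sum(y[k] for y in group) for k in range(num_classes)]
    total = sum(summed)
    return [s / total for s in summed]


# Toy usage: two tokens, two classes, two attention heads.
labels = [[1.0, 0.0], [0.0, 1.0]]
heads = [
    [[0.5, 0.5], [0.0, 1.0]],
    [[1.0, 0.0], [0.5, 0.5]],
]
aligned = align_labels(average_heads(heads), labels)
merged = merge_labels([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]])
```

Since the attention matrix is computed in the forward pass anyway, the only extra work is one small matrix product per layer over the label embeddings.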

TL-Align aligns the labels with the tokens at every layer to maintain consistency. The final image representation is obtained from the class token or from average pooling over all spatial tokens, depending on the model architecture.
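
The choice of training target can be sketched as below. The class-token case follows the notes above; using the mean of the spatial tokens' aligned labels for average-pooling models is an assumption made here by analogy, and the soft-target cross-entropy is a standard loss chosen for illustration. All names are hypothetical.

```python
import math


def training_target(labels, use_class_token=True):
    """Pick the target from aligned labels; labels[0] belongs to the class token."""
    if use_class_token:
        return labels[0]
    # Assumption: average-pooling models use the mean label of spatial tokens.
    spatial = labels[1:]
    num_classes = len(spatial[0])
    return [sum(y[k] for y in spatial) / len(spatial)
            for k in range(num_classes)]


def soft_cross_entropy(logits, target):
    """Cross-entropy between softmax(logits) and a soft target distribution."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(t * (x - log_z) for t, x in zip(target, logits))


# Toy usage: class-token label first, then three spatial-token labels.
aligned = [[0.6, 0.4], [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
cls_target = training_target(aligned, use_class_token=True)
pooled_target = training_target(aligned, use_class_token=False)
loss = soft_cross_entropy([0.0, 0.0], pooled_target)
```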
