[arXiv 2022] Disentangled Representation Learning for Text-Video Retrieval

uhhyunjoo commented 2 years ago

link
paper	Disentangled Representation Learning for Text-Video Retrieval
code	papers with code

uhhyunjoo commented 2 years ago

Abstract

The most important factor in the performance of Text-Video Retrieval (TVR) is cross-modality interaction, i.e. how well the relationship between the two modalities is captured. Yet there has been almost no research on how the components used to compute this interaction affect performance.
This paper is the first study to examine the interaction paradigm in depth, and it splits the computation of the interaction into two parts:
1. interaction contents at different granularity
2. matching function to distinguish pairs with the same semantics
  - The authors found that single-vector representations and implicit intensive functions hinder optimization.
  - Based on this, they propose a disentangled framework (DRL) to capture a sequential and hierarchical representation.

DRL consists of two modules:
1. Weighted Token-wise Interaction (WTI)
2. Channel DeCorrelation Regularization (CDCR)
  - Together, these make it possible to learn a disentangled representation, and the method outperforms CLIP4Clip on several benchmarks, achieving SOTA.

uhhyunjoo commented 2 years ago

Left: the typical architecture of an interaction method for Text-Video Retrieval
Right: the 'interaction block' of interaction methods, divided into six types (a-f) according to process flow
The process flows are distinguished by two factors: the granularity of the input content and the interaction function.

Framework proposed in this paper: DRL (Disentangled Representation Learning Framework)

Weighted Token-wise Interaction module
- A lightweight token-wise interaction in which every sentence token can fully interact with every video frame token (e, f)
- Compared with single-vector interaction (a, c in the right figure) and multi-level interaction (b in the right figure), this method preserves fine-grained clues better.
- Compared with cross transformer interaction (d in the right figure), it eases the optimization difficulty and reduces the computational overhead.
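
To make the token-wise idea concrete, here is a minimal sketch of a weighted token-wise similarity between one sentence and one video. These notes do not give the exact formula, so the sketch assumes a common form: cosine similarity between every sentence token and every frame token, a max over the other modality for each token, and a softmax-weighted sum in both directions. The function names and the `text_logits` / `frame_logits` inputs are hypothetical; in a real model the per-token importance scores would come from a small learned network, whereas here they are passed in as plain numbers.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def softmax(scores):
    """Turn raw importance scores into weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def weighted_token_wise_interaction(text_tokens, frame_tokens,
                                    text_logits, frame_logits):
    """Similarity of one sentence and one video from token-level features.

    Assumed form (not taken from the notes): every token picks its best
    match in the other modality, and the matches are combined with
    softmax weights in both directions.
    """
    text = [l2_normalize(t) for t in text_tokens]
    frames = [l2_normalize(f) for f in frame_tokens]

    # sim[i][j] = cosine similarity of sentence token i and frame token j
    sim = [[dot(t, f) for f in frames] for t in text]

    text_w = softmax(text_logits)
    frame_w = softmax(frame_logits)

    # text-to-video: best frame for each sentence token, weighted per token
    t2v = sum(w * max(row) for w, row in zip(text_w, sim))
    # video-to-text: best sentence token for each frame, weighted per frame
    v2t = sum(w * max(sim[i][j] for i in range(len(text)))
              for j, w in enumerate(frame_w))

    return 0.5 * (t2v + v2t)
```

The only pairwise computation is the token-by-frame similarity matrix, which is why this kind of interaction stays cheap compared with running a cross transformer over every candidate pair.
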
Channel DeCorrelation Regularization module
- CDCR minimizes the redundancy among the components of the vectors being compared, which makes it easier to learn a hierarchical representation.
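
As a rough illustration of what "minimizing redundancy among components" can look like, the sketch below penalizes correlation between different channels of paired text and video vectors. The formula is not given in these notes, so this assumes a Barlow Twins style penalty on the cross-correlation matrix: diagonal entries are pushed toward 1 and off-diagonal entries toward 0. The function names and the `off_diag_weight` value are hypothetical and chosen only for illustration.

```python
import math


def standardize_columns(rows):
    """Give every channel (column) zero mean and unit variance over the batch."""
    n = len(rows)
    out_cols = []
    for col in zip(*rows):
        mean = sum(col) / n
        # fall back to 1.0 so a constant channel does not divide by zero
        std = math.sqrt(sum((x - mean) ** 2 for x in col) / n) or 1.0
        out_cols.append([(x - mean) / std for x in col])
    return [list(row) for row in zip(*out_cols)]


def channel_decorrelation_loss(text_vecs, video_vecs, off_diag_weight=0.1):
    """Decorrelation penalty for a batch of paired text/video vectors.

    Assumed form (not taken from the notes):
        sum_i (1 - C[i][i])^2 + off_diag_weight * sum_{i != j} C[i][j]^2
    where C is the channel cross-correlation matrix over the batch.
    """
    n = len(text_vecs)
    t = standardize_columns(text_vecs)
    v = standardize_columns(video_vecs)
    dim = len(t[0])

    # corr[i][j] = correlation of text channel i with video channel j
    corr = [[sum(t[b][i] * v[b][j] for b in range(n)) / n
             for j in range(dim)]
            for i in range(dim)]

    on_diag = sum((1.0 - corr[i][i]) ** 2 for i in range(dim))
    off_diag = sum(corr[i][j] ** 2
                   for i in range(dim) for j in range(dim) if i != j)
    return on_diag + off_diag_weight * off_diag
```

Under this assumed form, the penalty is zero when matching channels agree perfectly and different channels are uncorrelated, and it grows when two channels carry the same information.
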

In short, the key idea of DRL is that combining a lightweight token-wise interaction with CDCR yields representations well suited to TVR.