[arXiv 2021] CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval

uhhyunjoo commented 2 years ago

link
paper	CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval
code	papers with code

Abstract

Proposes CLIP4Clip, a model for video-text retrieval
CLIP : an image-language pre-training model; it showed that a model that learns visual concepts from image-text datasets is powerful
CLIP4Clip : an end-to-end model that transfers the knowledge of CLIP to video-language retrieval
Runs empirical studies to investigate several questions
- Are image features enough for video-text retrieval? → ⭕
- How does post-pretraining CLIP4Clip on a large-scale video-text dataset affect performance? → 👍
- What is a practical way to model the temporal dependency between video frames? → {'linear projection' : 3D, 'similarity calculator' : sequential type}
- How sensitive is the model to hyperparameters on the video-text retrieval task? → learning rate sensitivity
CLIP4Clip achieves SOTA on various video-text retrieval datasets

CLIP4Clip Framework

Goal : learn s(v, t), a similarity function between video and text, from N (text, video) pairs
Task : text-to-video retrieval, video-to-text retrieval
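
To make the two tasks concrete, here is a minimal sketch of retrieval once some s(v, t) is available. It assumes each video and each text has already been encoded into a single vector and uses cosine similarity as s(v, t); both the toy embeddings and the helper names are made up for illustration and are not taken from the paper's code.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(video_embs, text_embs):
    """sim[i][j] = s(v_i, t_j) for every (video, text) pair."""
    return [[cosine_similarity(v, t) for t in text_embs] for v in video_embs]

def text_to_video(sim, text_idx):
    """Rank video indices for one text query, most similar first."""
    scores = [row[text_idx] for row in sim]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

def video_to_text(sim, video_idx):
    """Rank text indices for one video query, most similar first."""
    scores = sim[video_idx]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)

# Toy 2-d embeddings (hypothetical values, only for illustration).
video_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_embs = [[0.9, 0.1], [0.1, 0.9], [1.0, 1.0]]

sim = similarity_matrix(video_embs, text_embs)
print(text_to_video(sim, 0))  # [0, 2, 1]
print(video_to_text(sim, 1))  # [1, 2, 0]
```

Both directions read from the same similarity matrix: text-to-video ranks along a column, video-to-text ranks along a row.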

I'll go through the three parts in order: Video Encoder, Text Encoder, Similarity Calculator

Video Encoder

This is the part that gets a video representation from the video!
The Video Encoder uses ViT-B/32.
That is, the pretrained CLIP (ViT-B/32) is used as the backbone, and its image representation is transferred to a video representation.
1. Extract frames from the video clip.
2. Encode the frames with the Video Encoder to get a sequence of features.
  - Extract non-overlapping image patches from the frames, then turn them into 1D tokens through a linear projection
  - Feed the 1D tokens into the transformer to get the final representation, which models the interaction between patches
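
The patch step above can be sketched roughly as follows. This is a toy version in plain Python, assuming a single-channel 8x8 frame, 4x4 patches, and a random projection matrix; real ViT-B/32 works on RGB frames with 32x32 patches and learned weights, and the helper names here are hypothetical.

```python
import random

def extract_patches(frame, patch):
    """Split an H x W frame into non-overlapping patch x patch blocks,
    flattening each block row by row into a 1D list."""
    height, width = len(frame), len(frame[0])
    assert height % patch == 0 and width % patch == 0
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = [frame[top + dy][left + dx]
                    for dy in range(patch) for dx in range(patch)]
            patches.append(flat)
    return patches

def linear_projection(patches, weight):
    """Map each flattened patch (length P*P) to a token of length D.
    weight is a D x (P*P) matrix stored as a list of rows."""
    return [[sum(w * x for w, x in zip(row, p)) for row in weight]
            for p in patches]

random.seed(0)
H, W, P, D = 8, 8, 4, 3  # toy sizes, not the real model's
frame = [[float(y * W + x) for x in range(W)] for y in range(H)]
weight = [[random.uniform(-1, 1) for _ in range(P * P)] for _ in range(D)]

patches = extract_patches(frame, P)
tokens = linear_projection(patches, weight)
print(len(patches), len(patches[0]))  # 4 16
print(len(tokens), len(tokens[0]))    # 4 3
```

The resulting list of tokens is what the transformer would take as input; the transformer itself is left out of this sketch.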

Two kinds of projection (2D linear, 3D linear) were used and compared here. 2D ignores the temporal information between frames, so 3D was introduced to enhance temporal feature extraction.
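
A small sketch of the difference in what each projection gets to see. The assumptions here are mine, for illustration only: single-channel frames, a temporal kernel of t=3 with temporal stride 1, and zero padding at the clip boundaries. Only the patch gathering is shown; the linear projection on top would be the same kind of matrix multiply in both cases.

```python
def patches_2d(frames, patch):
    """Per-frame patches: each flattened patch sees a single frame only."""
    out = []
    for frame in frames:
        height, width = len(frame), len(frame[0])
        per_frame = []
        for top in range(0, height, patch):
            for left in range(0, width, patch):
                per_frame.append([frame[top + dy][left + dx]
                                  for dy in range(patch) for dx in range(patch)])
        out.append(per_frame)
    return out

def patches_3d(frames, patch, t=3):
    """Tube patches: each flattened patch stacks the same spatial block
    from t neighbouring frames (odd t, temporal stride 1, zero padding)."""
    half = t // 2
    flat_2d = patches_2d(frames, patch)
    zero_block = [0.0] * (patch * patch)
    out = []
    for i in range(len(frames)):
        per_frame = []
        for k in range(len(flat_2d[i])):
            tube = []
            for j in range(i - half, i + half + 1):
                inside = 0 <= j < len(frames)
                tube.extend(flat_2d[j][k] if inside else zero_block)
            per_frame.append(tube)
        out.append(per_frame)
    return out

# Four toy 4x4 frames; every pixel of frame f holds the value f + 1.
frames = [[[float(f + 1)] * 4 for _ in range(4)] for f in range(4)]

p2 = patches_2d(frames, patch=2)
p3 = patches_3d(frames, patch=2, t=3)
print(len(p2[0]), len(p2[0][0]))  # 4 4
print(len(p3[0]), len(p3[0][0]))  # 4 12
print(p3[0][0])  # [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0]
```

Under these assumptions the number of tokens per frame stays the same, but each 3D patch also carries pixels from the previous and next frame, which is where the temporal information comes in.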

But except on LSMDC, 3D actually performs worse... why?? The authors' guess: CLIP was trained with 2D linear, not 3D linear, and the discrepant initialization of the 3D linear made it hard to learn temporal information. In future work they plan to try pretraining on a large-scale video-text dataset.

Word of the day

[ ] discrepant : not in agreement, contradictory, inconsistent
