[ICCV 2019] HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

uhhyunjoo commented 2 years ago

link
paper	HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
code	papers with code
etc	official web page

uhhyunjoo commented 2 years ago

Abstract

Proposes a text-video dataset named HowTo100M
A large-scale dataset that was collected quickly, without manual annotation work
A text-video embedding model trained on this dataset achieves SOTA on text-to-video retrieval and action localization
Transfers well to other domains : pre-training on HowTo100M and then fine-tuning on a dataset from another domain gives better performance

uhhyunjoo commented 2 years ago

Dataset

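Activities are taken from WikiHow, then abstract categories and non-physical verbs are filtered out → only visual tasks remain
Search YouTube for 'how to ${task_name}' (e.g. how to paint furniture) and keep only videos that have English subtitles
Videos with fewer than 100 views, fewer than 100 words in the subtitles, a duration over 2000 seconds, or a duplicate YouTube ID are excluded
caption : each line of the subtitles
clip : the part of the video that corresponds to the caption's time interval
On average, one video yields roughly 100 clip-caption pairs, one clip is about 4 seconds long, and one caption contains about 4 words
The clip-caption pairs come from subtitles rather than manual annotation, so the dataset is called a 'weakly-paired dataset'
Compared with other video description datasets, the collection process is efficient and the scale is large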
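
The filtering rules and the clip/caption definitions above can be sketched in a few lines. This is only an illustration of the rules as written in these notes, not the authors' collection script: the `Video` and `SubtitleLine` records and their field names are hypothetical, and word counting by whitespace split is an assumption.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


# Hypothetical record types; the field names are illustrative assumptions.
@dataclass
class SubtitleLine:
    start: float  # seconds
    end: float    # seconds
    text: str


@dataclass
class Video:
    youtube_id: str
    views: int
    duration: float  # seconds
    subtitles: List[SubtitleLine]


def keep_video(video: Video, seen_ids: Set[str]) -> bool:
    """Apply the four exclusion rules listed in the notes."""
    n_words = sum(len(line.text.split()) for line in video.subtitles)
    if video.views < 100:           # fewer than 100 views
        return False
    if n_words < 100:               # fewer than 100 words in the subtitles
        return False
    if video.duration > 2000:       # longer than 2000 seconds
        return False
    if video.youtube_id in seen_ids:  # duplicate YouTube ID
        return False
    seen_ids.add(video.youtube_id)
    return True


def clip_caption_pairs(video: Video) -> List[Tuple[Tuple[float, float], str]]:
    """One pair per subtitle line: (clip time interval, caption text)."""
    return [((line.start, line.end), line.text) for line in video.subtitles]
```

Because a pair is just a subtitle line and its time interval, nothing guarantees that the narration describes what is on screen at that moment, which is why the pairing is only "weak".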

Table 1

Model

A model for learning a joint text-video embedding from the clip-caption dataset
V → v → f(v)
- Concatenate the 2D and 3D features extracted from the video clip to form the video feature v, then embed it into d dimensions with f
- 2D CNN : ImageNet pre-trained ResNet-152 (rate : 1 frame per second) → frame-level features
- 3D CNN : Kinetics pre-trained ResNeXt-101 16-frames model (1.5 features per second) → video-level features
C → c → g(c)
- Drop stop-words from the caption and apply Word2Vec to each remaining word to form the caption feature c, then embed it into d dimensions with g
- Word2Vec : GoogleNews pre-trained word2vec embedding model → word representation

Embedding function

f, g : the non-linear function used in a previous paper
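
For reference, here is a minimal sketch of what such a non-linear embedding could look like. It assumes the gated form f(v) = (W1 v + b1) ∘ σ(W2 (W1 v + b1) + b2) from the earlier "Learning a Text-Video Embedding from Incomplete and Heterogeneous Data" paper. The formula is written from memory and should be checked against the paper. The helper names (`affine`, `gated_embedding`) are made up for this note, and plain Python lists are used only to keep the example dependency-free.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def affine(W: Matrix, x: Vector, b: Vector) -> Vector:
    """Compute W x + b for a row-major matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gated_embedding(x: Vector, W1: Matrix, b1: Vector,
                    W2: Matrix, b2: Vector) -> Vector:
    """Assumed form: h = W1 x + b1, output = h * sigmoid(W2 h + b2) elementwise."""
    h = affine(W1, x, b1)                           # linear projection to d dims
    gate = [sigmoid(z) for z in affine(W2, h, b2)]  # per-dimension gate in (0, 1)
    return [hi * gi for hi, gi in zip(h, gate)]
```

With `W2` and `b2` all zeros every gate is 0.5, so the output is simply half of the linear projection. The same function shape would be used for both f (video) and g (text), each with its own parameters.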

Similarity

cosine similarity is used
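
As a quick reminder of what is being computed between the embedded clip f(v) and the embedded caption g(c), here is a dependency-free sketch of cosine similarity. The `eps` guard for zero vectors is an addition for this note, not something stated in the paper.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float],
                      eps: float = 1e-8) -> float:
    """Dot product of a and b divided by the product of their L2 norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # eps avoids division by zero when either vector is all zeros.
    return dot / max(norm_a * norm_b, eps)
```

The score only depends on the angle between the two vectors, so rescaling either embedding leaves it unchanged.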

Loss

One batch contains B clip-caption pairs
A max-margin ranking loss is used
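
A minimal sketch of a max-margin ranking loss over one batch, assuming the usual bidirectional form: for each positive pair i and each negative j, penalize max(0, δ + s(i, j) − s(i, i)) and max(0, δ + s(j, i) − s(i, i)). The exact formula and the margin value δ = 0.1 are written from memory and should be checked against the paper. The function name and the `negatives` argument are made up for this note, and by default every other pair in the batch is treated as a negative, which is a simplification.

```python
from typing import List, Optional, Sequence


def max_margin_ranking_loss(sim: Sequence[Sequence[float]],
                            negatives: Optional[Sequence[Sequence[int]]] = None,
                            margin: float = 0.1) -> float:
    """sim[i][j] is the similarity between clip i and caption j.

    The diagonal sim[i][i] holds the positive pairs. negatives[i] lists the
    indices N(i); if omitted, all j != i are used (an assumption of this sketch).
    """
    batch_size = len(sim)
    loss = 0.0
    for i in range(batch_size):
        if negatives is not None:
            neg_i: List[int] = list(negatives[i])
        else:
            neg_i = [j for j in range(batch_size) if j != i]
        for j in neg_i:
            # Clip i paired with the wrong caption j.
            loss += max(0.0, margin + sim[i][j] - sim[i][i])
            # Wrong clip j paired with caption i.
            loss += max(0.0, margin + sim[j][i] - sim[i][i])
    return loss
```

The loss is zero once every positive pair scores at least `margin` higher than all of its negatives, in both the clip-to-caption and caption-to-clip directions.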

Sampling strategy

Intra-negative sampling is applied to define N(i), the set of negative clip-caption pairs for clip-caption pair i
Half are clip-caption pairs taken from the same YouTube video (excluding pair i itself!) → intra-negative sampling
The other half are clip-caption pairs taken from different YouTube videos → inter-negative sampling
The goal is to ensure that the learned embedding focuses on the relevant parts of the video clip rather than on irrelevant background features
The Appendix also contains an empirical analysis of the positive pair sampling strategy
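
The half-and-half negative sampling described above could be sketched as follows. This is an illustration, not the authors' code: the function name and arguments are hypothetical, and topping up with inter-video negatives when a video has too few other clips is a design choice made only for this sketch.

```python
import random
from typing import List, Optional, Sequence


def sample_negatives(i: int, video_ids: Sequence[str], num_negatives: int,
                     rng: Optional[random.Random] = None) -> List[int]:
    """Return indices of negatives for pair i.

    video_ids[j] is the YouTube ID that pair j was cut from. About half of the
    negatives come from the same video as pair i (intra), the rest from other
    videos (inter).
    """
    rng = rng or random.Random()
    same_video = [j for j, vid in enumerate(video_ids)
                  if vid == video_ids[i] and j != i]  # pair i itself is excluded
    other_video = [j for j, vid in enumerate(video_ids) if vid != video_ids[i]]

    n_intra = min(num_negatives // 2, len(same_video))
    # If there are not enough intra candidates, fill up with inter negatives.
    n_inter = min(num_negatives - n_intra, len(other_video))
    return rng.sample(same_video, n_intra) + rng.sample(other_video, n_inter)
```

Intra-video negatives share the same background and scene as the positive clip, so the model cannot separate them using background cues alone.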

Experiments

Table 3

Using negatives taken from the same video (intra-negatives) helps
The gain is larger on the fine-grained datasets YouCook2 and CrossTask than on MSR-VTT and LSMDC

Table 4

Surpasses the SOTA even though the model was not designed for step localization
Shows that the learned model is not biased toward a specific domain
A large, weakly-paired dataset is a better training set than a small, carefully annotated one

Table 5

YouCook2 < HowTo100M < {PT: HowTo100M, FT : YouCook2}
Attributed to HowTo100M being similar in content to YouCook2 while being much larger in scale
YouCook2 consists of cooking-related instructional videos

Table 6

HowTo100M < MSR-VTT < {PT: HowTo100M, FT : MSR-VTT}
Attributed to HowTo100M differing in content from MSR-VTT
MSR-VTT consists of generic YouTube videos

Performance increases as the amount of training data increases
No saturation was observed, so collecting more data is expected to bring further improvement

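Note on the term: in neural-network terms, saturation usually means that the gradient of an activation function approaches 0 so that the weights stop being updated (a kind of vanishing gradient). The sigmoid function, which this paper uses as the activation function, has the drawback that it can saturate in that sense. In the sentence above, however, saturation most likely refers to the performance curve flattening out as more training data is added, which is a different thing.

🤔 Questions from the first reading: is the amount of data directly related to activation saturation at all? Isn't vanishing gradient more directly related to the number of layers? And why does this model use a sigmoid, which can saturate, rather than a function like ReLU?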

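In {PT: HowTo100M, FT : MSR-VTT % }, performance improves as the fraction of MSR-VTT used for fine-tuning increases
{PT : HowTo100M, FT : MSR-VTT-20%} reaches the then-SOTA performance obtained by training on MSR-VTT only, and using more of MSR-VTT surpasses it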

Table 7

HowTo100M < LSMDC < {PT: HowTo100M, FT : LSMDC}
Attributed to HowTo100M differing in content from LSMDC
LSMDC consists of movie clips

Cross-dataset evaluation was run while varying the PT and FT datasets
Performance was best with PT : HowTo100M

Qualitative results

uhhyunjoo commented 2 years ago

Things to look at next

Papers related to the embedding model

[arXiv 2018] Learning a Text-Video Embedding from Incomplete and Heterogeneous Data

Papers related to the max-margin ranking loss

Papers related to the negative sampling strategy

[ICCV 2017] Localizing Moments in Video With Natural Language

The empirical analysis of the positive pair sampling strategy in the Appendix

Supplementary Material

uhhyunjoo commented 2 years ago

ViT model details

Reducing the patch size increases the effective sequence length.
This gives surprisingly robust improvements without introducing new parameters.
This suggests that compute is a better predictor of performance than the number of parameters, and that scaling, if done at all, should emphasize depth over width.
```
These findings suggest that compute might be a better predictor of performance than the number of parameters, and that scaling should emphasize depth over width if any.
```

In this figure, the number of patches is the sequence length...!
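
The relation between patch size and sequence length is simple arithmetic: a square image of side H split into non-overlapping P × P patches gives (H / P)² patches. A small sketch, assuming square images and ignoring any extra class token (the function name is made up for this note):

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    return (image_size // patch_size) ** 2


# Smaller patches -> longer sequence, with the same image and the same weights.
for patch_size in (32, 16, 14):
    print(patch_size, num_patches(224, patch_size))
# prints:
# 32 49
# 16 196
# 14 256
```

Halving the patch side quadruples the sequence length, so compute grows a lot even though the parameter count barely changes.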

uhhyunjoo commented 2 years ago

Features

2D CNN : ImageNet pre-trained ResNet-152 (rate : 1 frame per second) → frame-level features (2048-dim)

3D CNN : Kinetics pre-trained ResNeXt-101 16-frames model (1.5 features per second) → video-level features (2048-dim)

```python
import random

import torch as th
import torch.nn.functional as F


# Method of the dataset class (the class definition is not shown in these notes).
def __getitem__(self, idx):
    vid = self.csv['video_id'].values[idx]  # video id
    # Pick one of this video's captions at random.
    rind = random.randint(0, len(self.sentences[vid]) - 1)
    sentence = self.sentences[vid][rind]
    # L2-normalize the 2D and 3D features separately before concatenating.
    feat_2d = F.normalize(self.features['2d'][vid].float(), dim=0)
    feat_3d = F.normalize(self.features['3d'][vid].float(), dim=0)
    # print('feat:', self.features['2d'][vid].float().shape, self.features['3d'][vid].float().shape)
    # feat: torch.Size([2048]) torch.Size([2048])
    video = th.cat((feat_2d, feat_3d))
    caption = self._words_to_we(self._tokenize_text(sentence))

    # print(feat_2d.shape, feat_3d.shape, video.shape, caption.shape)
    # torch.Size([2048]) torch.Size([2048]) torch.Size([4096]) torch.Size([20, 300])

    # print(sentence, self._tokenize_text(sentence), caption.shape)
    # a rock band is performng at a concert in a club
    # ['a', 'rock', 'band', 'is', 'performng', 'at', 'a', 'concert', 'in', 'a', 'club']
    # torch.Size([20, 300])

    return {'video': video, 'text': caption}
```
