[132] Hyperbolic Image-Text Representations

paper, code

TL;DR

I read this because.. : it was mentioned. A single image can be described by several different texts. What about that ambiguity?! (Song Kang-ho, male actor, man)
task : contrastive learning
problem : text can describe a single image at many levels of specificity (a dog standing on snow, a puppy, "so cute~")
idea : move CLIP's embedding space from Euclidean space to hyperbolic space
input/output : image/text -> score
architecture : same as CLIP
objective : contrastive + entailment loss
baseline : CLIP trained on YFCC-100M (by SLIP)
data : YFCC-100M
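evaluation : image-text retrieval, zero-shot image classification
result : improved performance. Also shows that, starting from a given image and traversing toward [ROOT], the retrieved text becomes progressively more generic.
contribution : probably the first work to do CLIP in hyperbolic space?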
etc. :

Details

Motivation

Arch

Lifting embeddings onto the hyperboloid

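After passing through the CLIP encoders, each image and each text comes out as an n-dimensional vector. A transformation that appends a zero is then applied: $v =[v_{enc}, 0]\in\mathbb{R}^{n+1}$. This $v$ lies in the tangent space at the hyperboloid origin O, because it satisfies the condition that its Lorentzian inner product with O is 0. As a result, only the space components of the Lorentz model need to be computed. In that case the exponential map for this vector (the map that projects vectors from the tangent space onto the manifold) simplifies as follows.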
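The lifting step described above can be sketched numerically. This is a minimal pure-Python sketch, not the authors' code. It assumes the standard exponential map of the Lorentz model evaluated at the origin, where the space part is scaled by $\sinh(\sqrt{c}\,\|v\|)/(\sqrt{c}\,\|v\|)$ and the time part is recovered from the hyperboloid constraint. The function names `exp_map_origin` and `lorentz_inner` and the curvature parameter `c` are hypothetical names chosen for illustration; the notes do not specify them.

```python
import math


def exp_map_origin(v_enc, c=1.0):
    """Lift a Euclidean encoder output onto the hyperboloid of curvature -c.

    v_enc is treated as the space part of a tangent vector [v_enc, 0] at the
    origin. Returns (x_space, x_time). The value of c is an assumption.
    """
    sqrt_c = math.sqrt(c)
    norm = math.sqrt(sum(a * a for a in v_enc))
    if norm < 1e-8:
        scale = 1.0  # sinh(t) / t tends to 1 as t tends to 0
    else:
        scale = math.sinh(sqrt_c * norm) / (sqrt_c * norm)
    x_space = [scale * a for a in v_enc]
    # The time component follows from the constraint <x, x>_L = -1/c.
    x_time = math.sqrt(1.0 / c + sum(a * a for a in x_space))
    return x_space, x_time


def lorentz_inner(x_space, x_time, y_space, y_time):
    """Lorentzian inner product: Euclidean on space parts minus time product."""
    return sum(a * b for a, b in zip(x_space, y_space)) - x_time * y_time


if __name__ == "__main__":
    x_space, x_time = exp_map_origin([0.3, -0.4], c=1.0)
    # A lifted point should satisfy <x, x>_L = -1/c (here -1).
    print(lorentz_inner(x_space, x_time, x_space, x_time))
```

Only the space components are computed explicitly; the time component is derived afterwards, which is why working in the tangent space at the origin keeps the lifting cheap.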