[159] Long-CLIP: Unlocking the Long-Text Capability of CLIP

TL;DR

I read this because.. : someone I follow on GitHub starred the repo
task : CLIP with long context
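problem : CLIP is trained with its text input capped at 77 tokens, and of those only about 20 are effectively used.
idea : train a long-context CLIP. Interpolate the positional embedding (PE), but leave the 20 effective token positions untouched and interpolate only the rest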
input/output : {image, text} -> score
architecture : CLIP ViT-B/16, ViT-L/14
objective : infoNCE
baseline : CLIP
data : ShareGPT4V 1M
evaluation : ImageNet, COCO, Flickr retrieval, ShareGPT4V retrieval (long-context retrieval)
result : good quantitative results. It also seems to grasp context much better.
contribution :
etc. :
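
A rough sketch of the PE idea above, to make "keep the 20, interpolate the rest" concrete. The function name, the use of plain linear interpolation, and the stretch ratio of 4 are my assumptions for illustration (this note does not record them); I also assume the 20 effective positions are the first 20.

```python
import numpy as np

def stretch_positional_embedding(pe, keep=20, ratio=4):
    """Keep the first `keep` rows of `pe` as-is and linearly interpolate
    the remaining rows by `ratio` along the position axis.

    `keep` and `ratio` are illustrative assumptions, not values from the note.
    """
    head, tail = pe[:keep], pe[keep:]
    n_tail = tail.shape[0]
    # Fractional source position for every stretched target position.
    src = np.arange(n_tail * ratio) / ratio
    src = np.minimum(src, n_tail - 1)      # clamp past the last original row
    lo = np.floor(src).astype(int)         # left neighbour index
    hi = np.minimum(lo + 1, n_tail - 1)    # right neighbour index
    w = (src - lo)[:, None]                # interpolation weight in [0, 1)
    stretched = (1.0 - w) * tail[lo] + w * tail[hi]
    return np.concatenate([head, stretched], axis=0)

# Stand-in for CLIP's 77 x d text positional embedding.
pe = np.random.default_rng(0).normal(size=(77, 512))
long_pe = stretch_positional_embedding(pe)
print(long_pe.shape)  # (248, 512)
```

With these assumed values the new length is 20 + 57 * 4 = 248 positions, and the first 20 rows are bit-for-bit identical to the original embedding.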

Fine-grained alignment seems to be done the usual way.
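
For reference, "the usual way" here presumably means the standard symmetric CLIP-style InfoNCE loss between image features and (long) caption features. A minimal NumPy sketch under that assumption; the function name and the temperature of 0.07 are illustrative, not taken from the paper.

```python
import numpy as np

def info_nce(img, txt, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (image, text) features.

    Row i of `img` is assumed to be the positive pair of row i of `txt`.
    """
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature     # (B, B) cosine-similarity logits

    def ce_diag(l):
        # Cross-entropy where the correct class of row i is column i.
        l = l - l.max(axis=1, keepdims=True)   # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (ce_diag(logits) + ce_diag(logits.T))

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16))
loss_matched = info_nce(feats, feats)
loss_shuffled = info_nce(feats, np.roll(feats, 1, axis=0))
```

Correctly paired features should give a much lower loss than shuffled pairs, which is all the objective asks for.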
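Coarse-grained alignment: apply the PCE algorithm (run PCA, then keep the top 32 elements) to the image features, lower the amount subtracted as the threshold, and take the weighted sum of the selected eigenvectors using their eigenvalues. That reconstructed feature is what gets aligned with the short caption.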
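
A minimal sketch of the coarse-grained step as I understand it: decompose a batch of image features with PCA, keep the top 32 components, and rebuild the features from them. The function name and the SVD-based PCA are my assumptions, and the threshold adjustment is left out because this note does not pin down how it works.

```python
import numpy as np

def coarse_features(feats, k=32):
    """Rebuild `feats` (B, D) from its top-`k` principal components.

    Hypothetical helper; the paper's actual routine may differ in detail.
    """
    mean = feats.mean(axis=0, keepdims=True)
    centered = feats - mean
    # SVD of the centered batch: rows of Vt are the principal directions,
    # S holds the matching singular values (square roots of the eigenvalues
    # of the scatter matrix).
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    k = min(k, S.shape[0])
    # Weighted sum of the selected directions, then add the mean back.
    return mean + (U[:, :k] * S[:k]) @ Vt[:k]

rng = np.random.default_rng(0)
fine = rng.normal(size=(64, 128))   # stand-in for fine-grained image features
coarse = coarse_features(fine, k=32)
print(coarse.shape)  # (64, 128)
```

The coarse features keep the same shape as the fine-grained ones, so they can be fed to the same contrastive loss against short-caption features.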