[152] Sigmoid Loss for Language Image Pre-Training

paper, code

TL;DR

I read this because.. : In the context of CLIPScore, I wondered whether SigLIP scores differ much from those of a softmax-trained model, and I was curious about the loss and its effect.
task : CLIP
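problem : The softmax inside the InfoNCE loss is unstable during training, and summing over the negative pairs in the denominator requires an all-gather, which makes training inefficient.
idea : Propose a sigmoid loss. Explained in more detail below.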
input/output : {image, text} -> score
architecture : ViT-B/16, (LiT setting) ViT-B/8, ViT-g/14
objective : Sigmoid Loss
baseline : CLIP, OpenCLIP, EVA-CLIP, CLIPA-v2
data : WebLI dataset using only English image and text pairs
evaluation : ImageNet-1k / COCO R@1
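result : Better performance than the baselines. The training data differs, though, lol. I did not look closely, but the number of steps was presumably matched.
contribution : Proposes the sigmoid loss, with a wide range of ablation experiments.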
etc. :

Details

Sigmoid Loss

Existing InfoNCE

Note that the summation is performed twice, once along each axis, to cover both image -> text and text -> image.
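For reference, the softmax-based contrastive loss discussed here can be written as below. This is a reconstruction from the standard CLIP-style formulation rather than a copy of the original figure. The notation is assumed: $\mathbf{x}_i$ and $\mathbf{y}_j$ are L2-normalized image and text embeddings, $t$ is the temperature, and $|\mathcal{B}|$ is the batch size. The first term normalizes over texts for a fixed image (image -> text), the second over images for a fixed text (text -> image).

$$
\mathcal{L}_{\text{softmax}} = -\frac{1}{2|\mathcal{B}|}\sum_{i=1}^{|\mathcal{B}|}\left(\log\frac{e^{t\,\mathbf{x}_i\cdot\mathbf{y}_i}}{\sum_{j=1}^{|\mathcal{B}|} e^{t\,\mathbf{x}_i\cdot\mathbf{y}_j}} + \log\frac{e^{t\,\mathbf{x}_i\cdot\mathbf{y}_i}}{\sum_{j=1}^{|\mathcal{B}|} e^{t\,\mathbf{x}_j\cdot\mathbf{y}_i}}\right)
$$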

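The proposed sigmoid loss. Here $z_{ij}$ is the label: 1 for a positive pair and -1 for a negative pair. Because negatives vastly outnumber positives, a learnable bias $b$ is introduced alongside the learnable temperature $t'$ to counter the imbalance; they are initialized to $\log 10$ and $-10$, respectively.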
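A minimal NumPy sketch of the pairwise sigmoid loss described above, written in the form $\mathcal{L} = -\frac{1}{|\mathcal{B}|}\sum_i\sum_j \log\sigma\big(z_{ij}(t\,\mathbf{x}_i\cdot\mathbf{y}_j + b)\big)$ with $t = \exp(t')$. The function and variable names (`sigmoid_loss`, `img_emb`, `txt_emb`) are illustrative assumptions, not taken from the official code, and the inputs are random placeholder embeddings.

```python
import numpy as np

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x)).
    return -np.logaddexp(0.0, -x)

def l2_normalize(v):
    # Normalize each row to unit length.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def sigmoid_loss(img_emb, txt_emb, t_prime, b):
    """Pairwise sigmoid loss over an [n, n] grid of image-text pairs."""
    n = img_emb.shape[0]
    t = np.exp(t_prime)                    # temperature, t = exp(t')
    zimg = l2_normalize(img_emb)
    ztxt = l2_normalize(txt_emb)
    logits = zimg @ ztxt.T * t + b         # [n, n] scaled similarities plus bias
    labels = 2.0 * np.eye(n) - 1.0         # z_ij: +1 on the diagonal, -1 elsewhere
    return -np.sum(log_sigmoid(labels * logits)) / n

# Placeholder embeddings; t' and b use the initial values from the notes.
rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
txt = rng.normal(size=(4, 8))
loss = sigmoid_loss(img, txt, t_prime=np.log(10.0), b=-10.0)
print(loss)
```

With $b = -10$ at initialization, every logit starts strongly negative, so the many negative pairs already incur a small loss and do not dominate the first optimization steps.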

At first glance, every negative still has to be computed, so it is not obvious how this differs from the softmax computation.

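With the batch chunked across devices like this, softmax needs an all_gather of the features in order to compute the denominator. With the sigmoid loss, negative pairs do enter the loss, but the term for a positive pair does not depend on any negative pair, so each chunk can simply be forwarded on its own, which is more efficient.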
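The chunked computation can be illustrated with a single-process simulation. This is only a sketch under stated assumptions: the "devices" are plain array chunks, the text-chunk rotation stands in for whatever device-to-device exchange a real implementation would use, and the names (`chunked_sigmoid_loss`, `block_loss_sum`, `num_devices`) are hypothetical. It shows that summing per-block losses reproduces the full-batch loss without ever materializing a full row of similarities for normalization.

```python
import numpy as np

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    return -np.logaddexp(0.0, -x)

def block_loss_sum(zimg, ztxt, t, b, positives_on_diagonal):
    # Unnormalized sigmoid loss for one block of image-text pairs.
    logits = zimg @ ztxt.T * t + b
    if positives_on_diagonal:
        labels = 2.0 * np.eye(zimg.shape[0]) - 1.0   # local positives on the diagonal
    else:
        labels = -np.ones_like(logits)               # block contains only negatives
    return -np.sum(log_sigmoid(labels * logits))

def full_sigmoid_loss(zimg, ztxt, t, b):
    # Reference: loss over the whole [n, n] similarity matrix at once.
    return block_loss_sum(zimg, ztxt, t, b, True) / zimg.shape[0]

def chunked_sigmoid_loss(zimg, ztxt, t, b, num_devices):
    # Assumes the batch size is divisible by num_devices.
    n = zimg.shape[0]
    img_chunks = np.split(zimg, num_devices)
    txt_chunks = np.split(ztxt, num_devices)
    total = 0.0
    for step in range(num_devices):
        # At each step, device d sees the text chunk that started on device (d + step).
        for d in range(num_devices):
            src = (d + step) % num_devices
            total += block_loss_sum(img_chunks[d], txt_chunks[src], t, b,
                                    positives_on_diagonal=(step == 0))
    return total / n

rng = np.random.default_rng(0)
zimg = rng.normal(size=(8, 16))
ztxt = rng.normal(size=(8, 16))
zimg /= np.linalg.norm(zimg, axis=-1, keepdims=True)
ztxt /= np.linalg.norm(ztxt, axis=-1, keepdims=True)

chunked = chunked_sigmoid_loss(zimg, ztxt, t=10.0, b=-10.0, num_devices=4)
full = full_sigmoid_loss(zimg, ztxt, t=10.0, b=-10.0)
print(np.isclose(chunked, full))
```

Each block contributes an independent partial sum, which is why no per-row normalizer has to be gathered across devices. A softmax loss cannot be split this way, because every positive term needs the denominator over the entire batch.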