pomelyu / paper-reading-notes


2021 (CLIP) Learning Transferable Visual Models From Natural Language Supervision #1


Introduction

In NLP, task-agnostic objectives such as autoregressive and masked language modeling, trained on large datasets, have enabled zero-shot transfer to downstream tasks. CLIP brings this idea to vision: it is trained with a contrastive objective on a large (400M) dataset of image-caption pairs, learning to match each image to its caption. After training, the learned image and text embeddings can be transferred to other tasks, for example zero-shot image classification.
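As a concrete illustration of this transfer, here is a minimal sketch of zero-shot classification, assuming the released openai/CLIP package; the model name, class labels, prompt template, and image path are illustrative assumptions, not from this note.

```python
# Zero-shot classification sketch (assumes the openai/CLIP package is installed)
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # model choice is an assumption

# Class names are turned into captions ("prompts") and encoded by the text encoder.
class_names = ["dog", "cat", "car"]  # illustrative labels
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # hypothetical image path

with torch.no_grad():
    image_emb = model.encode_image(image)
    text_emb = model.encode_text(prompts)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    # cosine similarity between the image and each class prompt, softmaxed into class probabilities
    probs = (100.0 * image_emb @ text_emb.T).softmax(dim=-1)

print(dict(zip(class_names, probs.squeeze(0).tolist())))
```

The class whose caption embedding is closest to the image embedding is taken as the prediction, so no task-specific training data is needed.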

Method

```python
# image_encoder - ResNet or Vision Transformer
# text_encoder  - CBOW or Text Transformer
# I[n, h, w, c] - minibatch of aligned images
# T[n, l]       - minibatch of aligned texts
# W_i[d_i, d_e] - learned proj of image to embed
# W_t[d_t, d_e] - learned proj of text to embed
# t             - learned temperature parameter

# extract feature representations of each modality
I_f = image_encoder(I)  #[n, d_i]
T_f = text_encoder(T)   #[n, d_t]
# joint multimodal embedding [n, d_e]
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)
# scaled pairwise cosine similarities [n, n]
logits = np.dot(I_e, T_e.T) * np.exp(t)
# symmetric loss function
labels = np.arange(n)
loss_i = cross_entropy_loss(logits, labels, axis=0)
loss_t = cross_entropy_loss(logits, labels, axis=1)
loss   = (loss_i + loss_t)/2
```
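For reference, a minimal runnable PyTorch version of the symmetric loss above. Only the loss computation mirrors the pseudocode; the encoders and projections are replaced with random features, and the batch size and embedding width are assumptions for illustration.

```python
# Runnable sketch of CLIP's symmetric contrastive loss in PyTorch
import numpy as np
import torch
import torch.nn.functional as F

def clip_loss(image_features, text_features, logit_scale):
    # joint multimodal embedding: L2-normalize each modality
    I_e = F.normalize(image_features, dim=-1)
    T_e = F.normalize(text_features, dim=-1)
    # scaled pairwise cosine similarities [n, n]
    logits_per_image = logit_scale.exp() * I_e @ T_e.t()
    logits_per_text = logits_per_image.t()
    # the i-th image and the i-th text form the positive pair
    labels = torch.arange(I_e.size(0), device=I_e.device)
    loss_i = F.cross_entropy(logits_per_image, labels)
    loss_t = F.cross_entropy(logits_per_text, labels)
    return (loss_i + loss_t) / 2

# toy usage: batch of 8 already-projected features with d_e = 512 (sizes are assumptions)
torch.manual_seed(0)
I_f = torch.randn(8, 512)
T_f = torch.randn(8, 512)
t = torch.nn.Parameter(torch.tensor(np.log(1 / 0.07)))  # learned log-temperature, initialized as in the paper
print(clip_loss(I_f, T_f, t).item())
```

Note that `t` is optimized as a log-scale parameter, so the effective temperature stays positive during training.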

Highlight

Limitation

Comments
