pomelyu / paper-reading-notes


2021 (CLIP) Learning Transferable Visual Models From Natural Language Supervision #1


Introduction

In NLP, task-agnostic objectives such as autoregressive and masked language modeling, trained on large datasets, have enabled zero-shot transfer to downstream tasks. CLIP brings this idea to vision: it is trained with a contrastive objective on a large (400M) dataset of image-caption pairs, learning to match each image to its caption. After training, the learned image and text embeddings can be transferred to other tasks, for example zero-shot image classification.
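As a concrete illustration of this transfer, here is a minimal sketch of zero-shot classification, assuming the released openai/CLIP package; the model name, class labels, prompt template, and image path are illustrative assumptions, not from this note.

```python
# Zero-shot classification sketch (assumes the openai/CLIP package is installed)
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # model choice is an assumption

# Class names are turned into captions ("prompts") and encoded by the text encoder.
class_names = ["dog", "cat", "car"]  # illustrative labels
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # hypothetical image path

with torch.no_grad():
    image_emb = model.encode_image(image)
    text_emb = model.encode_text(prompts)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    # cosine similarity between the image and each class prompt, softmaxed into class probabilities
    probs = (100.0 * image_emb @ text_emb.T).softmax(dim=-1)

print(dict(zip(class_names, probs.squeeze(0).tolist())))
```

The class whose caption embedding is closest to the image embedding is taken as the prediction, so no task-specific training data is needed.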

Method

```python
# image_encoder - ResNet or Vision Transformer
# text_encoder  - CBOW or Text Transformer
# I[n, h, w, c] - minibatch of aligned images
# T[n, l]       - minibatch of aligned texts
# W_i[d_i, d_e] - learned proj of image to embed
# W_t[d_t, d_e] - learned proj of text to embed
# t             - learned temperature parameter

# extract feature representations of each modality
I_f = image_encoder(I)  #[n, d_i]
T_f = text_encoder(T)   #[n, d_t]
# joint multimodal embedding [n, d_e]
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)
# scaled pairwise cosine similarities [n, n]
logits = np.dot(I_e, T_e.T) * np.exp(t)
# symmetric loss function
labels = np.arange(n)
loss_i = cross_entropy_loss(logits, labels, axis=0)
loss_t = cross_entropy_loss(logits, labels, axis=1)
loss   = (loss_i + loss_t)/2
```
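For reference, a minimal runnable PyTorch version of the symmetric loss above. Only the loss computation mirrors the pseudocode; the encoders and projections are replaced with random features, and the batch size and embedding width are assumptions for illustration.

```python
# Runnable sketch of CLIP's symmetric contrastive loss in PyTorch
import numpy as np
import torch
import torch.nn.functional as F

def clip_loss(image_features, text_features, logit_scale):
    # joint multimodal embedding: L2-normalize each modality
    I_e = F.normalize(image_features, dim=-1)
    T_e = F.normalize(text_features, dim=-1)
    # scaled pairwise cosine similarities [n, n]
    logits_per_image = logit_scale.exp() * I_e @ T_e.t()
    logits_per_text = logits_per_image.t()
    # the i-th image and the i-th text form the positive pair
    labels = torch.arange(I_e.size(0), device=I_e.device)
    loss_i = F.cross_entropy(logits_per_image, labels)
    loss_t = F.cross_entropy(logits_per_text, labels)
    return (loss_i + loss_t) / 2

# toy usage: batch of 8 already-projected features with d_e = 512 (sizes are assumptions)
torch.manual_seed(0)
I_f = torch.randn(8, 512)
T_f = torch.randn(8, 512)
t = torch.nn.Parameter(torch.tensor(np.log(1 / 0.07)))  # learned log-temperature, initialized as in the paper
print(clip_loss(I_f, T_f, t).item())
```

Note that `t` is optimized as a log-scale parameter, so the effective temperature stays positive during training.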

Highlight

Limitation

Comments
