Generalizes the Masked Language Modeling (MLM) objective from BERT to continuous domains such as images. Uses a contrastive prediction loss.
What is MLM:
Predict [MASK] in: I am going to [MASK] where I will learn mathematics from Prof. Ganguly.
Use a classification loss for the prediction: college
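As a concrete illustration of MLM-as-classification, here is a minimal sketch (toy sizes, random weights, a hypothetical 5-word vocabulary): the encoder's hidden vector at the [MASK] position is projected onto the vocabulary and scored with a cross-entropy loss against the true word.

```python
import numpy as np

# Toy sketch of MLM as classification; sizes and vocabulary are illustrative.
rng = np.random.default_rng(0)
hidden_dim, vocab_size = 8, 5
vocab = ["college", "school", "home", "work", "paris"]

h_mask = rng.normal(size=hidden_dim)            # encoder output at the [MASK] position
W = rng.normal(size=(vocab_size, hidden_dim))   # output projection onto the vocabulary

logits = W @ h_mask
probs = np.exp(logits - logits.max())
probs /= probs.sum()                            # softmax over vocabulary words

target = vocab.index("college")                 # the correct filler word
loss = -np.log(probs[target])                   # cross-entropy against the true word
```

In real MLM the vocabulary has tens of thousands of entries and `h_mask` comes from a trained transformer, but the loss computation has exactly this shape.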
Task used
Similar to BERT, we mask out a few patches in an image and try to reconstruct the original image. To enable the classification loss, we sample “distractor” patches from the same image, and ask the model to classify the right patch to fill in a target masked location.
On the encoder side, the output vectors produced by P are routed into the attention pooling network to summarize these representations into a single vector u.
On the decoder side, P creates output vectors h1, h2, h3 (for the three patches hidden from the encoder). The decoder then queries the encoder by adding to the output vector u the location embedding of a patch, selected at random among the patches in the decoder (e.g., location 4 in the image), to create a vector v.
The vector v is then used in a dot product to compute the similarity between v and each h. Given the dot products between v and the h's, the decoder must decide which patch best fills the chosen location (location 4 in the image above). A cross-entropy loss is applied to this classification task, and the encoder and decoder are trained jointly with gradients back-propagated from this loss.
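The steps above can be sketched end to end. This is a toy reconstruction under stated assumptions: sizes are arbitrary, all vectors are random stand-ins for learned representations, and the `attention_pool` helper is illustrative, not from the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                    # embedding dimension (toy size)

def attention_pool(X, q):
    """Summarize patch vectors X into one vector via attention with query q."""
    scores = X @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()                          # attention weights over patches
    return w @ X

visible = rng.normal(size=(6, d))         # encoder outputs for visible patches
u = attention_pool(visible, rng.normal(size=d))   # pooled summary vector u

h = rng.normal(size=(3, d))               # h1, h2, h3: P's outputs for masked patches
loc4 = rng.normal(size=d)                 # location embedding for the queried slot
v = u + loc4                              # decoder query for location 4

logits = h @ v                            # dot products between v and each h
p = np.exp(logits - logits.max())
p /= p.sum()                              # distribution over candidate patches
target = 1                                # suppose h2 is the patch truly at location 4
loss = -np.log(p[target])                 # cross-entropy for the classification task
```

In training, gradients of `loss` flow back through both the pooling network (encoder side) and the patch network P (decoder side), so the two are optimized jointly.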
Selfie: Self-supervised Pretraining for Image Embedding