Generalizes the Masked Language Modeling (MLM) objective from BERT to continuous domains such as images. Uses a contrastive prediction loss.
What is MLM:
Predict [MASK] in: I am going to [MASK] where I will learn mathematics from Prof. Ganguly.
Use a classification loss for the prediction: college
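As a concrete illustration of MLM-as-classification, here is a minimal sketch (toy sizes, random weights, a hypothetical 5-word vocabulary): the encoder's hidden vector at the [MASK] position is projected onto the vocabulary and scored with a cross-entropy loss against the true word.

```python
import numpy as np

# Toy sketch of MLM as classification; sizes and vocabulary are illustrative.
rng = np.random.default_rng(0)
hidden_dim, vocab_size = 8, 5
vocab = ["college", "school", "home", "work", "paris"]

h_mask = rng.normal(size=hidden_dim)            # encoder output at the [MASK] position
W = rng.normal(size=(vocab_size, hidden_dim))   # output projection onto the vocabulary

logits = W @ h_mask
probs = np.exp(logits - logits.max())
probs /= probs.sum()                            # softmax over vocabulary words

target = vocab.index("college")                 # the correct filler word
loss = -np.log(probs[target])                   # cross-entropy against the true word
```

In real MLM the vocabulary has tens of thousands of entries and `h_mask` comes from a trained transformer, but the loss computation has exactly this shape.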
Task used
Similar to BERT, we mask out a few patches in an image and try to reconstruct the original image. To enable the classification loss, we sample “distractor” patches from the same image, and ask the model to classify the right patch to fill in a target masked location.
On the encoder side, the output vectors produced by P are routed into the attention pooling network to summarize these representations into a single vector u.
On the decoder side, P creates output vectors h1, h2, h3 (for the three patches hidden from the encoder). The decoder then queries the encoder by adding to the output vector u the location embedding of a patch, selected at random among the patches in the decoder (e.g., location 4 in the image), to create a vector v.
The vector v is then used in a dot product to compute the similarity between v and each h. Given the dot products between v and the h's, the decoder must decide which patch best fills the chosen location (location 4 in the image above). A cross-entropy loss is applied to this classification task, and the encoder and decoder are trained jointly with gradients back-propagated from this loss.
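The steps above can be sketched end to end. This is a toy reconstruction under stated assumptions: sizes are arbitrary, all vectors are random stand-ins for learned representations, and the `attention_pool` helper is illustrative, not from the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                    # embedding dimension (toy size)

def attention_pool(X, q):
    """Summarize patch vectors X into one vector via attention with query q."""
    scores = X @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()                          # attention weights over patches
    return w @ X

visible = rng.normal(size=(6, d))         # encoder outputs for visible patches
u = attention_pool(visible, rng.normal(size=d))   # pooled summary vector u

h = rng.normal(size=(3, d))               # h1, h2, h3: P's outputs for masked patches
loc4 = rng.normal(size=d)                 # location embedding for the queried slot
v = u + loc4                              # decoder query for location 4

logits = h @ v                            # dot products between v and each h
p = np.exp(logits - logits.max())
p /= p.sum()                              # distribution over candidate patches
target = 1                                # suppose h2 is the patch truly at location 4
loss = -np.log(p[target])                 # cross-entropy for the classification task
```

In training, gradients of `loss` flow back through both the pooling network (encoder side) and the patch network P (decoder side), so the two are optimized jointly.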
Selfie: Self-supervised Pretraining for Image Embedding