naver-ai / cl-vs-mim

(ICLR 2023) Official PyTorch implementation of "What Do Self-Supervised Vision Transformers Learn?"

Code for the "simple linear combination of CL (MoCo) and MIM (SimMIM) objectives" #4

Open EddieAy opened 1 year ago

EddieAy commented 1 year ago

Can you provide the code to reproduce the results of this? Thank you!

ARE THE TWO METHODS COMPLEMENTARY TO EACH OTHER?

We present comparative analyses on CL and MIM from three perspectives: self-attentions, representation transforms, and the position of important layers. All of our results indicate that CL and MIM train ViTs differently. These differences naturally imply that combining CL and MIM to train a backbone may help leverage the advantages of both methods.

To show that CL and MIM are complementary, we introduce the simplest way to harmonize CL and MIM by linearly combining the two losses, i.e., L = (1 − λ) L_MIM + λ L_CL, where L_MIM and L_CL indicate the losses of MIM and CL, respectively, and λ is the importance weight of CL.
xxxnell commented 1 year ago

Hi @EddieAy, thank you for reaching out.

I'm currently working on a more advanced version of the hybrid model, but it will take some time before I can release it as open source. I'll need to look into whether I can distribute the original code, but in the meantime, let me try to describe the overall architecture with some pseudo-code:

```python
base = ViT(...)       # shared ViT encoder backbone
decoder = ViT(...)    # decoder head used by SimMIM for reconstruction
momentum = ViT(...)   # momentum encoder used by MoCo

simmim = SimMIM(encoder=base, decoder=decoder, ...)
moco = MoCo(base_encoder=base, momentum_encoder=momentum, ...)

...

# x1, x2, x3: augmented views of the same image; mask: random patch mask for SimMIM
loss_mim = simmim(x1, mask)
loss_cl = moco(x2, x3, momentum)

# linear combination of the two objectives: L = (1 − λ) L_MIM + λ L_CL
loss = (1.0 - lmda) * loss_mim + lmda * loss_cl
```

The code is based on SimMIM.

In summary, the hybrid architecture consists of a single ViT encoder backbone (base) and a momentum encoder, with three heads attached to them: one head for SimMIM (the decoder) and two heads for MoCo.
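
To make the training flow concrete, a single iteration could look roughly like the sketch below. Here `simmim`, `moco`, and `momentum` follow the pseudo-code above, while `optimizer`, `loader`, the augmentation helpers `mim_augment` / `cl_augment`, and the λ value are placeholders for illustration, not the original recipe:

```python
lmda = 0.5  # λ: importance weight of the CL loss (placeholder value, not the paper's setting)

for images in loader:                                # batches of unlabeled images
    x1, mask = mim_augment(images)                   # one augmented view + a random patch mask for SimMIM (placeholder helper)
    x2, x3 = cl_augment(images), cl_augment(images)  # two independently augmented views for MoCo (placeholder helper)

    loss_mim = simmim(x1, mask)                      # masked image modeling loss
    loss_cl = moco(x2, x3, momentum)                 # contrastive loss against the momentum encoder
    loss = (1.0 - lmda) * loss_mim + lmda * loss_cl  # L = (1 − λ) L_MIM + λ L_CL

    optimizer.zero_grad()                            # optimizer over the encoder and heads
    loss.backward()
    optimizer.step()
```

In practice, the momentum encoder is typically updated by an exponential moving average of the base encoder rather than by the optimizer.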