naver-ai / cl-vs-mim

(ICLR 2023) Official PyTorch implementation of "What Do Self-Supervised Vision Transformers Learn?"

Code for the "simple linear combination of CL (MoCo) and MIM (SimMIM) objectives" #4

Open EddieAy opened 1 year ago

EddieAy commented 1 year ago

Can you provide the code to reproduce the results of this? Thank you!

ARE THE TWO METHODS COMPLEMENTARY TO EACH OTHER?

We present comparative analyses on CL and MIM from three perspectives: self-attentions, representation transforms, and the position of important layers. All of our results indicate that CL and MIM train ViTs differently. These differences naturally imply that combining CL and MIM to train a backbone may help leverage the advantages of both methods.

To show that CL and MIM are complementary, we introduce the simplest way to harmonize CL and MIM by linearly combining the two losses, i.e., L = (1 − λ) L_MIM + λ L_CL, where L_MIM and L_CL indicate the losses of MIM and CL, respectively, and λ is the importance weight of CL.
xxxnell commented 1 year ago

Hi @EddieAy, thank you for reaching out.

I'm currently working on a more advanced version of the hybrid model, but it will take some time before I can release it as open source. I'll need to look into whether I can distribute the original code, but in the meantime, let me try to describe the overall architecture with some pseudo-code:

```python
base = ViT(...)       # shared ViT encoder backbone
decoder = ViT(...)    # decoder head used by SimMIM for reconstruction
momentum = ViT(...)   # momentum encoder used by MoCo

simmim = SimMIM(encoder=base, decoder=decoder, ...)
moco = MoCo(base_encoder=base, momentum_encoder=momentum, ...)

...

# x1, x2, x3: augmented views of the same image; mask: random patch mask for SimMIM
loss_mim = simmim(x1, mask)
loss_cl = moco(x2, x3, momentum)

# linear combination of the two objectives: L = (1 − λ) L_MIM + λ L_CL
loss = (1.0 - lmda) * loss_mim + lmda * loss_cl
```

The code is based on SimMIM.

In summary, the hybrid architecture consists of a single ViT encoder backbone (base) and a momentum encoder, with three heads attached to them: one head for SimMIM (the decoder) and two heads for MoCo.
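
To make the training flow concrete, a single iteration could look roughly like the sketch below. Here `simmim`, `moco`, and `momentum` follow the pseudo-code above, while `optimizer`, `loader`, the augmentation helpers `mim_augment` / `cl_augment`, and the λ value are placeholders for illustration, not the original recipe:

```python
lmda = 0.5  # λ: importance weight of the CL loss (placeholder value, not the paper's setting)

for images in loader:                                # batches of unlabeled images
    x1, mask = mim_augment(images)                   # one augmented view + a random patch mask for SimMIM (placeholder helper)
    x2, x3 = cl_augment(images), cl_augment(images)  # two independently augmented views for MoCo (placeholder helper)

    loss_mim = simmim(x1, mask)                      # masked image modeling loss
    loss_cl = moco(x2, x3, momentum)                 # contrastive loss against the momentum encoder
    loss = (1.0 - lmda) * loss_mim + lmda * loss_cl  # L = (1 − λ) L_MIM + λ L_CL

    optimizer.zero_grad()                            # optimizer over the encoder and heads
    loss.backward()
    optimizer.step()
```

In practice, the momentum encoder is typically updated by an exponential moving average of the base encoder rather than by the optimizer.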