Open EddieAy opened 1 year ago
Hi @EddieAy, thank you for reaching out.
I'm currently working on a more advanced version of the hybrid model, but it will take some time before I can release it as open source. I'll need to look into whether I can distribute the original code, but in the meantime, let me try to describe the overall architecture with some pseudo-code:
    base = ViT(...)
    decoder = ViT(...)
    momentum = ViT(...)
    simmim = SimMIM(encoder=base, decoder=decoder, ...)
    moco = MoCo(base_encoder=base, momentum_encoder=momentum, ...)
    ...
    loss_mim = simmim(x1, mask)
    loss_cl = moco(x2, x3, momentum)
    loss = (1.0 - lmda) * loss_mim + lmda * loss_cl
The code is based on SimMIM.
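To make the weighting in the last line of the pseudo-code concrete, here is a minimal sketch in plain Python. The function name `hybrid_loss`, the range check on `lmda`, and the numeric values are illustrative assumptions and are not taken from the original code; plain floats stand in for the loss tensors.

```python
def hybrid_loss(loss_mim: float, loss_cl: float, lmda: float) -> float:
    """Weighted sum of the masked-image-modelling and contrastive losses."""
    # Assumption: lmda lies in [0, 1] so the result is a convex combination.
    if not 0.0 <= lmda <= 1.0:
        raise ValueError("lmda is assumed to lie in [0, 1]")
    return (1.0 - lmda) * loss_mim + lmda * loss_cl


# Made-up loss values, only to show how lmda shifts the balance.
print(hybrid_loss(0.5, 2.0, 0.0))   # 0.5   (SimMIM term only)
print(hybrid_loss(0.5, 2.0, 1.0))   # 2.0   (MoCo term only)
print(hybrid_loss(0.5, 2.0, 0.25))  # 0.875
```

With `lmda = 0` only the SimMIM term contributes, and with `lmda = 1` only the MoCo term does; values in between trade one objective off against the other.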
In summary, the hybrid model architecture consisted of a single ViT encoder backbone (`base`) and a momentum encoder (`momentum`), with three heads attached to them: one head for SimMIM (`decoder`) and two heads for MoCo.
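The sharing described above can be sketched without any deep-learning library. Everything here is a hypothetical stand-in: `Encoder` is a toy class holding a flat weight list rather than a ViT, the dictionaries only mimic how `SimMIM` and `MoCo` receive their arguments, and `ema_update` assumes the hybrid model keeps the standard MoCo momentum rule, which the original post does not state.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Encoder:
    """Toy stand-in for a ViT: just a flat list of weights (hypothetical)."""
    weights: List[float] = field(default_factory=lambda: [0.0, 0.0])


def ema_update(momentum: Encoder, base: Encoder, m: float = 0.99) -> None:
    """Standard MoCo rule: momentum <- m * momentum + (1 - m) * base."""
    momentum.weights = [
        m * k + (1.0 - m) * q for k, q in zip(momentum.weights, base.weights)
    ]


base = Encoder([1.0, 2.0])
momentum = Encoder(list(base.weights))  # MoCo starts the momentum encoder as a copy

# Both objectives hold the same `base` object, so either loss updates one backbone.
simmim = {"encoder": base, "decoder": Encoder()}
moco = {"base_encoder": base, "momentum_encoder": momentum}
assert simmim["encoder"] is moco["base_encoder"]

base.weights = [3.0, 4.0]          # pretend an optimiser step changed the backbone
ema_update(momentum, base, m=0.5)  # the momentum encoder follows it slowly
print(momentum.weights)            # [2.0, 3.0]
```

The point of the sketch is only that `base` is one shared object, while `momentum` is a separate copy that tracks it through the moving average rather than through gradients.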
Can you provide the code to reproduce the results of this? Thank you!