lhoyer / MIC

[CVPR23] Official Implementation of MIC: Masked Image Consistency for Context-Enhanced Domain Adaptation

Question about the ResNet in MIM #36

Closed Hugo-cell111 closed 1 year ago

Hugo-cell111 commented 1 year ago

Hi! In traditional masked image modeling (MIM) research, Transformer-based architectures are frequently used because the basic unit of an image is a patch, not a pixel. This seems somewhat at odds with convolutional networks such as ResNet, because a CNN operates through pixel-wise convolution rather than in a patch-wise style. In addition, the masked patches take part in the convolution, which leads to a distribution shift, and the ignored pixels introduce irrelevant information. So I wonder how a CNN can be adapted to MIM. I am looking forward to your reply. Thanks!

lhoyer commented 1 year ago

Hi @Hugo-cell111,

Thank you for your interest in our work! You are right that the masked patches (patches with RGB input values [0, 0, 0]) will be processed by the convolutions and could introduce a distribution shift. However, the advantage of making the network learn context relations (as detailed in the paper) outweighs the disadvantage of a potential distribution shift, as can be seen in the improvements reported when using MIC.

Best, Lukas
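
For readers following the thread, the patch-wise zero masking discussed above can be sketched roughly as follows. This is a minimal NumPy illustration, not the repository's code: the function name `patch_mask`, the patch size of 16, and the mask ratio of 0.5 are assumptions chosen for the example. It draws one keep/drop decision per patch, upsamples that grid to pixel resolution, and multiplies it into the image so that dropped patches become zeros, which a CNN then convolves over like any other input.

```python
import numpy as np


def patch_mask(height, width, patch_size, mask_ratio, rng):
    """Return a (height, width) mask: 1 keeps a pixel, 0 masks its whole patch.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    # Number of patches per axis (ceil division covers partial border patches).
    grid_h = -(-height // patch_size)
    grid_w = -(-width // patch_size)
    # One independent keep/drop decision per patch.
    keep = (rng.random((grid_h, grid_w)) > mask_ratio).astype(np.float32)
    # Upsample the patch grid to pixel resolution by repeating each entry.
    mask = np.repeat(np.repeat(keep, patch_size, axis=0), patch_size, axis=1)
    return mask[:height, :width]


rng = np.random.default_rng(0)
image = rng.random((3, 64, 64), dtype=np.float32)  # stand-in CHW image
mask = patch_mask(64, 64, patch_size=16, mask_ratio=0.5, rng=rng)

# Broadcasting applies the same spatial mask to every channel, so masked
# patches become all-zero inputs that the convolutions still process.
masked_image = image * mask
```

Because the mask is applied at the input, no change to the network architecture is needed; the trade-off is exactly the one raised in the question, namely that convolution windows near patch borders mix zeroed and visible pixels.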