Closed Hugo-cell111 closed 1 year ago
Hi @Hugo-cell111,
Thank you for your interest in our work! You are right that the masked patches (patches with RGB input values [0, 0, 0]) are processed by the convolutions and could introduce a distribution shift. However, the advantage of making the network learn context relations (as detailed in the paper) outweighs the disadvantage of a potential distribution shift, as can be seen in the improvements reported when using MIC.
Best, Lukas
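For readers following the thread, the masking discussed above can be sketched roughly as follows. This is only an illustrative sketch, not the MIC implementation: the helper names (`make_patch_mask`, `apply_mask`, `box_filter_3x3`), the patch size, and the mask ratio are all assumptions made for the example, and a 3x3 mean filter stands in for a learned convolution. It shows that zeroed patches still enter the convolution, so outputs near a mask boundary mix real pixels with zeros.

```python
import random


def make_patch_mask(height, width, patch_size, mask_ratio, rng):
    """Return an H x W grid of 0/1, where 0 marks pixels of a masked patch."""
    rows = (height + patch_size - 1) // patch_size
    cols = (width + patch_size - 1) // patch_size
    # One keep/drop decision per patch, then expand it to pixel resolution.
    keep = [[0 if rng.random() < mask_ratio else 1 for _ in range(cols)]
            for _ in range(rows)]
    return [[keep[y // patch_size][x // patch_size] for x in range(width)]
            for y in range(height)]


def apply_mask(image, mask):
    """Set every channel of a masked pixel to 0 (image is H x W x C lists)."""
    return [[[value * mask[y][x] for value in pixel]
             for x, pixel in enumerate(row)]
            for y, row in enumerate(image)]


def box_filter_3x3(channel):
    """3x3 mean filter with zero padding, a stand-in for one conv layer."""
    h, w = len(channel), len(channel[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        total += channel[yy][xx]
            out[y][x] = total / 9.0
    return out


# Hypothetical toy input: an 8x8 all-ones RGB image, 4x4 patches.
rng = random.Random(0)
image = [[[1.0, 1.0, 1.0] for _ in range(8)] for _ in range(8)]
mask = make_patch_mask(8, 8, patch_size=4, mask_ratio=0.5, rng=rng)
masked = apply_mask(image, mask)

# Filter one channel: responses next to a masked patch are pulled toward 0.
red = [[pixel[0] for pixel in row] for row in masked]
filtered = box_filter_3x3(red)
```

In this toy setting, pixels well inside an unmasked patch keep their original response, while pixels within one step of a masked patch see a mixture of real values and zeros. That mixture is the potential distribution shift raised in the question.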
Hi! In traditional MIM research, Transformer-based architectures are frequently used because their basic unit of an image is a patch, not a pixel. This seems somewhat at odds with convolutional networks such as ResNet, because a CNN applies convolutions pixel-wise rather than patch-wise. In addition, the masked patches participate in the convolutions, which can lead to a distribution shift, and the masked-out pixels introduce irrelevant information. So I wonder how a CNN can be adapted to MIM. I am looking forward to your reply. Thanks!