microsoft / SimMIM

This is an official implementation for "SimMIM: A Simple Framework for Masked Image Modeling".
https://arxiv.org/abs/2111.09886
MIT License

Why mask embedded feature maps but not input images? #10

Closed: baibaidj closed this issue 2 years ago

baibaidj commented 2 years ago

Hi! This is wonderful work. After reading the paper, my intuition was that the input image would be masked for reconstruction. But while reading the code, I was surprised to find that you do not mask the input image directly; instead, you mask the feature maps output by Patch_Embed. See here: https://github.com/microsoft/SimMIM/blob/519ae7b0999b9d720daa61e3848cd41b8fbd9978/models/simmim.py#L36

I was wondering what the reasoning behind this approach is, and how it differs from the more intuitive alternative of masking the original image directly. Thank you.
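For readers following along, here is a minimal sketch of the pattern the linked line implements (not the repository's exact code; names such as `MaskedPatchEmbedSketch` are illustrative): the full image is embedded into patch tokens first, and only then is the mask applied, by blending a learnable mask token into the masked positions.

```python
import torch
import torch.nn as nn

class MaskedPatchEmbedSketch(nn.Module):
    """Sketch: embed the full image into patch tokens, then mask in feature space."""

    def __init__(self, patch_size=4, in_chans=3, embed_dim=96):
        super().__init__()
        # non-overlapping patch embedding, as in a Swin/ViT stem
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        # a single learnable vector shared by every masked position
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

    def forward(self, x, mask):
        # x: (B, 3, H, W); mask: (B, L), 1 = masked patch, 0 = visible
        x = self.proj(x).flatten(2).transpose(1, 2)     # (B, L, C) patch tokens
        B, L, _ = x.shape
        mask_tokens = self.mask_token.expand(B, L, -1)  # (B, L, C)
        w = mask.unsqueeze(-1).type_as(mask_tokens)     # (B, L, 1)
        # keep visible tokens; swap masked ones for the learnt token
        return x * (1. - w) + mask_tokens * w
```

With this ordering, the network never sees a literal blacked-out square; masked positions carry a learnt embedding instead.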

ancientmooner commented 2 years ago

I assume the approach you mentioned would work similarly well, but we did not ablate the two thoroughly. Using mask embeddings rather than masked patches is inspired by BERT (masked language modeling), which does not spell out "[mask]" to form BPE tokens but uses learnt embeddings instead.

caoyue10 commented 2 years ago

I can offer something from my perspective. If you perform masking directly on the images, one solution is to set the masked area to black. The disadvantage of this approach is that the network cannot tell whether a black region indicates a masked area or genuinely black content. The current solution, borrowed from BERT, of replacing the masked area with a learnable mask token makes more sense.
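To make the ambiguity concrete, here is a hypothetical input-space alternative (the helper name `mask_image_black` is made up for illustration) that zeroes the masked pixels; a genuinely black region and a masked region then look identical to the network:

```python
import torch

def mask_image_black(x, mask, patch_size=4):
    """Hypothetical input-space masking: paint masked patches black (zeros)."""
    # x: (B, 3, H, W); mask: (B, H//p, W//p), 1 = masked patch
    # upsample the patch-level mask to pixel resolution
    w = mask.repeat_interleave(patch_size, dim=1).repeat_interleave(patch_size, dim=2)
    w = w.unsqueeze(1).type_as(x)   # (B, 1, H, W)
    # zeroed pixels are indistinguishable from genuinely black pixels
    return x * (1. - w)
```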

baibaidj commented 2 years ago

> I assume the approach you mentioned would work similarly well, but we did not ablate the two thoroughly. Using mask embeddings rather than masked patches is inspired by BERT (masked language modeling), which does not spell out "[mask]" to form BPE tokens but uses learnt embeddings instead.

I see. Thanks.

baibaidj commented 2 years ago

> I can offer something from my perspective. If you perform masking directly on the images, one solution is to set the masked area to black. The disadvantage of this approach is that the network cannot tell whether a black region indicates a masked area or genuinely black content. The current solution, borrowed from BERT, of replacing the masked area with a learnable mask token makes more sense.

This perspective is interesting. But my intuition is that a learnable mask token could also be used at the input layer. I also don't think a masked area and black content fall into the same category, such that they could be confused with each other: the former is a contiguous region of 0s, while the latter is a scattering of 0-valued pixels. A region of 0s (a mask) is fundamentally different from sparse 0s (in normal images). I suspect that if we set the masked area to a value other than 0, say 1.0, the effect would be the same: the essential effect of the mask is to set a region to a single value so that the region loses its meaningful signal.
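A sketch of this suggestion, assuming a learnable fill applied in pixel space before patch embedding; `InputSpaceMaskSketch` and `pixel_mask_token` are hypothetical names, not anything in the repository:

```python
import torch
import torch.nn as nn

class InputSpaceMaskSketch(nn.Module):
    """Hypothetical: a learnable mask fill applied in pixel space, before embedding."""

    def __init__(self, patch_size=4, in_chans=3):
        super().__init__()
        self.patch_size = patch_size
        # one learnable patch of pixels shared by all masked positions
        self.pixel_mask_token = nn.Parameter(
            torch.zeros(1, in_chans, patch_size, patch_size))

    def forward(self, x, mask):
        # x: (B, 3, H, W); mask: (B, H//p, W//p), 1 = masked patch
        p = self.patch_size
        w = mask.repeat_interleave(p, dim=1).repeat_interleave(p, dim=2)
        w = w.unsqueeze(1).type_as(x)                       # (B, 1, H, W)
        # tile the learnt patch across the whole image grid
        fill = self.pixel_mask_token.tile(1, 1, x.shape[2] // p, x.shape[3] // p)
        # masked pixels take the learnt fill instead of a fixed constant
        return x * (1. - w) + fill * w
```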

weiyeeah commented 2 years ago

What does `x = x * (1. - w) + mask_tokens * w` mean? It's very confusing to me.
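For reference, a hedged reading of that line (GitHub's markdown had swallowed the asterisks), assuming the shapes used in the patch-embedding sketch near the top of this thread:

```python
# Assumed shapes, as in the patch-embedding sketch above:
#   x           : (B, L, C)  patch embeddings
#   mask_tokens : (B, L, C)  the learnable mask token expanded to every position
#   w           : (B, L, 1)  1.0 where a patch is masked, 0.0 where visible
#
# Element-wise, per token, the line is a select between the two tensors:
#   w == 0  ->  x * 1 + mask_tokens * 0  =  x            (visible token kept)
#   w == 1  ->  x * 0 + mask_tokens * 1  =  mask_tokens  (masked token substituted)
x = x * (1. - w) + mask_tokens * w
```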