Closed baibaidj closed 2 years ago
I assume the approach you mentioned should work similarly well, but we did not ablate the two options thoroughly. Using mask embeddings rather than masking the patches directly is inspired by BERT (masked language modeling), which does not spell out "[MASK]" as BPE tokens but uses a learned embedding instead.
I could offer something from my perspective. If masking is performed directly on the images, one option is to set the masked area to black. The disadvantage of this approach is that the network cannot tell whether a black region indicates a masked area or genuinely black content. The current solution, borrowed from BERT, of replacing the masked area with a learnable mask token makes more sense.
I see. Thanks.
This perspective is interesting. But my intuition is that the learnable mask token could also be applied at the input layer. I don't think "masked area" and "black color" belong to the same category to be contrasted with each other: the former refers to a region of 0s, while the latter in embedding space is a vector of 0s. A contiguous region of 0s (as a mask) is fundamentally different from scattered 0s (in normal images). I would guess that if we set the masked area to a value other than 0, say 1.0, the effect would be the same. The essential effect of masking is to set a region to a single value so the region loses its meaningful signal.
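To make the ambiguity concrete, here is a small NumPy sketch (illustrative sizes and names, not the actual SimMIM pipeline) showing that zeroing pixels makes a masked patch indistinguishable from a genuinely black patch:

```python
import numpy as np

# Toy 1-channel image made of 2x2 patches, each patch 2x2 pixels.
img = np.array([[0.8, 0.7, 0.0, 0.0],
                [0.6, 0.9, 0.0, 0.0],   # top-right patch is genuinely black
                [0.5, 0.4, 0.3, 0.2],
                [0.1, 0.2, 0.6, 0.7]])

def zero_mask_patch(image, pi, pj, p=2):
    """Mask one p x p patch by setting its pixels to 0 (input-space masking)."""
    out = image.copy()
    out[pi * p:(pi + 1) * p, pj * p:(pj + 1) * p] = 0.0
    return out

masked = zero_mask_patch(img, 0, 0)      # mask the top-left patch

# The masked top-left patch is now pixel-identical to the genuinely black
# top-right patch, so the network has no way to tell them apart.
assert np.array_equal(masked[0:2, 0:2], img[0:2, 2:4])
```

Replacing the patch's embedding with a learnable mask token avoids this, since the token is a distinct vector the network can learn to recognize as "masked" rather than as content.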
What does `x = x * (1. - w) + mask_tokens * w` mean? It's very confusing for me.
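For context, in the linked simmim.py the surrounding code builds `w` as a per-patch binary mask broadcast over the embedding dimension, so `x * (1. - w) + mask_tokens * w` keeps the original embedding where `w = 0` and substitutes a copy of the mask token where `w = 1`. A minimal NumPy sketch (shapes and variable names are illustrative):

```python
import numpy as np

B, L, C = 2, 4, 3                            # batch, number of patches, embed dim
rng = np.random.default_rng(0)

x = rng.standard_normal((B, L, C))           # patch embeddings after PatchEmbed
mask_token = rng.standard_normal((1, 1, C))  # single learnable vector, shared
mask = np.array([[0, 1, 0, 1],               # 1 = this patch is masked
                 [1, 0, 0, 0]])

w = mask[..., None].astype(x.dtype)          # (B, L, 1): broadcasts across C
mask_tokens = np.broadcast_to(mask_token, (B, L, C))

x_out = x * (1. - w) + mask_tokens * w       # blend: keep where w=0, replace where w=1

assert np.allclose(x_out[0, 1], mask_token[0, 0])  # masked position -> mask token
assert np.allclose(x_out[0, 0], x[0, 0])           # unmasked position -> unchanged
```

In other words, the line is an element-wise select: since `w` is exactly 0 or 1 per patch, each token is either passed through untouched or fully replaced by the mask token.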
Hi! This is wonderful work. After reading the paper, my intuition was that the input image would be masked for reconstruction. Then, while reading the code, I was surprised to find that you do not mask the input image directly; instead, you mask the feature maps output by PatchEmbed. See here https://github.com/microsoft/SimMIM/blob/519ae7b0999b9d720daa61e3848cd41b8fbd9978/models/simmim.py#L36
I was wondering what the reasoning behind this approach is, and how it differs from the more intuitive one of masking the original image directly. Thank you.