[ICLR'23 Spotlight🔥] The first successful BERT/MAE-style pretraining on any convolutional network; PyTorch impl. of "Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling"
Hello,

I am currently re-implementing SparK and stumbled over the mask token that gets re-introduced during densifying:
https://github.com/keyu-tian/SparK/blob/a63e386f8e5186bc07ad7fce86e06b08f48a61ea/pretrain/spark.py#L99C1-L110C9

I was wondering whether you tested how important the mask tokens are, or whether you have any intuition about their utility. For Transformers I understand why a non-zero mask token is needed, so that attention can act on it and update its value, but is the same still true/necessary for CNNs?

Did you by any chance ablate the benefit of the mask token (plus projection) against simply passing the post-masking zeros into the decoder? (See the sketch below for the two variants I mean.)

Thanks for the great work.

Cheers,
Tassilo
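P.S. To make the question concrete, here is a minimal PyTorch sketch of the two variants I have in mind. This is not the actual SparK code; the `Densify` module and the `use_mask_token` flag are hypothetical names for illustration, and I've omitted the per-scale projection for brevity:

```python
import torch
import torch.nn as nn

class Densify(nn.Module):
    """Hypothetical minimal densify step before the dense decoder.

    fea:    (B, C, H, W) sparse encoder output (masked positions are zero)
    active: (B, 1, H, W) boolean map, True = kept (unmasked) positions
    """
    def __init__(self, dim: int, use_mask_token: bool = True):
        super().__init__()
        self.use_mask_token = use_mask_token
        if use_mask_token:
            # Variant A: a learnable [MASK] embedding, broadcast into every
            # masked spatial location (roughly what spark.py seems to do,
            # modulo the projection).
            self.mask_token = nn.Parameter(torch.zeros(1, dim, 1, 1))
            nn.init.trunc_normal_(self.mask_token, std=0.02)

    def forward(self, fea: torch.Tensor, active: torch.Tensor) -> torch.Tensor:
        if self.use_mask_token:
            # Fill masked positions with the learnable token.
            return torch.where(active, fea, self.mask_token.expand_as(fea))
        # Variant B: the ablation I'm asking about; masked positions
        # simply stay zero on the way into the decoder.
        return fea * active.to(fea.dtype)

# Toy usage:
densify = Densify(dim=512, use_mask_token=True)
active = torch.rand(2, 1, 7, 7) > 0.6          # True = unmasked positions
fea = torch.randn(2, 512, 7, 7) * active       # zero out masked features
dense = densify(fea, active)                    # (2, 512, 7, 7)
```

In other words: is variant A measurably better than variant B for a CNN decoder, or do the convolutions learn to handle plain zeros just as well?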