Hi @weichen582 ,
Thanks for your interest in our work! In the v1 version of the preprint (https://arxiv.org/abs/2106.08254v1), we used plain ViT and ImageNet-1k. The current release adds relative position bias and layerscale, and trains the model on ImageNet-22k. Relative position bias and layerscale gave gains of 0.2 and 0.1 points, respectively, for the base-size model on ImageNet-1k.
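For readers who run into the same question: layerscale is a learnable per-channel scaling of each residual branch (the "tunable gamma" asked about in this thread). Below is a minimal sketch, assuming a PyTorch-style module; the `init_values` default is a common choice, not necessarily the value used in the released checkpoints.

```python
import torch
import torch.nn as nn

class LayerScale(nn.Module):
    """Layerscale: a learnable per-channel gamma that scales a residual
    branch output. Initializing gamma to a small constant keeps each
    block close to the identity mapping at the start of training."""
    def __init__(self, dim, init_values=1e-4):  # init value is an assumption
        super().__init__()
        self.gamma = nn.Parameter(init_values * torch.ones(dim))

    def forward(self, x):
        return self.gamma * x

# Schematically, a transformer block's residual updates become:
#   x = x + layerscale_attn(attn(norm1(x)))
#   x = x + layerscale_mlp(mlp(norm2(x)))
```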
However, we didn't observe improvements from relative position bias on ADE20k (segmentation). If plain linear interpolation was used to resize the relative position bias tables (necessary because the resolution changes from 224 to 512 on ADE20k), the performance degraded significantly compared with plain ViT. So we used a more sophisticated interpolation algorithm (as shown in https://github.com/microsoft/unilm/blob/master/beit/semantic_segmentation/mmcv_custom/checkpoint.py#L380 ) to avoid this degradation when using relative position bias on ADE20k.
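To make the interpolation issue concrete, here is a minimal sketch of the naive approach: resizing the (2G-1) x (2G-1) relative position bias table with plain bilinear interpolation when the patch grid side G changes (e.g. 14 -> 32 for 224 -> 512 inputs). This is only the baseline idea that degraded performance, not the remapping used in the linked checkpoint.py; class-token entries are assumed to be split off beforehand.

```python
import torch
import torch.nn.functional as F

def resize_rel_pos_bias(table, src_grid, dst_grid):
    """Naively resize a relative position bias table with plain (bi)linear
    interpolation when the patch grid changes, e.g. 14 -> 32 patches per
    side for 224 -> 512 inputs. table: ((2*src_grid - 1) ** 2, num_heads)."""
    num_heads = table.shape[1]
    src_len = 2 * src_grid - 1
    dst_len = 2 * dst_grid - 1
    # (L, H) -> (1, H, src_len, src_len) so F.interpolate sees a 2D map
    table_2d = table.T.reshape(1, num_heads, src_len, src_len)
    resized = F.interpolate(table_2d, size=(dst_len, dst_len),
                            mode='bilinear', align_corners=False)
    # back to ((2*dst_grid - 1) ** 2, num_heads)
    return resized.reshape(num_heads, dst_len * dst_len).T
```

The linked checkpoint.py instead remaps the relative coordinates non-uniformly before interpolating, which is what avoids the degradation described above.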
We will update the arXiv preprint to include the above details once more ablation results are ready. I would suggest using plain ViT if you would like to retrain the checkpoints, which makes the implementation much easier.
Hi @donglixp ,
Thank you and your team for the great work and for releasing your code. I was exploring the BEiT model recently and ran into the same question as @weichen582 . Has another version of the paper been published in the meantime with additional information about the usage of the relative positional embedding?
Thanks for your time
Hello,
I am playing with the models provided by BEiT. Thanks for the great work!
I notice that the BEiT models use relative positional embedding as well as a tunable gamma, neither of which is mentioned in the paper. I am curious whether these transformer tricks can also provide improvements on BEiT's baselines, i.e., supervised DeiT/ViT models, and whether they are especially helpful for BEiT's pre-training. Do you have any ablations on that?
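(By relative positional embedding I mean a learnable bias indexed by the relative offset between two patches and added to the attention logits. A minimal sketch of that idea, assuming a square patch grid and omitting the class token; this is an illustration, not the repository's exact implementation.)

```python
import torch
import torch.nn as nn

class RelativePositionBias(nn.Module):
    """Learnable bias added to attention logits, indexed by the relative
    offset between query and key patches. grid is the patch grid side
    length (e.g. 14 for 224x224 images with 16x16 patches)."""
    def __init__(self, grid, num_heads):
        super().__init__()
        num_rel = (2 * grid - 1) ** 2
        self.table = nn.Parameter(torch.zeros(num_rel, num_heads))
        coords = torch.stack(torch.meshgrid(
            torch.arange(grid), torch.arange(grid), indexing="ij"))
        coords = coords.flatten(1)                     # (2, N), N = grid**2
        rel = coords[:, :, None] - coords[:, None, :]  # (2, N, N) offsets
        rel = rel.permute(1, 2, 0) + (grid - 1)        # shift offsets to >= 0
        index = rel[..., 0] * (2 * grid - 1) + rel[..., 1]
        self.register_buffer("index", index)           # (N, N) lookup indices

    def forward(self, attn_logits):
        # attn_logits: (batch, num_heads, N, N); one shared bias per head
        bias = self.table[self.index]                  # (N, N, num_heads)
        return attn_logits + bias.permute(2, 0, 1).unsqueeze(0)
```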
Thanks a lot,