microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI

BEiT's usage of relative positional embedding and tunable gamma. #376

Closed · weichen582 closed this issue 3 years ago

weichen582 commented 3 years ago

Hello,

I am playing with the models provided by BEiT. Thanks for the great work!

I noticed that the BEiT models use relative positional embedding as well as a tunable gamma, neither of which is mentioned in the paper. I am curious whether these Transformer tricks also improve BEiT's baseline, i.e., supervised DeiT/ViT models, and whether they are especially helpful for BEiT's pre-training. Do you have any ablation on that?

Thanks a lot,

donglixp commented 3 years ago

Hi @weichen582 ,

Thanks for your interest in our work! In the v1 version of the preprint (https://arxiv.org/abs/2106.08254v1), we used plain ViT and ImageNet-1k. The current release adds relative position bias and LayerScale, and trains the model on ImageNet-22k. Relative position bias and LayerScale gave gains of 0.2 and 0.1 points, respectively, for the base-size model on ImageNet-1k.
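For readers unfamiliar with the two components, here is a minimal PyTorch sketch of what they look like; this is illustrative only (it ignores, for example, any extra bias entries for the CLS token) and is not the exact implementation in this repo.

```python
import torch
import torch.nn as nn

class LayerScale(nn.Module):
    """Per-channel learnable scale (the "tunable gamma") applied to a residual branch."""
    def __init__(self, dim, init_value=1e-5):
        super().__init__()
        self.gamma = nn.Parameter(init_value * torch.ones(dim))

    def forward(self, x):
        return self.gamma * x


class RelativePositionBias(nn.Module):
    """Learnable bias added to attention logits, indexed by relative patch offset."""
    def __init__(self, window_size, num_heads):
        super().__init__()
        h, w = window_size
        # One learnable bias per head for every possible relative offset on the grid.
        self.table = nn.Parameter(torch.zeros((2 * h - 1) * (2 * w - 1), num_heads))
        coords = torch.stack(torch.meshgrid(
            torch.arange(h), torch.arange(w), indexing="ij")).flatten(1)   # (2, N)
        rel = coords[:, :, None] - coords[:, None, :]                      # (2, N, N)
        rel = rel.permute(1, 2, 0).contiguous()                            # (N, N, 2)
        rel[:, :, 0] += h - 1          # shift offsets to start at 0
        rel[:, :, 1] += w - 1
        rel[:, :, 0] *= 2 * w - 1      # flatten (dy, dx) into a single table index
        self.register_buffer("index", rel.sum(-1))                         # (N, N)

    def forward(self):
        # (num_heads, N, N); broadcast over the batch when added to attention logits.
        return self.table[self.index].permute(2, 0, 1)
```

The gamma is typically initialized to a small value (e.g. 1e-5) so each residual branch starts close to the identity, which is what makes LayerScale helpful for training deep ViTs.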

However, we did not observe improvements from relative position bias on ADE20k (segmentation). If linear interpolation (needed because the resolution changes from 224 to 512 on ADE20k) was used for the relative position bias, performance degraded significantly compared with plain ViT. So we used a more sophisticated interpolation algorithm (as shown in https://github.com/microsoft/unilm/blob/master/beit/semantic_segmentation/mmcv_custom/checkpoint.py#L380) to avoid this degradation when using relative position bias on ADE20k.
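To make the resolution issue concrete: the bias table is a flat list of entries indexed by relative offset, so going from a 14x14 patch grid (224/16) to 32x32 (512/16) means resizing a (2·14-1)² table to a (2·32-1)² one. A naive resize looks roughly like the sketch below (my own illustration, not the repo's code); the checkpoint.py linked above instead uses a more careful coordinate remapping before interpolating, which is what avoids the degradation mentioned.

```python
import torch
import torch.nn.functional as F

def resize_rel_pos_bias_table(table, src_size, dst_size):
    """Naively resize a relative-position-bias table from a src_size x src_size
    patch grid to dst_size x dst_size (e.g. 14 -> 32 patches per side).

    table: ((2*src_size - 1)**2, num_heads) tensor, CLS-token entries excluded.
    Returns a ((2*dst_size - 1)**2, num_heads) tensor.
    """
    num_heads = table.shape[1]
    src_len = 2 * src_size - 1
    dst_len = 2 * dst_size - 1
    # View the flat table as a 2-D grid of relative offsets and interpolate it.
    grid = table.reshape(src_len, src_len, num_heads).permute(2, 0, 1).unsqueeze(0)
    grid = F.interpolate(grid, size=(dst_len, dst_len),
                         mode="bicubic", align_corners=False)
    return grid.squeeze(0).permute(1, 2, 0).reshape(dst_len * dst_len, num_heads)
```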

We will update the arXiv preprint to include the above details once more ablation results are ready. If you would like to retrain the checkpoints, I would suggest using plain ViT, which makes the implementation much easier.

marc345 commented 2 years ago

Hi @donglixp ,

Thank you and your team for the great work and for releasing your code. I was exploring the BEiT model recently and ran into the same question as @weichen582. Has another version of the paper been published in the meantime with additional information about the usage of the relative positional embedding?

Thanks for your time