microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI

BEiT initialization #949

Closed: vateye closed this issue 1 year ago

vateye commented 1 year ago

Hi, I am currently looking into the initialization of BEiT. By default, the output projection and FFN weights are scaled by 1 / sqrt(2N), where N is the current layer id. Reading the DeepNet paper, I noticed that the scaling factor only needs to satisfy the constraint 2N(v^2 + w^2) = O(1). Can we simply set v = w = 1 / (2 sqrt(N)) to ensure 2N(v^2 + w^2) = 1?
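As a quick arithmetic check of the proposed values (a toy verification under the poster's definitions, not code from the repo):

```python
# Verify 2N(v^2 + w^2) = 1 when v = w = 1 / (2 sqrt(N)).
for N in (1, 12, 24, 100):
    v = w = 1.0 / (2.0 * N ** 0.5)
    print(N, 2 * N * (v ** 2 + w ** 2))  # prints 1.0 for every N, up to float rounding
```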

donglixp commented 1 year ago

@vateye

If you would like to initialize ViT (preLN), the code at https://github.com/microsoft/unilm/blob/master/beit/modeling_finetune.py#L289-L317 can be used. The initialization was first proposed in our XLM-E paper https://arxiv.org/pdf/2106.16138.pdf (Sec 3.4).

[figure: excerpt from Sec 3.4 of the XLM-E paper describing the initialization]
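Condensed from the referenced lines, the rescaling divides each block's attention output projection and second FFN matrix by sqrt(2 * layer_id); a sketch assuming BEiT-style block attributes (`attn.proj`, `mlp.fc2`):

```python
import math

def fix_init_weight(blocks):
    """Depth-dependent rescaling: divide each residual branch's output
    projection by sqrt(2 * layer_id), with layer_id counted from 1."""
    def rescale(param, layer_id):
        param.div_(math.sqrt(2.0 * layer_id))

    for layer_id, layer in enumerate(blocks, start=1):
        rescale(layer.attn.proj.weight.data, layer_id)  # attention output projection
        rescale(layer.mlp.fc2.weight.data, layer_id)    # second FFN linear
```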

I also recommend the architecture and initialization of Magneto (https://arxiv.org/pdf/2210.06423.pdf). It is more theoretically grounded and performs better than ViT (preLN). The implementation is open source at https://github.com/microsoft/torchscale.
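Instantiating a torchscale encoder takes only a few lines; this is a minimal sketch assuming the `EncoderConfig` / `Encoder` API shown in the library's README (see the README for the exact flags controlling Magneto-style Sub-LN):

```python
# Minimal torchscale usage sketch; assumes the EncoderConfig / Encoder API
# from https://github.com/microsoft/torchscale.
from torchscale.architecture.config import EncoderConfig
from torchscale.architecture.encoder import Encoder

config = EncoderConfig(vocab_size=64000)
model = Encoder(config)
print(model)
```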

vateye commented 1 year ago

Thanks for your quick response. However, is there any theoretical derivation of this 1/sqrt(2N) factor? I am quite confused about where it comes from.
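For intuition (a toy sanity check, not a formal derivation): a depth-N pre-LN transformer adds roughly 2N residual branches to the stream, two per layer (attention and FFN), so with approximately independent unit-variance branch outputs the unscaled stream variance grows like 2N, while a 1/sqrt(2N) rescale keeps it O(1):

```python
import torch

# Toy model of a residual stream accumulating 2N independent branch outputs;
# the random tensors are stand-ins for unit-variance branch outputs.
N, dim, samples = 24, 768, 4096
plain = torch.randn(samples, dim)
scaled = plain.clone()
for _ in range(2 * N):                         # two residual branches per layer
    branch = torch.randn(samples, dim)
    plain = plain + branch                     # variance grows to ~2N + 1
    scaled = scaled + branch / (2 * N) ** 0.5  # variance stays ~2

print(f"unscaled stream variance: {plain.var():.1f}")   # ~49
print(f"1/sqrt(2N)-scaled:        {scaled.var():.1f}")  # ~2.0
```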

donglixp commented 1 year ago

The code and pre-trained models of BEiT-3 can be found at aka.ms/beit3.