@vateye
If you would like to initialize a pre-LN ViT, you can use the code at https://github.com/microsoft/unilm/blob/master/beit/modeling_finetune.py#L289-L317. This initialization was first proposed in our XLM-E paper https://arxiv.org/pdf/2106.16138.pdf (Sec. 3.4).
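For reference, a minimal sketch of the per-layer rescaling in the linked `modeling_finetune.py`, assuming a standard ViT block that exposes the attention output projection as `attn.proj` and the second FFN linear as `mlp.fc2` (these attribute names follow the BEiT codebase; adjust them for your own block definition):

```python
import math
import torch.nn as nn

def rescale_block_weights(blocks: nn.ModuleList) -> None:
    """Divide the attention output projection and the FFN output
    weights of the l-th block by sqrt(2 * l), with l counted from 1."""
    def rescale(param, layer_id):
        param.div_(math.sqrt(2.0 * layer_id))

    for layer_id, layer in enumerate(blocks):
        # Output projection of self-attention
        rescale(layer.attn.proj.weight.data, layer_id + 1)
        # Second (output) linear layer of the FFN
        rescale(layer.mlp.fc2.weight.data, layer_id + 1)
```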
I also recommend the Magneto architecture and its initialization, https://arxiv.org/pdf/2210.06423.pdf. It is more theoretically grounded and gives better performance than the pre-LN ViT. The implementation is open source at https://github.com/microsoft/torchscale.
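As a rough usage sketch (assuming the torchscale package is installed; the construction below follows its README, and the exact config flags for the Magneto/Sub-LN variant should be checked against the repository):

```python
from torchscale.architecture.config import EncoderConfig
from torchscale.architecture.encoder import Encoder

# Build a Transformer encoder; torchscale applies its own
# theoretically derived initialization internally.
config = EncoderConfig(vocab_size=64000)
model = Encoder(config)
print(model)
```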
Thanks for your quick response. However, is there any theoretical derivation of this factor 1/sqrt(2l)? I am quite confused about where it comes from.
The code and pre-trained models of BEiT-3 can be found at aka.ms/beit3.
Hi, I am currently looking at the initialization of BEiT. By default, the output projection and FFN weights are scaled by 1/sqrt(2N), where N is the current layer id. Reading the DeepNet paper, I noticed that the scaling factors only need to satisfy the constraint 2N(v^2 + w^2) = O(1). Can we simply set v = w = 1/(2 sqrt(N)) to ensure 2N(v^2 + w^2) = 1?
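For what it's worth, a quick check of that choice against the constraint above:

$$2N\,(v^2 + w^2) = 2N\left(\frac{1}{4N} + \frac{1}{4N}\right) = 2N \cdot \frac{1}{2N} = 1.$$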