Closed · larrylawl closed this 6 months ago
The previous implementation only scaled the MLP output during init. We decided to follow the init methods of https://github.com/kingoflolz/mesh-transformer-jax and also scale the attention output (see Section 2.1.3 of GPT-NeoX-20B: An Open-Source Autoregressive Language Model).
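For concreteness, here is a minimal PyTorch sketch of that scheme (not the exact TinyLlama code; the `proj.weight` naming for the attention and MLP output projections is an assumption about the module layout):

```python
import math

import torch.nn as nn


def init_weights(module: nn.Module, n_embd: int) -> None:
    # base init for all weights: small init, std = sqrt(2 / (5 * n_embd))
    if isinstance(module, (nn.Embedding, nn.Linear)):
        nn.init.normal_(module.weight, mean=0.0, std=math.sqrt(2.0 / (5 * n_embd)))
        if isinstance(module, nn.Linear) and module.bias is not None:
            nn.init.zeros_(module.bias)


def scale_output_projections(model: nn.Module, n_embd: int, n_layer: int) -> None:
    # re-initialize the residual-stream output projections of BOTH the
    # attention block and the MLP (mesh-transformer-jax / GPT-NeoX style)
    std = 1.0 / (n_layer * math.sqrt(n_embd))
    for name, param in model.named_parameters():
        if name.endswith("proj.weight"):  # assumed output-projection name
            nn.init.normal_(param, mean=0.0, std=std)
```

Usage would be `model.apply(lambda m: init_weights(m, n_embd))` followed by `scale_output_projections(model, n_embd, n_layer)`, so the depth-dependent scaling overwrites the base init on the output projections only.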
Thanks @jzhang38 for replying!
Can I clarify why this change was made? When I printed `module.weight.size(1)` in the `elif isinstance(module, nn.Linear)` block, the values varied: they can be either `n_embd` or `intermediate_size`. Given that the `small_init` scheme is a function of the dimension, shouldn't it be `module.weight.size(1)` instead of `n_embd`?
For the following line, shouldn't the numerator be 2 instead of 1?
> We decided to follow the init methods of https://github.com/kingoflolz/mesh-transformer-jax and also scale the attention output.

Sorry Peiyuan, can you point me to the code that implements this? I can't seem to find it in the codebase.
Bump, just in case you missed it @jzhang38
Hi Larry, sorry for the late reply. I have been busy with another project these days. Next time you can ping me on Twitter if I forget to respond.
> For the following line, shouldn't the numerator be 2 instead of 1?
GPT-NeoX uses parallel attention and feed-forward layers, and its Section 2.1.3 says "with the factor of 2 compensating for the fact that the attention and feed-forward layers are organized in parallel". Since Llama uses sequential attention and FF layers, the numerator should be 1.
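Spelled out numerically (a sketch with illustrative dimensions, roughly TinyLlama-1.1B's width and depth):

```python
import math

d, L = 2048, 22  # illustrative: roughly TinyLlama-1.1B's n_embd and n_layer

# GPT-NeoX (parallel attention + FF, Wang 2021): std = 2 / (L * sqrt(d))
std_parallel = 2.0 / (L * math.sqrt(d))

# Llama-style sequential attention + FF: drop the compensating factor of 2
std_sequential = 1.0 / (L * math.sqrt(d))

print(std_parallel, std_sequential)  # the parallel variant is exactly 2x larger
```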
> When I printed `module.weight.size(1)` in the `elif isinstance(module, nn.Linear)` block, the values varied: they can be either `n_embd` or `intermediate_size`. Given that the `small_init` scheme is a function of the dimension, shouldn't it be `module.weight.size(1)` instead of `n_embd`?
This is a really good question. I had the same doubts initially. In the end, I decided to follow the implementation of GPT-NeoX, where they use the `n_embd` value instead of `weight.size(1)`: https://github.com/EleutherAI/gpt-neox/blob/f14782a571b9b4ff52803ce57c2bfc650670c30a/megatron/model/init_functions.py#L204
I think reading Section 2.2 of Transformers without Tears: Improving the Normalization of Self-Attention should resolve your doubt. In short, the reason we use "5d" is that the sum of the input and output dims for the FF upsample/downsample layers is 5d. For the attention q/k/v projections, this sum is "2d". In that paper, the author proposes initializing the attention q/k/v with the 5d value as well, hence the name "small init". Note that d here refers to the transformer's `n_embd`.
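A quick sanity check of that arithmetic (my own sketch; `d` stands for the transformer's `n_embd`):

```python
import math

def glorot_style_std(fan_in: int, fan_out: int) -> float:
    # Xavier/Glorot-style std: sqrt(2 / (fan_in + fan_out))
    return math.sqrt(2.0 / (fan_in + fan_out))

d = 2048  # illustrative n_embd
# FF upsample/downsample map between d and 4d, so fan_in + fan_out = 5d
assert math.isclose(glorot_style_std(d, 4 * d), math.sqrt(2.0 / (5 * d)))
# attention q/k/v map d -> d, so the natural sum would be 2d; the paper
# instead proposes reusing the 5d value there too, hence "small init"
assert glorot_style_std(d, d) > math.sqrt(2.0 / (5 * d))
```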
(On the other hand, the hidden dim of SwiGLU is actually 8/3 d instead of 4d. You can see this in another issue.)
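For reference, the 8/3 d figure comes from matching the parameter count of a standard 4d FFN, since SwiGLU has three weight matrices instead of two (a quick check with an illustrative `d`):

```python
d = 2048  # illustrative n_embd

# standard FFN: two matrices, d x 4d (up) and 4d x d (down)
params_ffn = 2 * d * (4 * d)

# SwiGLU FFN: three matrices (gate, up, down); matching the parameter
# count gives hidden = (2/3) * 4d = 8d/3, usually rounded in practice
hidden = round(8 * d / 3)
params_swiglu = 3 * d * hidden  # approximately equal to params_ffn
```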
No worries at all. Thanks Peiyuan!
Hi, I noticed that the team changed the `_init_weights` function in this commit: https://github.com/jzhang38/TinyLlama/commit/89c75f40ec99cb48ea94c92423c1deae67bc2329

Can I check the motivation for doing so? Thanks!