Questions for training gpt-2 using mup

jiangjiadi commented 11 months ago

The conclusion of the paper is engaging. But when I tried to implement hyperparameter transfer using Mup on the GPT-2 model, I encountered some issues.

When scaling the width of the GPT-2 model, I found the optimum shifts still existed, even though I used Mup Initialization. Some of the results are as follows. (Our code is modified from karpathy/nanoGPT: The simplest, fastest repository for training/finetuning medium-sized GPTs. (github.com)) I have checked the std for each layer to ensure the implementation of Mup was not problematic. To eliminate the optimum shifts in mup, I also made some attempts, including whether to scale the weight decay, trying different base shape, and whether to using mixed precision. But these attempts still haven't worked. Can you provide me with more implementation details of Mup on the GPT model?

lr	96	192	384	768
2^-8	4.1291	3.7287	3.4713	nan
2^-9	4.1184	3.7086	3.4041
2^-10	4.1627	3.7074	3.3809	3.1226
2^-11	4.2398	3.7363	3.3776	3.1009
2^-12	4.3393	3.8111	3.4135	3.1074

I don't quite understand the attention logit scaling mentioned in the article. If the attention logit is not scaled, would Mup become ineffective?

edwardjhu commented 11 months ago

Can you share the coordinate check plots like these?

https://github.com/microsoft/mup/tree/main/examples/Transformer/coord_checks

They are really helpful for debugging.

jiangjiadi commented 11 months ago

The coordinate check plots are given below. It seems the Mup's plot is much more stable than SP's. However, in the case of high learning rate, sometimes there will be some fluctuations in Mup's plot, and I am not sure whether these fluctuations have an impact on Mup.