microsoft / mup

maximal update parametrization (µP)
https://arxiv.org/abs/2203.03466
MIT License

Questions for training gpt-2 using mup #66

Closed jiangjiadi closed 9 months ago

jiangjiadi commented 11 months ago

The paper's conclusion is compelling, but when I tried to implement hyperparameter transfer with µP on a GPT-2 model, I ran into some issues. Here are the losses from my learning-rate × width sweep:

| lr \ width | 96 | 192 | 384 | 768 |
|---|---|---|---|---|
| 2^-8 | 4.1291 | 3.7287 | 3.4713 | nan |
| 2^-9 | 4.1184 | 3.7086 | 3.4041 | |
| 2^-10 | 4.1627 | 3.7074 | 3.3809 | 3.1226 |
| 2^-11 | 4.2398 | 3.7363 | 3.3776 | 3.1009 |
| 2^-12 | 4.3393 | 3.8111 | 3.4135 | 3.1074 |
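For context on what such a sweep probes: µP scales the effective per-layer learning rate with width so that the change in each layer's activations after an update stays O(1), which is why the optimal base learning rate should transfer across widths. A toy numpy sketch of that idea (dimensions, base learning rate, and the 1/width hidden-layer scaling, which mirrors µP's rule for Adam-trained hidden weights, are chosen here purely for illustration, not taken from the reporter's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
base_lr = 0.1

def update_size(n, lr):
    """Coordinate size of the change in h = W @ x after one gradient step on W."""
    x = rng.standard_normal(n)       # input with O(1) coordinates
    g = np.ones(n)                   # gradient w.r.t. h, O(1) coordinates
    dW = -lr * np.outer(g, x)        # gradient update on the hidden weight
    dh = dW @ x                      # resulting change in the activation
    return float(np.mean(np.abs(dh)))

for n in [96, 192, 384, 768]:
    sp = update_size(n, base_lr)         # fixed lr: update blows up with width
    mup = update_size(n, base_lr / n)    # 1/width-scaled lr: update stays O(1)
    print(f"width={n:4d}  fixed-lr |dh|={sp:8.2f}  scaled-lr |dh|={mup:.3f}")
```

With a fixed learning rate the activation update grows linearly in width (hence the `nan` at width 768 with lr 2^-8 above), while the width-scaled rate keeps it roughly constant.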
edwardjhu commented 11 months ago

Can you share the coordinate check plots like these?

https://github.com/microsoft/mup/tree/main/examples/Transformer/coord_checks

They are really helpful for debugging.
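The idea behind a coordinate check can be sketched without the mup package: at each width, measure the average absolute coordinate of a layer's activations and verify it stays O(1) as width grows. A minimal numpy illustration with a single fan_in-initialized hidden layer (widths and initialization are made up for the demo; the real check in the repo also tracks how these sizes evolve over training steps):

```python
import numpy as np

def avg_coord_size(v):
    """Mean absolute value of a vector's coordinates."""
    return float(np.mean(np.abs(v)))

rng = np.random.default_rng(0)

for n in [96, 192, 384, 768, 1536]:
    x = rng.standard_normal(n)                    # input with O(1) coordinates
    W = rng.standard_normal((n, n)) / np.sqrt(n)  # fan_in-scaled init
    h = W @ x                                     # hidden pre-activation
    print(f"width={n:5d}  |h|={avg_coord_size(h):.3f}")
```

The printed sizes hover around the same value at every width; in a correct µP setup the plots linked above should look similarly flat across widths, while a blow-up or decay with width (as in the SP plots) signals a parametrization bug.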

jiangjiadi commented 11 months ago

The coordinate check plots are given below. µP's plots are much more stable than SP's. However, at high learning rates there are sometimes fluctuations in the µP plots, and I am not sure whether these fluctuations affect µP's transfer.

edwardjhu commented 11 months ago

Can you extend the x-axis to the right as much as you can for mup?

jiangjiadi commented 11 months ago

> Can you extend the x-axis to the right as much as you can for mup?

I have now tested widths from 48 to 3072 with base shape = 768.

jiangjiadi commented 9 months ago

I've identified the problem. Thanks~

Joelx commented 6 months ago

@jiangjiadi Could you share what the problem was and how you remedied it?