Closed jiangjiadi closed 9 months ago
Can you share the coordinate check plots like these?
https://github.com/microsoft/mup/tree/main/examples/Transformer/coord_checks
They are really helpful for debugging.
The coordinate check plots are given below. The Mup plots look much more stable across widths than the SP plots. At high learning rates, however, the Mup plots sometimes show fluctuations, and I am not sure whether these fluctuations affect Mup.
SP, AdamW, lr=0.001
SP, AdamW, lr=0.01
SP, AdamW, lr=0.1
Mup, AdamW, lr=0.001
Mup, AdamW, lr=0.01
Mup, AdamW, lr=0.1
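For anyone reading these plots by eye, one rough way to quantify "stable across widths" is the slope of an activation statistic against width on a log-log scale: a slope near 0 means the curve is flat, while a clearly positive slope means the statistic grows with width. The sketch below is only an illustration. The function name `loglog_slope` and the two toy input series are made up for the example and are not measurements from this issue.

```python
import math

def loglog_slope(widths, values):
    """Least-squares slope of log(values) against log(widths).

    A slope close to 0 suggests the statistic is roughly width-independent.
    """
    xs = [math.log(w) for w in widths]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Toy inputs (synthetic, for illustration only): one flat series and one
# that grows like sqrt(width).
widths = [256, 512, 1024, 2048]
flat = [1.0 for _ in widths]
growing = [math.sqrt(w) for w in widths]

print(loglog_slope(widths, flat))
print(loglog_slope(widths, growing))
```

In practice the `values` would be the per-layer activation size recorded at a fixed training step for each width, one slope per layer and step.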
Can you extend the x-axis to the right as much as you can for mup?
I have now tested widths from 48 to 3072, with a base shape width of 768.
SP, AdamW, lr=0.001
Mup, AdamW, lr=0.001
I've identified the problem. Thanks~
@jiangjiadi Could you share what the problem was and how you remedied it?
The conclusion of the paper is engaging. But when I tried to implement hyperparameter transfer using Mup on the GPT-2 model, I encountered some issues.