microsoft / mup

maximal update parametrization (µP)
https://arxiv.org/abs/2203.03466
MIT License

interpreting coord checks #42

Closed llucid-97 closed 1 year ago

llucid-97 commented 1 year ago

Hi there, I'm working on a Flax port of this, and I'm trying to use the coord check scripts on a variant of your MLP example to see if I've done it right. I'm struggling to interpret the results, though:

[Images: sp_mlp_sgd_coord, μp_mlp_sgd_coord]

The point I'm confused about is the green line in the µP graph at step 1: if I understood your paper correctly, this should be a flat line, right? Looking through my code, I can't spot the mistake, so I have to ask: is my assumption about step 1 of the coord check wrong?

edwardjhu commented 1 year ago

Hi! Does the green curve correspond to the last layer? If so, this is expected. It is for a related reason that we recommend initializing the last layer weights to zero.
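
To illustrate why a last-layer curve can slope at step 1 without indicating a bug, here is a rough standalone sketch in plain Python (no Flax or mup calls). It rests on assumptions that are not stated in this thread: that step 1 reflects the forward pass at initialization, that the last hidden layer has roughly unit-scale coordinates, and that a µP-style readout draws its weights with standard deviation 1/width. The helper `readout_output_scale` is a hypothetical name used only for this illustration.

```python
import random


def readout_output_scale(width, readout_std, trials=200, seed=0):
    """Average |w . x| of one output unit at initialization (toy model)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Assumption: hidden activations have Theta(1) coordinates.
        x = [rng.gauss(0.0, 1.0) for _ in range(width)]
        # Assumption: readout weights ~ N(0, readout_std^2);
        # readout_std = 0.0 models a zero-initialized last layer.
        w = [rng.gauss(0.0, readout_std) for _ in range(width)]
        total += abs(sum(wi * xi for wi, xi in zip(w, x)))
    return total / trials


widths = [64, 256, 1024]

# Assumed muP-style readout: std = 1/width, so the output shrinks with width.
mup_like = [readout_output_scale(n, 1.0 / n) for n in widths]

# Zero-initialized readout: the output is exactly 0 at every width.
zero_init = [readout_output_scale(n, 0.0) for n in widths]

for n, a, b in zip(widths, mup_like, zero_init):
    print(f"width={n:5d}  mup_like={a:.4f}  zero_init={b:.4f}")
```

Under these assumptions the output has standard deviation 1/sqrt(width), so the `mup_like` column decreases as width grows, while the `zero_init` column is constant. That is one way to read the suggestion to zero-initialize the last layer: it removes the width dependence of the output at initialization.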

llucid-97 commented 1 year ago

Aah I see. Yes it is for the last layer. Thanks