microsoft / mup

maximal update parametrization (µP)
https://arxiv.org/abs/2203.03466
MIT License

Are parameters with no "infinite" dimensions allowed? #29

Closed: callumm-graphcore closed this issue 1 year ago

callumm-graphcore commented 1 year ago

Hi,

Is it valid to have parameters that have no "infinite" dimensions? This line suggests that it is, but I can't find anything in the paper that explains how this case should be dealt with.

With thanks, Callum

edwardjhu commented 1 year ago

Hi Callum,

Yes, it's possible to have parameters with only finite dimensions. For example, given a finite output dimension d_out, the bias vector for the last layer will have dimension 1 x d_out.

callumm-graphcore commented 1 year ago

Thanks Edward! Is there a part of the paper that explains what the correct scaling is in this case? Would this apply even if you had a linear layer where neither the input nor the output dimension was scaled?

edwardjhu commented 1 year ago

The bias example I gave is covered under input weights & biases in Tables 3, 8, and 9, and it has a constant init and LR.

Yes, it also applies to a linear layer where neither the input nor the output dimension is scaled. We might not have discussed that case specifically in the paper since it's less common, but you should use a constant init and LR.
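
To make the rule concrete, here is a minimal sketch in plain Python. It is not the mup library's code. The helper names `count_infinite_dims` and `scaling_rule` are hypothetical, and the sketch assumes that a dimension counts as "infinite" when its size differs between a narrow base model and the wider target model. It only encodes the case discussed in this thread (no infinite dimensions means constant init and LR) and deliberately leaves the width-dependent cases to the paper's tables.

```python
def count_infinite_dims(base_shape, target_shape):
    """Count dimensions whose size changes between the base and target model.

    Assumption: a dimension is "infinite" (width-like) if it differs
    between the two shapes, and "finite" if it stays the same.
    """
    if len(base_shape) != len(target_shape):
        raise ValueError("base and target shapes must have the same rank")
    return sum(b != t for b, t in zip(base_shape, target_shape))


def scaling_rule(base_shape, target_shape):
    """Return width multipliers for init and LR (hypothetical helper).

    Only the zero-infinite-dimension case is filled in, as stated in
    the thread: constant init and constant LR, i.e. multipliers of 1.
    Other cases return None and should be read off the paper's tables.
    """
    n_inf = count_infinite_dims(base_shape, target_shape)
    if n_inf == 0:
        return {"n_inf": 0, "init_mult": 1.0, "lr_mult": 1.0}
    return {"n_inf": n_inf, "init_mult": None, "lr_mult": None}


# A linear layer whose input and output sizes are both fixed (10 x 32):
fixed_linear = scaling_rule((10, 32), (10, 32))

# A last-layer bias with finite output dimension d_out = 10:
last_bias = scaling_rule((10,), (10,))

# A hidden weight matrix that grows with width (two infinite dimensions):
hidden = scaling_rule((256, 256), (1024, 1024))
```

Here `fixed_linear` and `last_bias` both get multipliers of 1, so their init and LR stay the same at every width, while `hidden` is flagged as width-dependent.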

callumm-graphcore commented 1 year ago

Ah, OK, I see now. Thank you very much!

thegregyang commented 1 year ago

Yes, that is allowed.
