microsoft / mup

maximal update parametrization (µP)
https://arxiv.org/abs/2203.03466
MIT License

Batch size, Seq len, Step Transferring #24

Closed timothyxp closed 1 year ago

timothyxp commented 1 year ago

Hi! I don't fully understand how the transfer of parameters such as batch_size/seq_len/steps is supposed to work (Figures 17 and 19 in the paper). I also couldn't find any mention of this in the paper or in the library code. It would seem that, following the idea of mup, we shouldn't apply any scaling to these parameters, but then it is unclear how this interacts with batch size. Should I forget about all the lr/batch_size dependency rules? What will happen to the convergence rate in that case?

edwardjhu commented 1 year ago

Thanks for the question, Timofey.

We note in the second paragraph of the intro that:

> In addition to width, we empirically verify that, with a few caveats, HPs can also be transferred across depth (in Section 6.1) as well as batch size, language model sequence length, and training time (in Appendix G.2.1). This reduces the tuning problem of an (arbitrarily) large model to that of a (fixed-sized) small model. Our overall procedure, which we call µTransfer, is summarized in Algorithm 1 and Fig. 2, and the HPs we cover are summarized in Tables 1 and 2.

This is also mentioned in the caption of Table 1.

You are right that mup doesn't give us any theoretical guarantees for these dimensions, but we need to consider them to make muTransfer useful in practice, which is why we verified them empirically.
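For concreteness, here is a minimal sketch of how the proxy-then-transfer workflow can look with the `mup` package. It is not from the repo: the toy model, the widths, the learning rate, and the `make_model` helper are all made up for illustration. The point is that only width is mediated by µP (via `MuReadout`, `set_base_shapes`, and `MuAdam`); batch size, sequence length, and step count are "transferred" simply by reusing the values and lr found on the proxy, which is what the empirical results in Appendix G.2.1 support.

```python
import torch
import torch.nn as nn
from mup import MuReadout, set_base_shapes, MuAdam

# Hypothetical toy model; only `width` is a dimension scaled under µP.
class MLP(nn.Module):
    def __init__(self, width, d_in=32, d_out=10):
        super().__init__()
        self.fc1 = nn.Linear(d_in, width)
        self.fc2 = nn.Linear(width, width)
        # The output layer is MuReadout so its init/multiplier scale correctly with width.
        self.readout = MuReadout(width, d_out)

    def forward(self, x):
        return self.readout(torch.relu(self.fc2(torch.relu(self.fc1(x)))))

def make_model(width):
    model = MLP(width)
    # Base/delta models only declare which dimensions scale with width; they are never trained.
    set_base_shapes(model, MLP(width=64), delta=MLP(width=128))
    return model

# 1) Tune HPs on a small proxy: small width, and (per Appendix G.2.1)
#    also a reduced batch size / sequence length / step budget.
proxy = make_model(width=256)
best_lr = 1e-2  # pretend this came out of an lr sweep on the proxy

# 2) Train the target: only width grows through µP; the tuned lr is reused
#    as-is, and batch size / seq len / steps are set to their full values
#    without re-deriving lr from any lr ~ batch_size rule.
target = make_model(width=4096)
optimizer = MuAdam(target.parameters(), lr=best_lr)
```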

thegregyang commented 1 year ago

Adding on to Edward: the usual lr/batch_size dependency rule applies when you fix the number of epochs, whereas here we fix the number of steps (since we are shrinking the training problem used for tuning, this makes more sense).
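To spell out the difference with made-up numbers: under a fixed-epoch budget the step count shrinks as batch size grows, which is where heuristics like the linear lr scaling rule come in, whereas in the µTransfer setup the step budget is held fixed and the tuned lr is reused directly.

```python
# Illustrative numbers only, not from the paper.
dataset_size = 1_000_000
proxy_bs, target_bs = 32, 512
tuned_lr = 1e-2  # lr found on the proxy run

# Fixed-epoch regime: steps shrink as batch size grows, so heuristics
# like the linear scaling rule increase lr along with batch size.
epochs = 10
steps_fixed_epochs = epochs * dataset_size // target_bs
lr_linear_rule = tuned_lr * target_bs / proxy_bs

# Fixed-step regime (the setting verified empirically in the paper):
# the step budget stays the same, and the tuned lr transfers as-is.
steps_fixed_budget = 100_000
lr_transferred = tuned_lr
```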