microsoft / mup

maximal update parametrization (µP)
https://arxiv.org/abs/2203.03466

mu parametrization for multi-head attention / grouped convolution #17

Closed · xwjabc closed this issue 2 years ago

xwjabc commented 2 years ago

Hi, in Appendix E.2 - Number of Attention Heads, there is a use case that fixes d_head (the dimension per head) and scales n_head (the number of heads). Do we need to change anything when we use such multi-head attention with a scaled n_head? Or do we still follow the same recipe as in the provided Transformer example (scale d_head, change only the attention scaling from 1/sqrt(d) to 1/d, and keep the other settings the same)?

Similarly, when applying µP to a grouped convolution that keeps the dimension per group fixed and scales the number of groups, is there any special rule we should follow?

Thanks!

xwjabc commented 2 years ago

Actually, I found in P47 Appendix J.2.1 - Attention and P24 Attention Logit Scaling that the scale should be sqrt(d_head,0) / d_head (for backward compatibility). Does this mean that if we fix d_head and scale n_head, we can simply use 1 / sqrt(d_head)? The same rule should also apply to grouped convolution if we fix the dimension per group and scale the number of groups. Thanks!
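For concreteness, here is a small sketch of that backward-compatible multiplier (the function name and the example widths are mine, not from the paper or the repo):

```python
import math

def attn_logit_mult(d_head, d_head0):
    # Backward-compatible muP attention-logit multiplier from Appendix J.2.1:
    # sqrt(d_head0) / d_head. At base width (d_head == d_head0) this is exactly
    # 1 / sqrt(d_head0), i.e. standard attention, and it shrinks like 1 / d_head
    # as the head dimension grows.
    return math.sqrt(d_head0) / d_head

print(attn_logit_mult(64, 64))    # 0.125   == 1/sqrt(64): standard scaling at base width
print(attn_logit_mult(256, 64))   # 0.03125 == sqrt(64)/256: 1/d_head behavior as d_head scales
# If d_head is held fixed at d_head0 and only n_head grows, this multiplier never
# changes, so it reduces to the usual 1/sqrt(d_head).
```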

thegregyang commented 2 years ago

> Or do we still follow the same recipe as in the provided Transformer example (scale d_head, change only the attention scaling from 1/sqrt(d) to 1/d, and keep the other settings the same)?

Yes. If you follow the README here or the Transformer example, then it automatically scales n_head correctly as well as d_head.
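For reference, a minimal sketch (my own code, not the repo's) of the one change the Transformer example makes inside attention: dividing the query-key logits by d_head rather than sqrt(d_head). Combined with the usual mup setup from the README (set_base_shapes, MuReadout, a Mu optimizer), widening the model either by growing d_head with n_head fixed or by growing n_head with d_head fixed requires no further changes to this block; note that when d_head is fixed, 1/d_head and 1/sqrt(d_head) differ only by a width-independent constant, so the two choices are equivalent up to a tunable hyperparameter.

```python
import torch

def mup_attention_scores(q, k):
    # q, k: (batch, n_head, seq, d_head)
    d_head = q.shape[-1]
    # Standard parametrization divides by sqrt(d_head); muP divides by d_head.
    return q @ k.transpose(-2, -1) / d_head

batch, seq = 2, 8
for n_head, d_head in [(4, 32), (4, 64), (8, 32)]:  # widen via d_head or via n_head
    q = torch.randn(batch, n_head, seq, d_head)
    k = torch.randn(batch, n_head, seq, d_head)
    print(n_head, d_head, mup_attention_scores(q, k).shape)
```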

> Does this mean that if we fix d_head and scale n_head, we can simply use 1 / sqrt(d_head)?

Yes.

> The same rule should also apply to grouped convolution if we fix the dimension per group and scale the number of groups.

We have not thought about grouped convolution before, but after looking at it, I think that's the case.
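For what it's worth, here is a purely illustrative construction (mine, not from the paper) of a grouped convolution scaled in that way: the channel count per group is held fixed while the number of groups grows with width, so the fan-in per output channel stays constant, analogous to keeping d_head fixed while scaling n_head.

```python
import torch.nn as nn

d_group = 16                  # fixed channels per group (analogous to a fixed d_head)
for n_groups in [4, 8, 16]:   # scale width by adding groups (analogous to scaling n_head)
    width = n_groups * d_group
    conv = nn.Conv2d(width, width, kernel_size=3, padding=1, groups=n_groups)
    # Weight shape is (width, d_group, 3, 3), so the fan-in per output channel
    # is d_group * 3 * 3, independent of width.
    out_ch, in_per_group, kh, kw = conv.weight.shape
    print(n_groups, width, in_per_group * kh * kw)
```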

xwjabc commented 2 years ago

Got it. Thank you for your answer!