Actually, I found in p. 47, Appendix J.2.1 - Attention, and p. 24, Attention Logit Scaling, that the scale should be sqrt(d_head,0) / d_head (for backward compatibility), where d_head,0 is the head dimension of the base model. Does that mean that if we fix d_head and scale n_head, we can simply use 1 / sqrt(d_head)? Or do we still follow the same recipe as in the provided Transformer example (scale d_head, only change 1/sqrt(d) to 1/d, and keep the other settings the same)? The same rule should also apply to grouped convolution if we fix the dim size per group and scale the number of groups. Thanks!
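For concreteness, here is a minimal sketch of the scales I am referring to (plain PyTorch, not the mup library's API; the names d_head0 for the base model's head dimension and the mup flag are made up for illustration):

```python
import math
import torch

def attn_logits(q, k, d_head, d_head0=None, mup=True):
    """Dot-product attention logits under different scalings.

    q, k: tensors of shape (batch, n_head, seq, d_head).
    Illustrative sketch only, not the mup package's API.
    """
    if not mup:
        # Standard parametrization: 1 / sqrt(d_head).
        scale = 1.0 / math.sqrt(d_head)
    elif d_head0 is not None:
        # Backward-compatible muP scale from the paper: sqrt(d_head,0) / d_head.
        # At the base width (d_head == d_head0) this equals 1 / sqrt(d_head),
        # so the base model matches the standard parametrization.
        scale = math.sqrt(d_head0) / d_head
    else:
        # Plain muP rule when d_head is the dimension being scaled: 1 / d_head.
        scale = 1.0 / d_head
    return torch.matmul(q, k.transpose(-2, -1)) * scale
```

Note that if d_head is held fixed and only n_head grows, then d_head == d_head0 at every width, so sqrt(d_head0) / d_head reduces to 1 / sqrt(d_head) anyway, which is why I suspect nothing needs to change in that case.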
Yes. If you follow the README here or the Transformer example, then it automatically scales n_head correctly, as well as d_head.
> Does it mean that if we fix d_head and scale n_head, we can simply use 1 / sqrt(d_head)?

Yes.

> The same rule should also apply to the grouped convolution if we fix dim size per group and scale number of groups.

We have not thought about grouped convolution before, but after looking at it, I think that's the case.
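To make the analogy concrete, here is a minimal sketch (plain PyTorch, nothing from this repo) showing that fixing the dim size per group while scaling the number of groups keeps the per-filter fan-in constant, just as fixing d_head while scaling n_head does for attention:

```python
import torch.nn as nn

def per_filter_fan_in(conv: nn.Conv2d) -> int:
    # In a grouped convolution each output channel only sees
    # in_channels // groups input channels, so its fan-in is that
    # count times the kernel area.
    kh, kw = conv.kernel_size
    return (conv.in_channels // conv.groups) * kh * kw

dim_per_group = 16          # held fixed
for n_groups in (2, 4, 8):  # scaled up
    conv = nn.Conv2d(
        in_channels=dim_per_group * n_groups,
        out_channels=dim_per_group * n_groups,
        kernel_size=3,
        groups=n_groups,
    )
    # Fan-in stays 16 * 3 * 3 = 144 no matter how many groups there are.
    print(n_groups, per_filter_fan_in(conv))
```

Since nothing an individual filter sees grows with the number of groups, I would not expect any group-specific rule beyond the usual muP ones, but take this as a sketch of the reasoning rather than something we have verified.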
Got it. Thank you for your answer!
Hi, in Appendix E.2 - Number of Attention Heads, there is a use case that fixes d_head (dimension size per head) and scales n_head (number of heads). Do we need to change anything when we use such multi-head attention with scaled n_head? Or do we still follow the same way as shown in the provided Transformer example (scale d_head, only change 1/sqrt(d) to 1/d, and keep other settings the same)? Similarly, when applying muP to grouped convolution, which keeps the dim size per group and scales the number of groups, is there any special rule we should follow? Thanks!