Yea, this is designed so that by default it uses the 1/sqrt(d) scaling, so that we are backward compatible, but you can feed in `attn_mult=8`, for example, or any other concrete number, to get the 1/d scaling. In general, `attn_mult` should be obtained by tuning, but 8 is a good starting guess.
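For concreteness, here is a minimal sketch of the two settings. It assumes `mutransformers` exposes `GPT2Config`/`GPT2LMHeadModel` like the Hugging Face classes it wraps; the model sizes are illustrative:

```python
# Minimal sketch, assuming mutransformers exposes GPT2Config/GPT2LMHeadModel
# like the HF classes it wraps; sizes here are illustrative, not recommended.
from mutransformers import GPT2Config, GPT2LMHeadModel

# Default: attn_mult falls back to sqrt(d), so the attention scaling
# reduces to the standard kq / sqrt(d) (backward compatible).
base_config = GPT2Config(n_embd=256, n_head=8)

# muP-style 1/d scaling: pass a concrete attn_mult. 8 is a good starting
# guess, but ideally attn_mult is a tuned hyperparameter.
mup_config = GPT2Config(n_embd=256, n_head=8, attn_mult=8.0)
model = GPT2LMHeadModel(mup_config)
```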
Feel free to open this back up if you have more questions.
Hi, thanks for your amazing work!
In the example of using mup in GPT-2 (https://github.com/microsoft/mutransformers/tree/main/mutransformers/models/gpt2), I notice that you changed the attention scores from `kq / sqrt(d)` to `kq * attn_mult / d`, where `attn_mult` is a newly added config option (https://github.com/microsoft/mutransformers/blob/main/mutransformers/models/gpt2/modeling_gpt2.py#L205). However, the default value of `attn_mult` is `sqrt(d)` (https://github.com/microsoft/mutransformers/blob/main/mutransformers/models/gpt2/configuration_gpt2.py#L199), which brings the attention scores back to `kq / sqrt(d)`.

So why do we need this `attn_mult`? How should I set its value? Thanks!
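For reference, the scaling change under discussion can be sketched as standalone code like this (a hypothetical illustration, not the repo's implementation):

```python
# Hypothetical sketch of the two attention scalings; not the repo's code.
import math
import torch

def attn_scores(q, k, attn_mult=None):
    d = q.size(-1)                       # per-head dimension
    scores = q @ k.transpose(-1, -2)     # raw kq
    if attn_mult is None:
        attn_mult = math.sqrt(d)         # default -> kq / sqrt(d)
    return scores * attn_mult / d        # muP form: kq * attn_mult / d
```

With the default, `attn_mult / d == 1 / sqrt(d)`, recovering the original scaling; with a fixed `attn_mult` such as 8, the scores instead scale as 1/d as the width grows, which is the muP behavior.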