microsoft / mup

maximal update parametrization (µP)
https://arxiv.org/abs/2203.03466
MIT License

How to use 'attn_mult' config #8

Closed JiayiFeng closed 2 years ago

JiayiFeng commented 2 years ago

Hi, thanks for your amazing work!

In the GPT-2 example of using mup (https://github.com/microsoft/mutransformers/tree/main/mutransformers/models/gpt2), I notice that you changed the attention scores from kq / sqrt(d) to kq * attn_mult / d, where attn_mult is a newly added config option (https://github.com/microsoft/mutransformers/blob/main/mutransformers/models/gpt2/modeling_gpt2.py#L205). However, the default value of attn_mult is sqrt(d) (https://github.com/microsoft/mutransformers/blob/main/mutransformers/models/gpt2/configuration_gpt2.py#L199), which brings the attention scores right back to kq / sqrt(d).

So why do we need this attn_mult? How should I set its value?

Thanks!

thegregyang commented 2 years ago

Yea, this is designed so that by default it uses the 1/sqrt(d) scaling and stays backward compatible, but you can feed in attn_mult=8, for example, or any other concrete number, to get the 1/d scaling. In general, attn_mult should be obtained by tuning, but 8 is a good starting guess.
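To make the two regimes concrete, here is a minimal numpy sketch (dimensions and values are hypothetical, not taken from the repo): with the default attn_mult = sqrt(d), the kq * attn_mult / d formula collapses to the standard 1/sqrt(d) scaling, while a fixed constant like attn_mult = 8 gives the µP-style 1/d scaling.

```python
import numpy as np

d = 64  # per-head dimension (hypothetical value)
rng = np.random.default_rng(0)
q = rng.standard_normal((8, d))  # toy query vectors
k = rng.standard_normal((8, d))  # toy key vectors

# Default: attn_mult = sqrt(d), so attn_mult / d == 1 / sqrt(d),
# i.e. the standard transformer scaling (backward compatible).
attn_mult_default = np.sqrt(d)
scores_default = (q @ k.T) * attn_mult_default / d
scores_standard = (q @ k.T) / np.sqrt(d)
assert np.allclose(scores_default, scores_standard)

# muP regime: attn_mult is a fixed tunable constant (e.g. 8),
# so the effective scaling is 1/d up to that constant.
attn_mult = 8.0
scores_mup = (q @ k.T) * attn_mult / d
```

The point of keeping attn_mult fixed while d grows is that the scores shrink like 1/d rather than 1/sqrt(d), which is what µP prescribes for width-scaling of attention logits.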

thegregyang commented 2 years ago

Feel free to open this back up if you have more questions.