microsoft / torchscale

Foundation Architecture for (M)LLMs
https://aka.ms/GeneralAI
MIT License

About the param `scale_base` #35

Closed — horizon94 closed this 10 months ago

horizon94 commented 1 year ago

Any advice on setting the value of `scale_base`? For example, I want to train a GPT model with a window size of 2048 and expect it to extend to 4096 or longer. Would the default value (512) impose too strong a long-distance penalty?

sunyt32 commented 1 year ago

For a window size of 2048, the penalty at the default value (512) is scale^4, since 2048 / 512 = 4. It won't be too strong: because the per-dimension scale is close to 1, writing it as 1 - delta gives (1 - delta)^4 ≈ 1 - 4 * delta, which is still a minor penalty. If you want to train with a longer window size such as 8k or 16k, it may be better to use scale_base = 1k or 2k.
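The relationship above can be sketched numerically. This is a minimal illustration, not the torchscale implementation: it assumes a single scalar decay factor `scale` close to 1 (the actual xPos code uses a vector of per-dimension scales), and the hypothetical helper `xpos_penalty` just shows how the attenuation grows with token distance relative to `scale_base`.

```python
def xpos_penalty(distance, scale_base=512, scale=0.99):
    """Illustrative attenuation for a query/key pair `distance` tokens apart.

    `scale` is a hypothetical scalar stand-in for xPos's per-dimension
    decay factors; the exponent distance / scale_base is what `scale_base`
    controls. Larger scale_base => smaller exponent => weaker penalty.
    """
    return scale ** (distance / scale_base)

# Default scale_base=512 at the 2048 window edge: exponent is 2048/512 = 4,
# i.e. the scale^4 penalty mentioned above.
print(xpos_penalty(2048))                    # scale ** 4

# First-order check: (1 - delta)^4 is close to 1 - 4*delta for small delta.
delta = 1 - 0.99
print((1 - delta) ** 4, 1 - 4 * delta)

# Raising scale_base to 2048 shrinks the exponent at distance 16384
# from 32 down to 8, keeping the penalty mild for long windows.
print(xpos_penalty(16384, scale_base=512))
print(xpos_penalty(16384, scale_base=2048))
```

The takeaway: `scale_base` sets the distance at which one full factor of `scale` is applied, so choosing it proportional to the training window keeps the total penalty at the window edge roughly constant.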