Closed (horizon94 closed this 10 months ago)
For window size 2048, the penalty at the maximum distance with the default value (512) is scale ^ 4, since 2048 / 512 = 4. That is not too strong: for a minor penalty, where scale = 1 - delta with small delta, (1 - delta) ^ 4 \approx 1 - 4 * delta. If you want to train with a longer window size such as 8k or 16k, it may be better to use scale_base=1k or 2k.
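The arithmetic above can be checked with a small sketch. It assumes the penalty at relative distance d is scale ** (d / scale_base), which is what the scale ^ 4 figure for a 2048 window and scale_base=512 implies. The function names and the sample value delta = 0.01 are hypothetical and chosen only for illustration; they are not taken from the library.

```python
def long_distance_penalty(scale, distance, scale_base=512):
    """Decay factor at a relative distance.

    Assumption: penalty = scale ** (distance / scale_base), with 0 < scale <= 1.
    """
    return scale ** (distance / scale_base)


def first_order_estimate(delta, distance, scale_base=512):
    """Linear approximation (1 - delta) ** k ~= 1 - k * delta for small delta."""
    return 1 - (distance / scale_base) * delta


if __name__ == "__main__":
    delta = 0.01  # hypothetical "minor penalty": scale = 1 - delta
    for window, base in [(2048, 512), (8192, 512), (8192, 2048), (16384, 2048)]:
        exact = long_distance_penalty(1 - delta, window, base)
        approx = first_order_estimate(delta, window, base)
        print(f"window={window:5d} scale_base={base:4d} "
              f"exponent={window / base:4.1f} exact={exact:.4f} approx={approx:.4f}")
```

Under this assumption, an 8k window with scale_base=2k has the same exponent (4) as a 2048 window with the default 512, which is in line with the suggestion to raise scale_base for longer windows.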
Any advice on setting the value of scale_base? For example, I want to train a GPT model with window size 2048 and expect it to extend to 4096 or longer. Would the default value (512) impose too strong a long-distance penalty?