Hi there, thanks for pointing out the lr scheduling difference with Megatron. The difference between Megatron's implementation and ours is whether the lr stays constant over the final steps.
Our implementation holds the lr constant (at min lr) during the last steps. It remains unclear whether this has any further effect; we will keep an eye on it in future experiments. Thanks again for the question.
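For illustration, here is a minimal sketch of the behavior described above (the function and parameter names are ours, not the repo's actual API): a cosine schedule where `min_lr = max_lr * final_lr_portion` and, once the decay window ends, the lr is held constant at `min_lr` instead of continuing to decay until the last step as Megatron's scheduler does.

```python
import math

def lr_at_step(step, warmup_steps, decay_steps, max_lr, final_lr_portion):
    """Hypothetical sketch of a cosine schedule with a constant tail.

    min_lr = max_lr * final_lr_portion. Cosine decay runs for
    `decay_steps` after warmup; any later step returns min_lr, i.e.
    the lr is constant for the remainder of training. Megatron's
    scheduler instead decays across the whole run, with no flat tail.
    """
    min_lr = max_lr * final_lr_portion
    if step < warmup_steps:
        # Linear warmup from 0 to max_lr.
        return max_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, decay_steps)
    if progress >= 1.0:
        # Past the decay window: hold the lr constant at min_lr.
        return min_lr
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (max_lr - min_lr) * cosine
```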
Thanks for your reply.
https://github.com/pjlab-sys4nlp/llama-moe/blob/main/smoe/trainer/llama_lr_scheduling.py#L125
Thanks for sharing the repo. I have a question about the lr: the scheduler here supports final_lr_portion, which seems to differ from Megatron's implementation. May I ask whether this is reasonable? Thanks. The lr decay portion may also be affected by final_lr_portion.