pjlab-sys4nlp / llama-moe

⛷️ LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training (EMNLP 2024)
https://arxiv.org/abs/2406.16554
Apache License 2.0

about cosine lr scheduler #53

Closed ftgreat closed 10 months ago

ftgreat commented 10 months ago

https://github.com/pjlab-sys4nlp/llama-moe/blob/main/smoe/trainer/llama_lr_scheduling.py#L125

Thanks for sharing the repo. I have a question about the lr: the scheduler here supports `final_lr_portion`, which seems to differ from Megatron's implementation. May I ask whether this is reasonable? Thanks. The lr decay portion may also be affected by `final_lr_portion`.


Spico197 commented 10 months ago

Hi there, thanks for pointing out the lr scheduling difference from Megatron. The difference between Megatron's implementation and ours is whether the lr stays constant over the last steps.

Our implementation keeps the lr constant (at the min lr) over the last steps. It remains unclear whether this has any further effects; we will keep an eye on it in future experiments. Thanks again for the question.
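For reference, here is a minimal sketch of the two behaviors being compared. This is illustrative only, not the actual code in `llama_lr_scheduling.py`; the function names, the clamping via `final_lr_portion`, and the Megatron-style interpolation to `min_lr` are assumptions based on the discussion above.

```python
import math


def cosine_lr_with_floor(step, max_lr, warmup_steps, total_steps, final_lr_portion):
    """Illustrative: cosine decay clamped at final_lr_portion * max_lr.
    Because of the clamp, the lr reaches the floor before training ends
    and then stays constant (min lr) for the remaining steps."""
    if step < warmup_steps:
        return max_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return max_lr * max(final_lr_portion, cosine)


def cosine_lr_megatron_style(step, max_lr, min_lr, warmup_steps, total_steps):
    """Illustrative: Megatron-style cosine decay that interpolates between
    max_lr and min_lr, reaching min_lr exactly at the final decay step."""
    if step < warmup_steps:
        return max_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (max_lr - min_lr) * cosine
```

With, say, `final_lr_portion = 0.1`, the clamped schedule flattens out once the cosine term drops below 0.1, whereas the Megatron-style schedule keeps decaying until the last step and only then reaches `min_lr`.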

ftgreat commented 10 months ago

Thanks for your reply.