pjlab-sys4nlp / llama-moe

⛷️ LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training (EMNLP 2024)
https://arxiv.org/abs/2406.16554

Why a new trainer instead of the original one? What does `llama_lr_scheduling_trainer` do, and why not use the original Trainer? #47

Closed linyubupa closed 10 months ago

linyubupa commented 10 months ago

As the title says.

Spico197 commented 10 months ago

Thanks for your interest in this project! Compared with the original Trainer, the new trainer adds the following:

  1. LLaMA-style lr scheduling: the cosine schedule in transformers does not support a minimum learning rate, so we extend it to decay to a configurable minimum instead of zero (see the first sketch after this list).
  2. More TensorBoard visualization items: to monitor gate statistics during training (gate load, importance, balance loss, etc.), we had to modify the original trainer (see the second sketch after this list).
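For reference, here is a minimal sketch of a cosine schedule with a minimum learning rate, built on `torch.optim.lr_scheduler.LambdaLR`. It is not the repository's exact implementation; the function name `get_cosine_schedule_with_min_lr` and the `min_lr_ratio` parameter are illustrative.

```python
import math
from functools import partial

import torch
from torch.optim.lr_scheduler import LambdaLR


def _cosine_with_min_lr_lambda(
    current_step: int,
    *,
    num_warmup_steps: int,
    num_training_steps: int,
    min_lr_ratio: float,
) -> float:
    # Linear warmup from 0 up to the peak learning rate.
    if current_step < num_warmup_steps:
        return current_step / max(1, num_warmup_steps)
    # Cosine decay, floored at min_lr_ratio instead of 0.
    progress = (current_step - num_warmup_steps) / max(
        1, num_training_steps - num_warmup_steps
    )
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return min_lr_ratio + (1.0 - min_lr_ratio) * cosine


def get_cosine_schedule_with_min_lr(
    optimizer: torch.optim.Optimizer,
    num_warmup_steps: int,
    num_training_steps: int,
    min_lr_ratio: float = 0.1,
) -> LambdaLR:
    """Cosine schedule that decays to min_lr_ratio * peak_lr instead of 0."""
    lr_lambda = partial(
        _cosine_with_min_lr_lambda,
        num_warmup_steps=num_warmup_steps,
        num_training_steps=num_training_steps,
        min_lr_ratio=min_lr_ratio,
    )
    return LambdaLR(optimizer, lr_lambda)
```

For example, with `min_lr_ratio=0.1` and a peak learning rate of 3e-4, the learning rate warms up linearly and then follows a cosine curve down to 3e-5 rather than 0.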
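And here is a sketch of how extra gate metrics could be pushed to TensorBoard from a `transformers.Trainer` subclass. It assumes the model's forward pass returns gate statistics alongside the loss; the attribute names `gate_load`, `gate_importance`, and `balance_loss` are placeholders, not the repository's actual field names.

```python
from transformers import Trainer


class GateMonitorTrainer(Trainer):
    """Hypothetical Trainer subclass that logs MoE gate metrics.

    Assumes the model output exposes gate statistics (names below are
    illustrative). `Trainer.log` forwards the dict to all configured
    reporting backends, including TensorBoard when report_to includes it.
    """

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        outputs = model(**inputs)
        loss = outputs.loss

        # Collect auxiliary gate statistics if the model exposes them.
        extra_logs = {}
        for name in ("gate_load", "gate_importance", "balance_loss"):
            value = getattr(outputs, name, None)
            if value is not None:
                extra_logs[name] = (
                    value.mean().item() if hasattr(value, "mean") else float(value)
                )

        # Log on the usual logging schedule to avoid flooding TensorBoard.
        if extra_logs and self.state.global_step % self.args.logging_steps == 0:
            self.log(extra_logs)

        return (loss, outputs) if return_outputs else loss
```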