pjlab-sys4nlp / llama-moe

⛷️ LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training (EMNLP 2024)
https://arxiv.org/abs/2406.16554

Why a new trainer instead of the original one? What does `llama_lr_scheduling_trainer` do, and why not use the original Trainer? #47

Closed linyubupa closed 10 months ago

linyubupa commented 10 months ago

As the title says.

Spico197 commented 10 months ago

Thanks for your interest in this project! Compared with the original Trainer, the new trainer adds the following:

  1. LLaMA-style lr scheduling: the cosine schedule in transformers does not support a minimum learning rate, so we extend it to decay to a configurable minimum instead of zero (see the first sketch after this list).
  2. More TensorBoard visualization items: to monitor gate statistics during training (gate load, importance, balance loss, etc.), we had to modify the original trainer (see the second sketch after this list).
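For reference, here is a minimal sketch of a cosine schedule with a minimum learning rate, built on `torch.optim.lr_scheduler.LambdaLR`. It is not the repository's exact implementation; the function name `get_cosine_schedule_with_min_lr` and the `min_lr_ratio` parameter are illustrative.

```python
import math
from functools import partial

import torch
from torch.optim.lr_scheduler import LambdaLR


def _cosine_with_min_lr_lambda(
    current_step: int,
    *,
    num_warmup_steps: int,
    num_training_steps: int,
    min_lr_ratio: float,
) -> float:
    # Linear warmup from 0 up to the peak learning rate.
    if current_step < num_warmup_steps:
        return current_step / max(1, num_warmup_steps)
    # Cosine decay, floored at min_lr_ratio instead of 0.
    progress = (current_step - num_warmup_steps) / max(
        1, num_training_steps - num_warmup_steps
    )
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return min_lr_ratio + (1.0 - min_lr_ratio) * cosine


def get_cosine_schedule_with_min_lr(
    optimizer: torch.optim.Optimizer,
    num_warmup_steps: int,
    num_training_steps: int,
    min_lr_ratio: float = 0.1,
) -> LambdaLR:
    """Cosine schedule that decays to min_lr_ratio * peak_lr instead of 0."""
    lr_lambda = partial(
        _cosine_with_min_lr_lambda,
        num_warmup_steps=num_warmup_steps,
        num_training_steps=num_training_steps,
        min_lr_ratio=min_lr_ratio,
    )
    return LambdaLR(optimizer, lr_lambda)
```

For example, with `min_lr_ratio=0.1` and a peak learning rate of 3e-4, the learning rate warms up linearly and then follows a cosine curve down to 3e-5 rather than 0.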
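And here is a sketch of how extra gate metrics could be pushed to TensorBoard from a `transformers.Trainer` subclass. It assumes the model's forward pass returns gate statistics alongside the loss; the attribute names `gate_load`, `gate_importance`, and `balance_loss` are placeholders, not the repository's actual field names.

```python
from transformers import Trainer


class GateMonitorTrainer(Trainer):
    """Hypothetical Trainer subclass that logs MoE gate metrics.

    Assumes the model output exposes gate statistics (names below are
    illustrative). `Trainer.log` forwards the dict to all configured
    reporting backends, including TensorBoard when report_to includes it.
    """

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        outputs = model(**inputs)
        loss = outputs.loss

        # Collect auxiliary gate statistics if the model exposes them.
        extra_logs = {}
        for name in ("gate_load", "gate_importance", "balance_loss"):
            value = getattr(outputs, name, None)
            if value is not None:
                extra_logs[name] = (
                    value.mean().item() if hasattr(value, "mean") else float(value)
                )

        # Log on the usual logging schedule to avoid flooding TensorBoard.
        if extra_logs and self.state.global_step % self.args.logging_steps == 0:
            self.log(extra_logs)

        return (loss, outputs) if return_outputs else loss
```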