huggingface / nanotron

Minimalistic large language model 3D-parallelism training
Apache License 2.0
1.14k stars, 107 forks

Added 1-sqrt function for cooldown phase #185

Closed: eliebak closed this 4 months ago

eliebak commented 4 months ago

Added a 1-sqrt decay function for the learning-rate cooldown phase. According to this paper, it can outperform the classical linear decay: https://huggingface.co/papers/2405.18392.
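For readers unfamiliar with the shape, here is a minimal sketch of a 1-sqrt cooldown next to linear decay. It is not the code from this change: the function names (`cooldown_factor`, `learning_rate`) and parameters (`decay_start`, `decay_steps`, `min_lr`, `max_lr`) are hypothetical, and interpolating between `max_lr` and `min_lr` is an assumption about how the factor would be applied.

```python
import math


def cooldown_factor(step, decay_start, decay_steps, style="1-sqrt"):
    """Return a multiplier in [0, 1] for the cooldown phase (hypothetical helper).

    Before `decay_start` the factor is 1.0; after `decay_start + decay_steps`
    it is 0.0. In between it follows the chosen decay shape.
    """
    if decay_steps <= 0:
        raise ValueError("decay_steps must be positive")
    # Fraction of the cooldown completed, clamped to [0, 1].
    progress = min(max((step - decay_start) / decay_steps, 0.0), 1.0)
    if style == "linear":
        return 1.0 - progress
    if style == "1-sqrt":
        return 1.0 - math.sqrt(progress)
    raise ValueError(f"unknown decay style: {style!r}")


def learning_rate(step, max_lr, min_lr, decay_start, decay_steps, style="1-sqrt"):
    """Interpolate between max_lr and min_lr using the cooldown factor (assumed usage)."""
    factor = cooldown_factor(step, decay_start, decay_steps, style)
    return min_lr + (max_lr - min_lr) * factor
```

A quarter of the way through the cooldown, the 1-sqrt factor is already 0.5 while the linear factor is still 0.75, so the 1-sqrt schedule lowers the learning rate faster early on and more gently toward the end.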

[Screenshot: SCR-20240527-lmpm]