huggingface / nanotron

Minimalistic large language model 3D-parallelism training
Apache License 2.0

[Feature] LAMB optimizer #85

Open xrsrke opened 7 months ago

xrsrke commented 7 months ago

Implement LAMB optimizer

Efficient training at a large scale is often hindered by batch size constraints. In particular, increasing the batch size may adversely affect model convergence. The LAMB optimizer [9] has been demonstrated to enable scaling BERT's training batch size to 64K without compromising accuracy. In the LLM setting, MegaScale's experiments find that LAMB can scale the batch size $4\times$ without accuracy loss. With interleaved pipeline parallelism, the original schedule contains $\frac{4}{v}\frac{p-1}{m}$ pipeline bubbles when training four steps with $1\times$ batch size [7], while training one step with $4\times$ batch size has $\frac{1}{v}\frac{p-1}{4m}$ pipeline bubbles (here $p$ is the number of pipeline stages, $v$ the number of interleaved model chunks per stage, and $m$ the number of microbatches, following [7]). Hence, MegaScale reduces the pipeline bubbles by 87.5% via the LAMB optimizer.
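
For reference, below is a minimal single-device sketch of the LAMB update rule: an Adam-style step rescaled per parameter tensor by the trust ratio $\|w\| / \|\text{update}\|$. This is illustrative only, not nanotron's optimizer API; the class name and hyperparameter defaults (`lr`, `betas`, `eps`, `weight_decay`) are assumptions following the LAMB paper and common PyTorch conventions.

```python
import torch
from torch.optim import Optimizer


class Lamb(Optimizer):
    """Minimal LAMB sketch: Adam update scaled by a layer-wise trust ratio."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-6, weight_decay=0.01):
        defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = closure() if closure is not None else None
        for group in self.param_groups:
            beta1, beta2 = group["betas"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                grad = p.grad
                state = self.state[p]
                if len(state) == 0:
                    state["step"] = 0
                    state["exp_avg"] = torch.zeros_like(p)
                    state["exp_avg_sq"] = torch.zeros_like(p)
                state["step"] += 1
                exp_avg, exp_avg_sq = state["exp_avg"], state["exp_avg_sq"]

                # Adam-style first and second moment estimates with bias correction
                exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
                exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
                bias_c1 = 1 - beta1 ** state["step"]
                bias_c2 = 1 - beta2 ** state["step"]
                update = (exp_avg / bias_c1) / ((exp_avg_sq / bias_c2).sqrt() + group["eps"])
                if group["weight_decay"] != 0:
                    update = update + group["weight_decay"] * p

                # Layer-wise trust ratio ||w|| / ||update||, falling back to 1 if either norm is 0
                w_norm = p.norm()
                u_norm = update.norm()
                trust_ratio = torch.where(
                    (w_norm > 0) & (u_norm > 0), w_norm / u_norm, torch.ones_like(w_norm)
                )
                p.add_(update, alpha=-group["lr"] * trust_ratio.item())
        return loss
```

In nanotron this would presumably be wired in wherever the Adam-based optimizer is currently constructed, with the usual ZeRO / 3D-parallel sharding of optimizer state applied on top; that integration is not shown here.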