kozistr / pytorch_optimizer

optimizer & lr scheduler & loss function collections in PyTorch
https://pytorch-optimizers.readthedocs.io/en/latest/
Apache License 2.0

Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training #173

Closed redknightlois closed 1 year ago

redknightlois commented 1 year ago

Title: Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training

Abstract: Given the massive cost of language model pre-training, a non-trivial improvement of the optimization algorithm would lead to a material reduction on the time and cost of training. Adam and its variants have been state-of-the-art for years, and more sophisticated second-order (Hessian-based) optimizers often incur too much per-step overhead. In this paper, we propose Sophia, Second-order Clipped Stochastic Optimization, a simple scalable second-order optimizer that uses a light-weight estimate of the diagonal Hessian as the pre-conditioner. The update is the moving average of the gradients divided by the moving average of the estimated Hessian, followed by element-wise clipping. The clipping controls the worst-case update size and tames the negative impact of non-convexity and rapid change of Hessian along the trajectory. Sophia only estimates the diagonal Hessian every handful of iterations, which has negligible average per-step time and memory overhead. On language modeling with GPT-2 models of sizes ranging from 125M to 770M, Sophia achieves a 2x speed-up compared with Adam in the number of steps, total compute, and wall-clock time. Theoretically, we show that Sophia adapts to the curvature in different components of the parameters, which can be highly heterogeneous for language modeling tasks. Our run-time bound does not depend on the condition number of the loss.

Explainer: https://twitter.com/tengyuma/status/1661412995430219786

Paper: https://arxiv.org/pdf/2305.14342.pdf

Gradient preconditioner: (image attachment)
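
For a quick picture of the clipped, Hessian-preconditioned update described in the abstract, here is a rough PyTorch sketch of a single parameter update. The function name, hyper-parameter names, and defaults here are illustrative only, not the paper's or the library's exact values:

```python
import torch


def sophia_style_step(param, grad, exp_avg, hessian_ema, hessian_est=None,
                      lr=1e-3, beta1=0.96, beta2=0.99, rho=0.04, eps=1e-12):
    """Apply one clipped, diagonal-Hessian-preconditioned update in place."""
    # EMA of the gradient (first moment).
    exp_avg.mul_(beta1).add_(grad, alpha=1.0 - beta1)

    # The diagonal Hessian estimate is only refreshed every handful of steps;
    # when a fresh estimate is available, fold it into its EMA.
    if hessian_est is not None:
        hessian_ema.mul_(beta2).add_(hessian_est, alpha=1.0 - beta2)

    # Preconditioned step, clipped element-wise to bound the worst-case update.
    update = (exp_avg / torch.clamp(rho * hessian_ema, min=eps)).clamp_(-1.0, 1.0)
    param.add_(update, alpha=-lr)
```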

kozistr commented 1 year ago

Thanks! I'll look into it!

kozistr commented 1 year ago

I just deployed v2.10.0 with the SophiaH optimizer.

Thanks for the suggestion!
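
For reference, a minimal usage sketch, assuming the standard pytorch_optimizer interface. Since SophiaH uses a Hutchinson estimate of the diagonal Hessian, the backward pass here builds a graph with `create_graph=True`; check the docs for the exact arguments and update hooks:

```python
import torch
from pytorch_optimizer import SophiaH

model = torch.nn.Linear(10, 2)
optimizer = SophiaH(model.parameters(), lr=1e-3)

x, y = torch.randn(8, 10), torch.randint(0, 2, (8,))
loss = torch.nn.functional.cross_entropy(model(x), y)

optimizer.zero_grad()
loss.backward(create_graph=True)  # graph needed for the Hessian estimate
optimizer.step()
```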