
A Survey of Large Language Models #45

Open nariaki3551 opened 1 year ago

nariaki3551 commented 1 year ago

Background

What is it?

How does it improve on prior work?

What is the key to the technique or method?

How was its effectiveness validated?

Are there any points of discussion?

Which papers should be read next?

nariaki3551 commented 11 months ago

The part of the survey about distributed training

Training. Due to the huge model size, it is very challenging to successfully train a capable LLM. Distributed training algorithms are needed to learn the network parameters of LLMs, in which various parallel strategies are often jointly utilized. To support distributed training, several optimization frameworks have been released to facilitate the implementation and deployment of parallel algorithms, such as DeepSpeed [65] and Megatron-LM [66–68]. Optimization tricks are also important for training stability and model performance, e.g., restart to overcome training loss spikes [56] and mixed precision training [69]. More recently, GPT-4 [46] proposes to develop special infrastructure and optimization methods that reliably predict the performance of large models with much smaller models.
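As a rough illustration of the DeepSpeed-style setup mentioned above, here is a minimal sketch of a data-parallel fp16 training loop. Everything concrete in it (the toy model, batch sizes, learning rate, ZeRO stage) is a placeholder chosen for the example rather than anything taken from the survey; the `deepspeed.initialize` / `engine.backward` / `engine.step` calls are standard DeepSpeed usage, and the script is meant to be started with the `deepspeed` launcher so that the parallel ranks are set up.

```python
# Minimal sketch (assumptions: toy model and config values are placeholders),
# combining ZeRO data parallelism and fp16 mixed precision via DeepSpeed [65].
import torch
import torch.nn as nn
import deepspeed

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))  # stand-in for an LLM

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "fp16": {"enabled": True},          # mixed precision training [69]
    "zero_optimization": {"stage": 2},  # shard optimizer states and gradients across ranks
}

# deepspeed.initialize wraps the model in a distributed engine; ranks and the
# process group are provided by the launcher (e.g. `deepspeed train.py`).
engine, optimizer, _, _ = deepspeed.initialize(model=model,
                                               model_parameters=model.parameters(),
                                               config=ds_config)

for step in range(10):  # dummy loop on random data, just to show the call pattern
    x = torch.randn(4, 1024, device=engine.device, dtype=torch.half)
    loss = engine(x).float().pow(2).mean()
    engine.backward(loss)  # handles loss scaling and gradient synchronization
    engine.step()
```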

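On the last point (predicting large-model performance from much smaller models), the GPT-4 report does not publish the actual method, so the following is only a hedged sketch of the common scaling-law approach: fit a power law to (compute, loss) points from small runs and extrapolate. All numbers below are made up for illustration.

```python
# Hedged illustration only: fit L(C) = a * C^(-b) + L_irreducible to small runs
# and extrapolate to a much larger compute budget. Data points are hypothetical.
import numpy as np
from scipy.optimize import curve_fit

# hypothetical (training compute in PF-days, final validation loss) from small runs
compute = np.array([0.1, 0.3, 1.0, 3.0, 10.0])
loss = np.array([3.9, 3.4, 3.0, 2.7, 2.45])

def power_law(c, a, b, irreducible):
    # functional form commonly used in scaling-law studies
    return a * c ** (-b) + irreducible

params, _ = curve_fit(power_law, compute, loss, p0=(3.0, 0.2, 1.5), maxfev=10000)
print("fitted a, b, irreducible loss:", params)
print("predicted loss at 1000 PF-days:", power_law(1000.0, *params))
```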

[65] J. Rasley, S. Rajbhandari, O. Ruwase, and Y. He, "DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters," in KDD, 2020, pp. 3505–3506.
[66] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro, "Megatron-LM: Training multi-billion parameter language models using model parallelism," CoRR, vol. abs/1909.08053, 2019.
[67] D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V. Korthikanti, D. Vainbrand, P. Kashinkunti, J. Bernauer, B. Catanzaro, A. Phanishayee, and M. Zaharia, "Efficient large-scale language model training on GPU clusters using Megatron-LM," in International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2021, St. Louis, Missouri, USA, November 14-19, 2021. ACM, 2021, p. 58.
[68] V. Korthikanti, J. Casper, S. Lym, L. McAfee, M. Andersch, M. Shoeybi, and B. Catanzaro, "Reducing activation recomputation in large transformer models," CoRR, vol. abs/2205.05198, 2022.
[151] "BMTrain: Efficient training for big models." [Online]. Available: https://github.com/OpenBMB/BMTrain