
A Survey of Large Language Models #45

Open nariaki3551 opened 1 year ago

nariaki3551 commented 1 year ago

Background

What is it?

How does it improve on prior work?

What is the key to the technique or method?

How was its effectiveness validated?

Are there any points of discussion?

Which papers should be read next?

nariaki3551 commented 11 months ago

The part of the survey about distributed training

Training. Due to the huge model size, it is very challenging to successfully train a capable LLM. Distributed training algorithms are needed to learn the network parameters of LLMs, in which various parallel strategies are often jointly utilized. To support distributed training, several optimization frameworks have been released to facilitate the implementation and deployment of parallel algorithms, such as DeepSpeed [65] and Megatron-LM [66–68]. Optimization tricks are also important for training stability and model performance, e.g., restart to overcome training loss spikes [56] and mixed precision training [69]. More recently, GPT-4 [46] proposes to develop special infrastructure and optimization methods that reliably predict the performance of large models with much smaller models.
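As a rough illustration of the DeepSpeed-style setup mentioned above, here is a minimal sketch of a data-parallel fp16 training loop. Everything concrete in it (the toy model, batch sizes, learning rate, ZeRO stage) is a placeholder chosen for the example rather than anything taken from the survey; the `deepspeed.initialize` / `engine.backward` / `engine.step` calls are standard DeepSpeed usage, and the script is meant to be started with the `deepspeed` launcher so that the parallel ranks are set up.

```python
# Minimal sketch (assumptions: toy model and config values are placeholders),
# combining ZeRO data parallelism and fp16 mixed precision via DeepSpeed [65].
import torch
import torch.nn as nn
import deepspeed

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))  # stand-in for an LLM

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "fp16": {"enabled": True},          # mixed precision training [69]
    "zero_optimization": {"stage": 2},  # shard optimizer states and gradients across ranks
}

# deepspeed.initialize wraps the model in a distributed engine; ranks and the
# process group are provided by the launcher (e.g. `deepspeed train.py`).
engine, optimizer, _, _ = deepspeed.initialize(model=model,
                                               model_parameters=model.parameters(),
                                               config=ds_config)

for step in range(10):  # dummy loop on random data, just to show the call pattern
    x = torch.randn(4, 1024, device=engine.device, dtype=torch.half)
    loss = engine(x).float().pow(2).mean()
    engine.backward(loss)  # handles loss scaling and gradient synchronization
    engine.step()
```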

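On the last point (predicting large-model performance from much smaller models), the GPT-4 report does not publish the actual method, so the following is only a hedged sketch of the common scaling-law approach: fit a power law to (compute, loss) points from small runs and extrapolate. All numbers below are made up for illustration.

```python
# Hedged illustration only: fit L(C) = a * C^(-b) + L_irreducible to small runs
# and extrapolate to a much larger compute budget. Data points are hypothetical.
import numpy as np
from scipy.optimize import curve_fit

# hypothetical (training compute in PF-days, final validation loss) from small runs
compute = np.array([0.1, 0.3, 1.0, 3.0, 10.0])
loss = np.array([3.9, 3.4, 3.0, 2.7, 2.45])

def power_law(c, a, b, irreducible):
    # functional form commonly used in scaling-law studies
    return a * c ** (-b) + irreducible

params, _ = curve_fit(power_law, compute, loss, p0=(3.0, 0.2, 1.5), maxfev=10000)
print("fitted a, b, irreducible loss:", params)
print("predicted loss at 1000 PF-days:", power_law(1000.0, *params))
```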

[65] J. Rasley, S. Rajbhandari, O. Ruwase, and Y. He, "DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters," in KDD, 2020, pp. 3505–3506.
[66] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro, "Megatron-LM: Training multi-billion parameter language models using model parallelism," CoRR, vol. abs/1909.08053, 2019.
[67] D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V. Korthikanti, D. Vainbrand, P. Kashinkunti, J. Bernauer, B. Catanzaro, A. Phanishayee, and M. Zaharia, "Efficient large-scale language model training on GPU clusters using Megatron-LM," in International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2021, St. Louis, Missouri, USA, November 14-19, 2021. ACM, 2021, p. 58.
[68] V. Korthikanti, J. Casper, S. Lym, L. McAfee, M. Andersch, M. Shoeybi, and B. Catanzaro, "Reducing activation recomputation in large transformer models," CoRR, vol. abs/2205.05198, 2022.
[151] "BMTrain: Efficient training for big models." [Online]. Available: https://github.com/OpenBMB/BMTrain