Here are my personal paper reading notes (covering cloud computing, resource management, systems, machine learning, deep learning, and other interesting topics).
Parameter servers (PS) are widely used in distributed DNN training, but their performance can be degraded by stragglers arising from, e.g., imbalanced parameter distribution, bandwidth contention, or computation interference.
Few existing studies have investigated efficient parameter (aka load) distribution among parameter servers (PS).
Solution
Propose a dynamic parameter server load distribution scheme called PSLD.
Mitigate PS straggler issues and accelerate distributed model training.
An exploitation-exploration method is used to 1) scale parameter servers in and out, and 2) adjust parameter distribution among PSs (see the sketch after this list).
Implemented on BytePS and vanilla MXNet PS architectures.
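To make the exploitation-exploration idea concrete, here is a minimal, hypothetical sketch of an epsilon-greedy rebalancing loop: it usually moves the heaviest parameter shard off the most loaded PS (exploitation) and occasionally tries a random move to probe alternative placements (exploration), plus a simple scale-in/out rule. All names, thresholds, and data structures (`rebalance_step`, `EPSILON`, etc.) are illustrative assumptions, not PSLD's actual algorithm or API.

```python
# Hypothetical exploitation-exploration sketch for PS load rebalancing.
# Not PSLD's implementation; names and thresholds are assumptions.
import random

EPSILON = 0.1               # probability of exploring a random reassignment
SCALE_OUT_THRESHOLD = 0.8   # add a PS if average load exceeds this
SCALE_IN_THRESHOLD = 0.3    # remove a PS if average load falls below this


def rebalance_step(ps_loads, shard_sizes, assignment):
    """One scheduling step: move one parameter shard between parameter servers.

    ps_loads:    {ps_id: current load fraction in [0, 1]}
    shard_sizes: {shard_id: size of that parameter shard}
    assignment:  {shard_id: ps_id}
    """
    busiest = max(ps_loads, key=ps_loads.get)
    idlest = min(ps_loads, key=ps_loads.get)

    shards_on_busiest = [s for s, p in assignment.items() if p == busiest]
    if not shards_on_busiest or busiest == idlest:
        return assignment

    if random.random() < EPSILON:
        # Explore: move a random shard to gather information about other placements.
        shard = random.choice(shards_on_busiest)
    else:
        # Exploit: move the largest shard off the most loaded (straggling) PS.
        shard = max(shards_on_busiest, key=shard_sizes.get)

    assignment[shard] = idlest
    return assignment


def scale_decision(ps_loads):
    """Decide whether to scale the PS pool in or out based on average load."""
    avg = sum(ps_loads.values()) / len(ps_loads)
    if avg > SCALE_OUT_THRESHOLD:
        return "scale_out"
    if avg < SCALE_IN_THRESHOLD and len(ps_loads) > 1:
        return "scale_in"
    return "keep"
```

In this sketch, the epsilon value controls how often the scheduler deviates from the greedy choice; in practice it would be tuned against the cost of migrating parameters between servers.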
Presented at SoCC '20. [ Paper ]
Authors: Yangrui Chen, Yanghua Peng, Yixin Bao, Chuan Wu, Yibo Zhu, Chuanxiong Guo (The University of Hong Kong; ByteDance Inc.)