Here are my personal paper reading notes (covering cloud computing, resource management, systems, machine learning, deep learning, and other interesting topics).
Parameter servers (PS) are widely used in distributed DNN training, but their performance can be degraded by stragglers arising from, e.g., imbalanced parameter distribution, bandwidth contention, or computation interference.
Few existing studies have investigated efficient parameter (aka load) distribution among parameter servers (PS).
Solution
Propose a dynamic parameter server load distribution scheme called PSLD.
Mitigate PS straggler issues and accelerate distributed model training.
An exploitation-exploration method is used to 1) scale parameter servers in and out, and 2) adjust parameter distribution among PSs (see the sketch after this list).
Implemented on BytePS and vanilla MXNet PS architectures.
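To make the exploitation-exploration idea concrete, here is a minimal, hypothetical sketch of an epsilon-greedy rebalancing loop: it usually moves the heaviest parameter shard off the most loaded PS (exploitation) and occasionally tries a random move to probe alternative placements (exploration), plus a simple scale-in/out rule. All names, thresholds, and data structures (`rebalance_step`, `EPSILON`, etc.) are illustrative assumptions, not PSLD's actual algorithm or API.

```python
# Hypothetical exploitation-exploration sketch for PS load rebalancing.
# Not PSLD's implementation; names and thresholds are assumptions.
import random

EPSILON = 0.1               # probability of exploring a random reassignment
SCALE_OUT_THRESHOLD = 0.8   # add a PS if average load exceeds this
SCALE_IN_THRESHOLD = 0.3    # remove a PS if average load falls below this


def rebalance_step(ps_loads, shard_sizes, assignment):
    """One scheduling step: move one parameter shard between parameter servers.

    ps_loads:    {ps_id: current load fraction in [0, 1]}
    shard_sizes: {shard_id: size of that parameter shard}
    assignment:  {shard_id: ps_id}
    """
    busiest = max(ps_loads, key=ps_loads.get)
    idlest = min(ps_loads, key=ps_loads.get)

    shards_on_busiest = [s for s, p in assignment.items() if p == busiest]
    if not shards_on_busiest or busiest == idlest:
        return assignment

    if random.random() < EPSILON:
        # Explore: move a random shard to gather information about other placements.
        shard = random.choice(shards_on_busiest)
    else:
        # Exploit: move the largest shard off the most loaded (straggling) PS.
        shard = max(shards_on_busiest, key=shard_sizes.get)

    assignment[shard] = idlest
    return assignment


def scale_decision(ps_loads):
    """Decide whether to scale the PS pool in or out based on average load."""
    avg = sum(ps_loads.values()) / len(ps_loads)
    if avg > SCALE_OUT_THRESHOLD:
        return "scale_out"
    if avg < SCALE_IN_THRESHOLD and len(ps_loads) > 1:
        return "scale_in"
    return "keep"
```

In this sketch, the epsilon value controls how often the scheduler deviates from the greedy choice; in practice it would be tuned against the cost of migrating parameters between servers.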
Presented at SoCC '20. [ Paper ]
Authors: Yangrui Chen, Yanghua Peng, Yixin Bao, Chuan Wu, Yibo Zhu, Chuanxiong Guo (The University of Hong Kong; ByteDance Inc.)