arXiv '20 | Memory-Efficient Pipeline-Parallel DNN Training

jasperzhong / read-papers-and-code

My paper/code reading notes in Chinese

44 stars 3 forks source link

Closed jasperzhong closed 3 years ago

jasperzhong commented 3 years ago

jasperzhong commented 3 years ago

基本思路都是gradient accumulation. 每次BW后不急着更新.

考虑了四个维度(w, d, b, r). 分别是pipeline的width(replicas数量), depth, batch size, r 是是否做activation checkpoint.

由于搜索空间不是很大. 所以他们使用了exhaustive search. 这个说实话很有用.

jasperzhong commented 3 years ago

温故而知新.

Notations:

m是时间维度的长度，d是空间维度的长度.

methods	#activations per device	batch size	#weight version	semantics	bubble time fraction
GPipe #168	m	B/m	1	x_{t+1} = x_t - \eta g_t	(d-1)/m
PipeDream #167	1 to d	B	d	x_{t+1} = xt - \eta g{t-d+1}	0 (steady)
PipeDream-2BW	1 to d	B/m	2	x_{t+1} = xt - \eta g{t-1}	0 (steady)
PipeDream-Flush	1 to d	B/m	1	x_{t+1} = x_t - \eta g_t	(d-1)/m
PipeDream-Flush (interleaved) #188	1 to d	B/m	1	x_{t+1} = x_t - \eta g_t	(d-1)/m/v

v是每个device分配的#stages.

PipeDeam这种1F1B的schedule，不同机器要存的activations还不一样. 越在pipeline前面的机器，要存的activations就越多. 第一台机器是存的最多的，要存d个，而最后一台机器存的最少，只要存1个.

jasperzhong commented 3 years ago

这个Planner也有点意思. 他们考虑了三种情况

他们考虑的模型是具有repetitive structures，相同的block重复若干次，比如BERT, GPT都是如此. 所以他们模型可以用一个二维tuple来表示(d, w)，d代表#partition stages, 而w代表#replica.

另外还需要考虑micro batch size (b)，以及做不做recomputation (r).

所以总共的decision variables有

他们的metrics是throughput和memory. 目标是在memory下，尽可能增大throughput. 由于搜索空间不大，他们进行了穷举. 算法如下.

这是搜索出来的一些结果.

第二个问题是cost model. 这确实是一个难点. 我觉得做的比较粗糙. 具体参考paper.

ENg-122 commented 10 months ago

请问下论文提到的algorithm1中的optimal degree of gradient accumulation是指最优的梯度累积步数吗？