jasperzhong / read-papers-and-code

My paper/code reading notes in Chinese

arXiv '20 | Memory-Efficient Pipeline-Parallel DNN Training #206

Closed jasperzhong closed 3 years ago

jasperzhong commented 3 years ago

https://arxiv.org/pdf/2006.09503.pdf

jasperzhong commented 3 years ago

memory-efficient pipedream

The basic idea is gradient accumulation: don't rush to update the weights after every backward pass.
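A minimal sketch of plain gradient accumulation (single device, no pipelining), just to make the idea concrete; the toy model, optimizer, and data below are my own placeholders, not anything from the paper.

```python
import torch
import torch.nn as nn

# Toy setup purely for illustration.
model = nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
micro_batches = [(torch.randn(8, 16), torch.randn(8, 1)) for _ in range(8)]

accum_steps = 4  # number of micro-batches per optimizer step

optimizer.zero_grad()
for i, (x, y) in enumerate(micro_batches):
    loss = loss_fn(model(x), y) / accum_steps  # scale so accumulated grads average out
    loss.backward()                            # gradients accumulate in param.grad
    if (i + 1) % accum_steps == 0:
        optimizer.step()                       # update only after accum_steps backward passes
        optimizer.zero_grad()
```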

[figures from the paper]

Planner

The planner considers four dimensions (w, d, b, r): the pipeline width w (number of replicas), the pipeline depth d, the micro-batch size b, and r, whether to do activation checkpointing.

[figure from the paper]

Since the search space is not very large, they simply use exhaustive search. Honestly, this is quite useful.
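A back-of-the-envelope enumeration of that search space, assuming (purely for illustration) 64 GPUs, power-of-two micro-batch sizes up to 64, and a binary recomputation flag; the point is just that the number of candidates is tiny.

```python
from itertools import product

num_gpus = 64
# (w, d) placements that use all GPUs: width (replicas) x depth (stages).
placements = [(w, d) for w, d in product(range(1, num_gpus + 1), repeat=2) if w * d == num_gpus]
batch_sizes = [2 ** k for k in range(7)]  # 1 .. 64
configs = [(w, d, b, r) for (w, d) in placements for b in batch_sizes for r in (False, True)]
print(len(placements), len(configs))  # 7 placements, 98 configurations in total
```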

jasperzhong commented 3 years ago

Reviewing the old to learn the new.

Notations:

m is the length of the time dimension (number of micro-batches per batch), and d is the length of the spatial dimension (pipeline depth).

| method | #activations per device | batch size | #weight versions | semantics | bubble time fraction |
| --- | --- | --- | --- | --- | --- |
| GPipe #168 | m | B/m | 1 | x_{t+1} = x_t - \eta g_t | (d-1)/m |
| PipeDream #167 | 1 to d | B | d | x_{t+1} = x_t - \eta g_{t-d+1} | 0 (steady state) |
| PipeDream-2BW | 1 to d | B/m | 2 | x_{t+1} = x_t - \eta g_{t-1} | 0 (steady state) |
| PipeDream-Flush | 1 to d | B/m | 1 | x_{t+1} = x_t - \eta g_t | (d-1)/m |
| PipeDream-Flush (interleaved) #188 | 1 to d | B/m | 1 | x_{t+1} = x_t - \eta g_t | (d-1)/(mv) |

v is the number of stages assigned to each device (for the interleaved schedule).
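To make the PipeDream-2BW row concrete, here is a tiny numeric sketch of the delayed update x_{t+1} = x_t - \eta g(x_{t-1}) with only two weight versions kept around; the quadratic loss is my own toy stand-in, not from the paper.

```python
def grad(x):          # gradient of the toy loss f(x) = x^2
    return 2 * x

eta = 0.1
x_prev = x_curr = 5.0 # bootstrap: the first gradient uses the initial weights
for _ in range(20):
    g = grad(x_prev)  # gradient computed on the one-step-stale weights x_{t-1}
    x_prev, x_curr = x_curr, x_curr - eta * g  # only two weight versions live at once
print(x_curr)         # converges toward 0 despite the staleness
```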

With PipeDream's 1F1B schedule, different machines store different numbers of activations: the earlier a machine is in the pipeline, the more activations it has to keep. The first machine stores the most, d sets of activations, while the last machine stores the least, just one.
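A quick way to see the "1 to d" activations column: with 1F1B, stage i (0-indexed) admits d - i micro-batches during warm-up before its first backward pass, so that is how many sets of activations it holds in steady state. The depth below is an arbitrary example.

```python
d = 4  # pipeline depth, arbitrary example
for stage in range(d):
    print(f"stage {stage}: holds activations for {d - stage} in-flight micro-batch(es)")
# stage 0 holds d sets, the last stage holds just 1 -- matching the note above.
```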

jasperzhong commented 3 years ago

The planner is also somewhat interesting. They consider three things.

The models they target have repetitive structures, i.e., the same block repeated many times, as in BERT and GPT. So a model placement can be represented by a 2-tuple (d, w), where d is the number of partitioned stages and w is the number of replicas.

In addition, they consider the micro-batch size (b) and whether to do recomputation (r).

So the full set of decision variables is (d, w, b, r).

Their metrics are throughput and memory. The goal is to maximize throughput subject to the memory constraint. Since the search space is small, they do an exhaustive search; the algorithm is shown below.

[figure: planner algorithm from the paper]
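A hedged sketch of the planner loop as I read it: enumerate (w, d, b, r), drop configurations that exceed the per-GPU memory budget, and keep the one with the highest estimated throughput. The throughput and memory estimators below are crude stand-ins of my own, not the paper's cost model (which profiles the actual model).

```python
from itertools import product

NUM_GPUS = 8
MEMORY_BUDGET = 16e9  # bytes per GPU, illustrative

def estimated_throughput(w, d, b, r):
    # Stand-in: replicas scale throughput, bigger micro-batches amortize overhead,
    # recomputation costs roughly an extra forward pass.
    return w * b / (b + 1) * (0.75 if r else 1.0)

def estimated_memory(w, d, b, r):
    # Stand-in: weights shard across d stages; activation memory grows with the
    # micro-batch size and shrinks when recomputation keeps only stage inputs.
    weights = 24e9 / d
    activations = (0.2e9 if r else 1.0e9) * b
    return weights + activations

best = None
for w, d in product(range(1, NUM_GPUS + 1), repeat=2):
    if w * d != NUM_GPUS:
        continue
    for b in (1, 2, 4, 8):
        for r in (False, True):
            if estimated_memory(w, d, b, r) > MEMORY_BUDGET:
                continue  # violates the memory constraint
            tp = estimated_throughput(w, d, b, r)
            if best is None or tp > best[0]:
                best = (tp, (w, d, b, r))
print(best)  # highest-throughput configuration that fits in memory
```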

Here are some of the configurations found by the search: [figure from the paper]

The second issue is the cost model, which is indeed a hard part. I think their treatment is rather crude; see the paper for details.

ENg-122 commented 10 months ago

Question: in Algorithm 1 of the paper, does the "optimal degree of gradient accumulation" refer to the optimal number of gradient accumulation steps?