A very clever idea. Previous work all assumed data parallelism; P3 introduces model parallelism.
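The difference from traditional data parallelism is that the first layer no longer needs to fetch features. Only partial activations need to be transferred, and that communication volume is far lower than transferring the features. The reason is
So the overall communication volume can drop by more than ten times.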
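To make the partial-activation idea concrete, here is a minimal sketch of it. It assumes the first layer is a plain linear transform and that the features are split along the feature dimension across machines. The function names and the sizes (`num_nodes`, `feat_dim`, `hidden_dim`, `num_machines`) are made up for illustration and are not P3's API or the paper's numbers.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def partial_activations(features, weights, num_machines):
    """Split the feature dimension across machines; machine p returns X_p @ W_p."""
    feat_dim = len(weights)
    bounds = [feat_dim * m // num_machines for m in range(num_machines + 1)]
    parts = []
    for lo, hi in zip(bounds, bounds[1:]):
        x_slice = [row[lo:hi] for row in features]  # columns lo..hi of X
        w_slice = weights[lo:hi]                     # matching rows of W
        parts.append(matmul(x_slice, w_slice))
    return parts

def sum_parts(parts):
    """Element-wise sum of the per-machine partial activations."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*parts)]

# Hypothetical sizes, chosen only to keep the example small.
num_nodes, feat_dim, hidden_dim, num_machines = 6, 64, 4, 4

random.seed(0)
X = [[random.randint(-3, 3) for _ in range(feat_dim)] for _ in range(num_nodes)]
W = [[random.randint(-3, 3) for _ in range(hidden_dim)] for _ in range(feat_dim)]

parts = partial_activations(X, W, num_machines)
full = matmul(X, W)

# Summing the partial activations reproduces the full first-layer output.
assert sum_parts(parts) == full

# Rough per-transfer sizes: a feature row has feat_dim values,
# a partial-activation row has only hidden_dim values.
feature_values = num_nodes * feat_dim
activation_values = num_nodes * hidden_dim
print(feature_values, activation_values)
```

The counts at the end are only a rough per-row comparison. The real saving depends on how many machines there are and how many neighbors are remote, which this sketch does not model.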
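In addition, P3 introduces pipelining to overlap communication with computation. This brings in some bounded staleness, but the experiments show that it does not hurt accuracy. The authors also designed a set of APIs.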
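As a toy illustration of why bounded staleness can be tolerable, the sketch below runs gradient descent on a 1-D quadratic using gradients that are a fixed number of steps old. This is not P3's pipeline, just a generic delayed-gradient example. The function `delayed_sgd` and its parameters are hypothetical.

```python
from collections import deque

def delayed_sgd(staleness, steps=200, lr=0.1, w0=1.0):
    """Minimize f(w) = 0.5 * w**2 with gradients that are `staleness` steps old."""
    w = w0
    # Gradients still "in flight"; pre-filled so the first updates are also stale.
    in_flight = deque([w0] * staleness)
    for _ in range(steps):
        in_flight.append(w)            # gradient of 0.5 * w**2 at the current w is w
        stale_grad = in_flight.popleft()
        w -= lr * stale_grad           # apply the old gradient to the current weights
    return w

fresh = delayed_sgd(staleness=0)   # ordinary gradient descent
stale = delayed_sgd(staleness=3)   # every gradient is 3 steps old
print(fresh, stale)
```

With a small learning rate and a small, bounded delay, both runs approach the minimum at 0. A larger delay or learning rate can make the delayed version diverge, which is why the bound matters.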
https://www.usenix.org/system/files/osdi21-gandhi.pdf