jasperzhong / read-papers-and-code

My paper/code reading notes in Chinese

OSDI '21 | P3: Distributed Deep Graph Learning at Scale #285

Closed jasperzhong closed 2 years ago

jasperzhong commented 2 years ago

https://www.usenix.org/system/files/osdi21-gandhi.pdf

jasperzhong commented 1 year ago

A very clever idea. Previous work all assumes data parallelism; P3 introduces model parallelism.

  1. First, the graph topology is partitioned with a random hash, and the features are partitioned along the feature dimension.
  2. Each partition then runs sampling for its training nodes. This involves communication, but shipping graph topology is cheap.
  3. Each partition ends up with the sampled subgraphs of all training nodes.
  4. The first layer (the one with the largest fan-out) computes partial activations from each partition's slice of the feature dimension. For GCN this is simply `Wh = [W1, W2] @ [h1; h2] = W1 @ h1 + W2 @ h2` (see the sketch after this list). GAT is trickier because it needs a global softmax.
  5. Each partial activation is sent to its owning partition, where the partials are aggregated (e.g., summed) and the remaining non-linear part of layer 1 is applied.
  6. Everything afterwards proceeds exactly as in ordinary data parallelism.
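
A minimal NumPy sketch of the partial-activation trick in step 4 for the GCN case. The sizes, the 2-machine split, and all variable names here are mine for illustration, not P3's API: each machine holds one slice of the feature dimension plus the matching rows of the layer-1 weight, computes a partial activation for the layer-1 target nodes, and the partials are summed before the non-linearity.

```python
import numpy as np

# Hypothetical sizes; real datasets differ.
num_targets, feat_dim, hidden_dim = 4, 8, 3

rng = np.random.default_rng(0)
H = rng.normal(size=(num_targets, feat_dim))   # layer-1 inputs for the target nodes
W = rng.normal(size=(feat_dim, hidden_dim))    # full layer-1 weight

# Split the feature dimension (and the matching rows of W) across 2 machines.
H1, H2 = H[:, :feat_dim // 2], H[:, feat_dim // 2:]
W1, W2 = W[:feat_dim // 2, :], W[feat_dim // 2:, :]

# Each machine computes a partial activation locally -- no feature fetching.
partial_1 = H1 @ W1          # computed on machine 1
partial_2 = H2 @ W2          # computed on machine 2

# Only the (num_targets x hidden_dim) partials are communicated, then summed
# and passed through the non-linearity on the owner partition.
Z = np.maximum(partial_1 + partial_2, 0.0)     # ReLU after aggregation

# Sanity check: identical to using the full features and weight.
assert np.allclose(Z, np.maximum(H @ W, 0.0))
```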

The difference from conventional data parallelism is that layer 1 no longer needs to fetch features at all; only partial activations are transmitted, and that traffic is far smaller than shipping features, because

  1. the partial activations are only for the layer-1 target nodes, not for their neighbors, which cuts traffic by more than 10x;
  2. the hidden dimension is usually smaller than the feature dimension.

So the overall communication volume drops by well over 10x.
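
A back-of-the-envelope check of that claim in Python. All numbers below are made up for illustration, and the per-machine fan-in of the all-to-all exchange is ignored; the point is only that per-target partial activations are much cheaper than per-neighbor features.

```python
# Illustrative numbers only -- not taken from the paper.
num_targets = 1_000      # layer-1 target nodes in a mini-batch
fanout      = 10         # sampled layer-1 neighbors per target
feat_dim    = 512        # raw feature width
hidden_dim  = 128        # layer-1 output width

# Data parallelism: fetch the raw features of every sampled layer-1 neighbor.
dp_bytes = num_targets * fanout * feat_dim * 4        # float32

# P3: ship partial activations only for the target nodes themselves.
p3_bytes = num_targets * hidden_dim * 4

print(f"traffic reduction: {dp_bytes / p3_bytes:.0f}x")   # 40x with these numbers
```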

Beyond that, P3 introduces pipelining to overlap communication with computation, which brings in some bounded staleness; the experiments show this does not hurt accuracy. It also comes with a dedicated API.
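
A toy sketch of the pipelining idea; the helpers below are hypothetical placeholders, not P3's API. While mini-batch i is being computed, the partial-activation exchange for mini-batch i+1 is already in flight, so the weights a mini-batch sees can lag by an update, which is the bounded staleness the paper accepts.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def exchange_partial_activations(batch_id):
    """Placeholder for the layer-1 all-to-all of one mini-batch."""
    time.sleep(0.01)
    return f"partials_for_batch_{batch_id}"

def compute_and_update(batch_id, partials):
    """Placeholder for the remaining layers, backward pass, and weight update."""
    time.sleep(0.02)

def train(batch_ids, pool):
    # Kick off communication for the first mini-batch.
    future = pool.submit(exchange_partial_activations, batch_ids[0])
    for i, batch_id in enumerate(batch_ids):
        partials = future.result()                 # comm for batch i is done
        if i + 1 < len(batch_ids):                 # prefetch comm for batch i+1
            future = pool.submit(exchange_partial_activations, batch_ids[i + 1])
        compute_and_update(batch_id, partials)     # overlaps with that comm

with ThreadPoolExecutor(max_workers=1) as pool:
    train(list(range(8)), pool)                    # dummy "mini-batches"
```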