jasperzhong / read-papers-and-code

My paper/code reading notes in Chinese

ICC workshop '20 | DistDGL: Distributed Graph Neural Network Training for Billion-Scale Graphs #283

jasperzhong opened this issue 2 years ago

jasperzhong commented 2 years ago

https://arxiv.org/pdf/2010.05337

jasperzhong commented 2 years ago

This is very different from the earlier notes on large-model training. The core issue in distributed GNN training is that the graph is huge and vertices have complex dependencies on one another, whereas in traditional training the samples are independent of each other. So the challenge here is how to split the graph and still generate mini-batches despite these dependencies.

The solution doesn't look very complicated. First, the trainer issues an RPC asking a sampler to do the sampling and return a sampled subgraph; the trainer then fetches the corresponding node features from the KV store that holds them, and runs ordinary data-parallel training. As shown in the figure below.

[figure: DistDGL architecture and workflow]
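In DGL terms, a single step of this loop looks roughly like the sketch below. This is my own sketch, not the paper's code: the graph name, config paths, feature name `'feat'`, feature dimension, seed nodes, and the toy loss are all placeholders, and the exact distributed API differs a bit across DGL versions.

```python
import dgl
import torch as th
from dgl.nn import SAGEConv

dgl.distributed.initialize('ip_config.txt')          # cluster description file (assumed path)
th.distributed.init_process_group(backend='gloo')    # for data-parallel gradient sync

# Handle to the partitioned graph; node/edge features live in the distributed KV store.
g = dgl.distributed.DistGraph('my_graph', part_config='data/my_graph.json')

model = th.nn.parallel.DistributedDataParallel(SAGEConv(100, 16, 'mean'))  # 100-dim feats assumed
opt = th.optim.Adam(model.parameters())

# Placeholder seed nodes; in practice these come from this trainer's share of the training set.
seeds = th.arange(1024)

# 1) the trainer asks a sampler (local or remote, over RPC) for a sampled subgraph
frontier = dgl.distributed.sample_neighbors(g, seeds, 10)
block = dgl.to_block(frontier, seeds)

# 2) pull the corresponding node features from the KV store
x = g.ndata['feat'][block.srcdata[dgl.NID]]

# 3) an ordinary data-parallel mini-batch step
opt.zero_grad()
loss = model(block, x).pow(2).mean()   # toy loss, just to close the loop
loss.backward()
opt.step()
```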

The graph is first partitioned into multiple subgraphs stored across the machines, and the vertex/edge features are partitioned along with it. Each machine runs a graph sampler responsible for sampling from the subgraph stored on that machine.

I see, so sampling is basically done locally.

Graph partitioning: the goal of the partitioning algorithm is to minimize the number of cross-partition edges. This is done once, ahead of time. The endpoint vertices of cross-partition edges are copied to both sides, so across the whole system every edge exists exactly once while a vertex may be replicated. The replicated vertices are called HALO vertices; the rest are core vertices. Below is an illustration.

[figure: a partitioned graph with core and HALO vertices]
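The offline step corresponds to something like `dgl.distributed.partition_graph`. A minimal sketch on a random toy graph (the graph name and output path are placeholders; `num_hops=1` is, as I understand it, what creates the 1-hop HALO replicas):

```python
import dgl

# Toy stand-in for the real billion-scale graph.
g = dgl.rand_graph(1000, 5000)

# One-time offline partitioning with METIS: minimizes cross-partition edges.
# balance_ntypes / balance_edges can additionally balance training vertices / edges per partition.
dgl.distributed.partition_graph(
    g, graph_name='my_graph', num_parts=4,
    out_path='data/', num_hops=1, part_method='metis')
```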

One problem with partitioning the graph is load balancing. They formulate it as a multi-constraint partitioning problem, but don't write down the formalization.
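My own reconstruction of what the multi-constraint formulation presumably looks like (not the paper's notation): with $g$ partitions, an assignment $\pi$, and per-vertex weight vectors $w(v)$ whose components count, e.g., training vertices and incident edges, the partitioner minimizes the edge cut subject to a balance constraint per weight component:

$$
\min_{\pi}\ \bigl|\{(u,v)\in E:\ \pi(u)\neq\pi(v)\}\bigr|
\quad\text{s.t.}\quad
\sum_{v:\,\pi(v)=i} w_k(v)\ \le\ \frac{1+\epsilon}{g}\sum_{v\in V} w_k(v)
\qquad \forall\, i\in\{1,\dots,g\},\ \forall\, k .
$$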

After the graph is partitioned, the vertex and edge features are partitioned along with it. However, the features of HALO vertices are not duplicated, so no vertex or edge features are duplicated anywhere.

Distributed KV-Store: internally it uses shared memory for IPC, but there is still cross-machine communication.
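From the trainer's point of view, the KV store is hidden behind distributed tensors. A sketch with assumed names (`'feat'`, `'my_graph'`, the 16-dim tensor `h`): indexing pulls rows from the KV store, through shared memory when they sit in the local partition and over the network otherwise.

```python
import dgl
import torch as th

dgl.distributed.initialize('ip_config.txt')
g = dgl.distributed.DistGraph('my_graph', part_config='data/my_graph.json')

# g.ndata['feat'] is backed by the distributed KV store: indexing it issues reads,
# served from shared memory for the local partition and via RPC for remote ones.
nids = th.tensor([0, 1, 2])
x = g.ndata['feat'][nids]

# Extra distributed tensors (e.g. cached embeddings) can be created in the same store.
h = dgl.distributed.DistTensor((g.num_nodes(), 16), th.float32, name='h')
h[nids] = th.zeros(3, 16)
```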

Distributed Sampler: the trainer requests sampling from the sampler over RPC. The sampler's sampling can overlap with the trainer's training, which is neat. This requires the RPC to be asynchronous.

Sampling only operates on core vertices.
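A sketch of the sampler path, loosely modeled on DGL's distributed GraphSAGE example as I remember it (fanouts, batch size, and the `'train_mask'` field are assumptions). Each `sample_neighbors` call becomes an RPC to the machines owning the seed vertices, and `DistDataLoader` runs the collate function asynchronously, so sampling for upcoming batches overlaps with training on the current one.

```python
import dgl
import torch as th

dgl.distributed.initialize('ip_config.txt')
g = dgl.distributed.DistGraph('my_graph', part_config='data/my_graph.json')

class NeighborSampler:
    def __init__(self, g, fanouts):
        self.g = g
        self.fanouts = fanouts

    def sample_blocks(self, seeds):
        seeds = th.as_tensor(seeds)
        blocks = []
        for fanout in self.fanouts:
            # One sampling RPC per layer, served by the sampler(s) owning these vertices.
            frontier = dgl.distributed.sample_neighbors(self.g, seeds, fanout)
            block = dgl.to_block(frontier, seeds)
            seeds = block.srcdata[dgl.NID]
            blocks.insert(0, block)
        return blocks

sampler = NeighborSampler(g, fanouts=[10, 25])

# Each trainer gets an even share of the training vertices, preferring local ones.
train_nid = dgl.distributed.node_split(
    g.ndata['train_mask'], g.get_partition_book(), force_even=True)

# The collate_fn is dispatched asynchronously, so sampling of future batches
# overlaps with training on the current batch.
dataloader = dgl.distributed.DistDataLoader(
    dataset=train_nid, batch_size=1024,
    collate_fn=sampler.sample_blocks, shuffle=True, drop_last=False)
```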

I do have a question here: it feels hard to learn anything across partitions, because locally the graph extends at most one hop beyond the partition (to a HALO vertex).

Mini-batch Trainer

OK, I think I get it now. But I don't quite understand why they don't just assign samples to machines in a balanced way ahead of time??

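For completeness, a sketch of the per-trainer loop, continuing from the `g` and `dataloader` objects defined in the sketches above (process group already initialized there; the `'feat'`/`'label'` names, feature dimension, and number of classes are placeholders). The even split from `node_split` is, as far as I can tell, precisely an up-front balanced assignment of training vertices to machines.

```python
import dgl
import torch as th
import torch.nn as nn
from dgl.nn import SAGEConv

class TwoLayerSAGE(nn.Module):
    def __init__(self, in_feats, hidden, n_classes):
        super().__init__()
        self.conv1 = SAGEConv(in_feats, hidden, 'mean')
        self.conv2 = SAGEConv(hidden, n_classes, 'mean')

    def forward(self, blocks, x):
        h = th.relu(self.conv1(blocks[0], x))
        return self.conv2(blocks[1], h)

model = th.nn.parallel.DistributedDataParallel(TwoLayerSAGE(100, 16, 10))
opt = th.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(10):
    for blocks in dataloader:                                  # `dataloader` from the sampler sketch
        x = g.ndata['feat'][blocks[0].srcdata[dgl.NID]]        # KV-store pull: input features
        y = g.ndata['label'][blocks[-1].dstdata[dgl.NID]].long()
        loss = nn.functional.cross_entropy(model(blocks, x), y)
        opt.zero_grad()
        loss.backward()                                        # DDP all-reduces gradients across trainers
        opt.step()
```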

jasperzhong commented 2 years ago


Linear scalability. [figure: scaling results]

Convergence is not affected. [figure: convergence curves]

They did an ablation study on METIS partitioning. It looks like load balancing really matters. [figure: partitioning ablation]

yzh119 commented 2 years ago

There is also a later v2 version: https://arxiv.org/pdf/2112.15345.pdf