Computer Architecture Letters (CAL) '22 | Characterizing and Understanding Distributed GNN Training on GPUs

jasperzhong commented 2 years ago

https://arxiv.org/pdf/2204.08150.pdf

jasperzhong commented 2 years ago

The paper profiles GNN training on multiple GPUs.

Hardware: 16 x CPU, 4 x T4 GPUs. No NVLink.
Software: PyG. NCCL. PyTorch 1.10. CUDA 11.3
Training parameters:
- mini-batch size = 1024 target nodes.
- 2 layer GNN (GCN, GraphSAGE, GAT) with hidden size = 256
- sampling neighbors [10, 25]
- each process's data loader uses 4 threads.
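
To get a feel for how much data one mini-batch can pull in under these settings, here is a rough upper bound on the sampled subgraph size. It's only a sketch: I'm assuming the fanouts [10, 25] are applied hop by hop outward from the target nodes, and the bound ignores duplicate nodes and nodes with fewer neighbors than the fanout, so real counts are smaller. The function name `max_sampled_nodes` is made up for illustration and is not from the paper or PyG.

```python
def max_sampled_nodes(batch_size, fanouts):
    """Upper bound on the number of nodes per hop for neighbor sampling.

    Assumes every node has at least `fanout` neighbors and that no two
    sampled nodes coincide, so this overestimates the real count.
    """
    per_hop = [batch_size]  # hop 0: the target nodes themselves
    for fanout in fanouts:
        # Each node from the previous hop contributes at most `fanout` neighbors.
        per_hop.append(per_hop[-1] * fanout)
    return per_hop


per_hop = max_sampled_nodes(1024, [10, 25])
print(per_hop)       # [1024, 10240, 256000]
print(sum(per_hop))  # 267264
```

So a single mini-batch of 1024 target nodes may need features for up to roughly 267K nodes, which is one way to see why sampling and data loading can dominate.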

Note that this looks like strong scaling. One sentence in the paper supports this:

As the workloads of one epoch, i.e., target nodes, are distributed equally to parallel workers, it’s expected that the execution time of each phase should decline when more GPUs are involved.
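
A minimal sketch of what that sentence implies for per-GPU work, assuming the epoch's target nodes are split equally and each GPU keeps the mini-batch size of 1024. The total of 1,000,000 target nodes is a hypothetical number I picked for illustration, not one from the paper.

```python
import math


def iterations_per_gpu(num_targets, num_gpus, batch_size=1024):
    """Mini-batches each GPU runs per epoch when the epoch's target
    nodes are split equally across GPUs (strong scaling)."""
    targets_per_gpu = math.ceil(num_targets / num_gpus)
    return math.ceil(targets_per_gpu / batch_size)


NUM_TARGETS = 1_000_000  # hypothetical epoch size, for illustration only
for num_gpus in (1, 2, 4):
    print(num_gpus, iterations_per_gpu(NUM_TARGETS, num_gpus))
```

Per-GPU iterations roughly halve each time the GPU count doubles, which is why each phase would ideally get faster with more GPUs.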

Datasets. The Paper dataset is the largest.

As for multi-GPU training, my guess is that each process keeps its own copy of the graph.

Results

They first summarize three observations, all of which are very familiar to us:

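The data loading phase is the most time-consuming.
The all-reduce phase takes longer as the number of GPUs increases (the reason is explained later).
The gap between the actual speedup and the ideal speedup widens as the number of GPUs increases.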

These conclusions are nothing special.

They then go into a deeper analysis.

First, note that computation time drops as the number of GPUs increases. This is expected: since they use strong scaling, the number of target nodes each GPU has to process shrinks as more GPUs are added.

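So why does the sampling time not drop as the number of GPUs increases? For the paper dataset in particular, why does it even go up? It should go down, because each process has fewer target nodes to sample. Their explanation is cache contention between the different sampling threads.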

That makes sense. It also suggests that even for single-GPU training, using a lot of sampling threads may backfire, i.e. the sampling time may actually increase.

sampling is memory-intensive.

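Why is the data loading time for the paper dataset with 2 GPUs actually longer than with 4 GPUs? This is easy to explain: the machine has 2 sockets, and two GPUs share one PCIe link to the CPU. This suggests that the PCIe link is already saturated when training with a single GPU, so a second GPU on the same link adds no bandwidth. With 4 GPUs the data loading time is roughly halved, because the other socket's PCIe link can be used as well.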
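
The PCIe argument can be written down as a toy model. This is a sketch under my own assumptions, not the paper's analysis: the total data per epoch is fixed (strong scaling), a PCIe link is the only bottleneck, GPUs fill up one link before the next is used, and the numbers (120 GB per epoch, 12 GB/s per link) are hypothetical.

```python
import math


def data_loading_time(total_gb, num_gpus, gpus_per_link=2, link_gb_per_s=12.0):
    """Toy estimate of per-epoch data loading time in seconds.

    Assumes a fixed amount of data per epoch and that each saturated
    PCIe link is shared by `gpus_per_link` GPUs on the same socket.
    """
    links_in_use = math.ceil(num_gpus / gpus_per_link)
    return total_gb / (links_in_use * link_gb_per_s)


TOTAL_GB = 120  # hypothetical data volume per epoch
for num_gpus in (1, 2, 4):
    print(num_gpus, data_loading_time(TOTAL_GB, num_gpus))
```

In this model 1 and 2 GPUs take the same time because they share one link, and 4 GPUs take half of it. It does not capture why the 2-GPU case would be slower than that, only why it is no faster.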
Why does the all-reduce time get longer?

This one is actually quite interesting. It is not that the all-reduce operation itself is expensive; the time comes from waiting.

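Each process does its own sampling, data loading, and forward-backward. The first two steps in particular, sampling and data loading, take a different amount of time in each process. The table shows that the processes already differ considerably in when they start GPU computation (i.e. the start of the forward-backward phase).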
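
A simplified way to see how that skew turns into "all-reduce time": treat the all-reduce as a blocking collective that cannot finish until the slowest process arrives. This is only a sketch that ignores overlap between communication and backward computation, and the arrival times and the 5 ms communication cost below are hypothetical, not measurements from the paper.

```python
def allreduce_times(ready_times_ms, comm_time_ms):
    """Time each process spends inside a blocking all-reduce.

    ready_times_ms[i] is when process i reaches the all-reduce call.
    The collective completes only after the slowest process joins,
    so processes that arrive early spend the difference waiting.
    """
    finish = max(ready_times_ms) + comm_time_ms
    return [finish - ready for ready in ready_times_ms]


# Hypothetical arrival times (ms) of 4 processes, 5 ms of real communication.
print(allreduce_times([100, 130, 115, 160], 5))  # [65, 35, 50, 5]
```

Only the last process to arrive sees the true 5 ms cost; the others are mostly charged for waiting, and more processes means more chances for one of them to be late.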

jasperzhong commented 2 years ago

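This paper improved my understanding quite a bit. Unfortunately, they do strong scaling. I'm not sure whether strong scaling is the common strategy for distributed GNN training; I need to look into that. If it is, this paper is a very valuable reference.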

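If it is not, and weak scaling is what people generally use, then the overhead of sampling and data loading will be even larger, and some of the paper's conclusions may change.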
