ATC '23 | Legion: Automatically Pushing the Envelope of Multi-GPU System for Billion-Scale GNN Training

jasperzhong commented 11 months ago

https://www.usenix.org/system/files/atc23-sun.pdf

https://www.usenix.org/system/files/atc23_slides_sun.pdf

interesting read!

jasperzhong commented 11 months ago

扫了一眼PPT，基本是把sampling + feature cache对于OOM的大图优化到极致了. interesting!

之前工作都是feature cache. 这个工作第一做了Topology Cache. great idea.

jasperzhong commented 9 months ago

identify之前SOTA的两个问题: 1) poor multi-GPU cache scalability 2) coarse-grained GPU memory management for graph topology.

之前的SOTA主要有三个: 1) GNNLab (EuroSys '22): 2) Quiver (2022) 3) PaGraph (SoCC '20)

关于第一点: cache scalability.

GNNLab和Quiver都是assume training vertices是globally shuffle among all GPUs. 对于GNNLab，它是replicate cach到各个GPU上，不同GPU的cache的内容是完全一样的，所以增加GPU数量，其实对于降低通信(PCIe transaction数量)没有帮助. 而Quiver对此做了一个优化，多GPU系统里面，部分GPU之间互相有NVLink连接，叫做一个NVLink clique，clique里面的GPU互相传输速度非常快. 所以Quiver就replicate cache across NVLink clique，然后在clique内部再partition cache among GPUs，这样就能拓展cache的实际容量到clique内部GPU的数量. 比如如果一个8 GPUs系统，有2个cliques，每个clique有4个GPUs，这样相当于cache扩大了四倍. 但超过clique内GPU的数量. 上面这个图，PCIe transaction基本可以认为和cache大小成反比.

而PaGraph是partition graph among GPUs, 然后会expand every partition with replicated vertices and edges to include all the k-hop neighbors for each train vertex in this partition. Each GPU only trains its own partition and synchronize its local gradient periodically to update the model. PaGraph提出了graph partition算法，设计了一个heurstic来让各个partition的compuatation比较balance. 但是就会导致大量的冗余. 不同GPU的cache虽然不是replicate的，但是cache内容有大量的冗余. 为了缓解这个问题，他们将PaGraph的partition算法换成了降低edge-cut的XtraPulp. 这时候不同partition expand k-hop neighbors后overlap降低了，PCIe transactions随着GPU数量增加可以实现一个线性的下降. 但问题是，不同GPU cache的hit rate方差比较大，这种graph partition算法没有照顾到load balance，导致load imbalance. 如下图所示，PaGraph的graph partition算法照顾到了load balance，所以cache hit rate方差很小，而PaGraph-Plus的cache hit rate差异很大. 为啥会这样呢？因为如果一味降低edge cut，可能导致some hub vertices with many neighbors分到了同一个partitions，partition的大小可能本身不一样，有的特别大，有的可能特别小. 给定统一的cache site，partition小的graph cache更容易hit，反之更难hit.

所以他们的解决思路很简单，就是PaGraph + Quiver. 设计了一个two-level的graph partition: 1) 对于不同的NVLink clique，用minimize edge-cut的graph partition算法，例如METIS，然后还是会expand k-hop neighbor. 2) clique内部，就用hash partition就行了....这也太straightforward了.

jasperzhong commented 9 months ago

关于2): 就是cache topology. 之前的做法要么全部放到GPU上（对于大图不现实），要么全部放到CPU pin memory上，用UVA方法去access(DGL-UVA, Quiver). UVA方法的问题在于，每次访问非常随机而且数据量很小，导致PCIe的thoughput太低了. 因为topology的access pattern基本也是服从long-tail分布，所以cache topology也很自然.

topology cache存的是hot vertices的topology，即所有邻居信息，用的是CSR format. feature cache就是一个普通的matrix, row代表node ID，每一行存一个feature vector.

具体做法和GNNLab的presampling一样，先做presampling: 之前每个NVLink分到了一组training vertices，然后clique内的training vertices suhffle一下，分到clique内的各个GPU上. NVLink clique内每个GPU能得到一个hotness table: topology hotness table H_T和 feature hotness table H_F. topology hotness是只要sample到source node的一个邻居，就+1. feature hotness是sample到这个source node本身就+1. 这两个hotness不一定一样.

但这里NVLink clique内的cache是不应该有冗余的，所以应该决定which vertex cache on which GPU，他们就是选择hotness更大的那个GPU去存.

jasperzhong commented 9 months ago

最后就是如何trade-off topology cache和feature cache. 这是一个优化问题.

parameter alpha, the memory ratio for topology cache.

minimize the total PCIe traffic

s.t. cache size <= given cache size B

total PCIe traffic = topology traffic + feature traffic. 这两部分其实都很好预测. 因为给定了alpha，可以算出topology cache size和feature cache size，给定了cache size，其实可以根据presampling的hotness算出来哪些vectices应该被cache，然后自然可以根据hotness来计算出uncache的部分的traffic.

search的流程就是brute force，改变alpha从0到1，间隔为0.01 ，算出对应的total traffic. 但这个可以并行去做，所以overhead还好.

jasperzhong / read-papers-and-code

ATC '23 | Legion: Automatically Pushing the Envelope of Multi-GPU System for Billion-Scale GNN Training #362