IEEE CAL '22 | Characterizing and Understanding HGNNs on GPUs

jasperzhong commented 11 months ago

https://arxiv.org/pdf/2208.04758.pdf

The paper is a benchmark and profiling study, with some fairly interesting findings.

jasperzhong commented 11 months ago

An HGNN runs in four stages:

Subgraph build: split a HG into multiple subgraphs (this is sampling, either per relation or via metapath walks)
Feature projection: map the different feature vectors into one shared latent vector space
Neighbor aggregation: aggregate neighbors within each subgraph
Semantic aggregation: aggregate each node's embeddings across relations
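
To make the four stages concrete, here is a minimal pure-Python sketch on a toy graph. The relation names, features, weights, and the use of a plain mean in both aggregation stages are my own illustrative assumptions; real models such as HAN or MAGNN use attention, and this is not the paper's code.

```python
# Toy heterogeneous graph with 3 target "paper" nodes and 2 relations.
# All names and numbers here are illustrative assumptions.

def matvec(W, x):
    """Dense projection: W (out_dim x in_dim) times x (in_dim)."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def mean(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

# Stage 1, subgraph build: one adjacency (dst -> list of src) per relation.
subgraphs = {
    "author-writes-paper": {0: [0, 1], 1: [1], 2: [0]},
    "paper-cites-paper": {0: [1], 1: [2], 2: [0, 1]},
}
src_type = {"author-writes-paper": "author", "paper-cites-paper": "paper"}

# Raw features: the dimension differs per node type (2 vs 3).
features = {
    "author": [[1.0, 0.0], [0.0, 1.0]],
    "paper": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
}

# Stage 2, feature projection: one weight matrix per type, shared output dim 2.
W = {
    "author": [[1.0, 0.0], [0.0, 1.0]],
    "paper": [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]],
}
projected = {t: [matvec(W[t], x) for x in xs] for t, xs in features.items()}

# Stage 3, neighbor aggregation: mean over neighbors, separately per subgraph.
per_relation = {}
for rel, adj in subgraphs.items():
    h = projected[src_type[rel]]
    per_relation[rel] = {dst: mean([h[s] for s in srcs]) for dst, srcs in adj.items()}

# Stage 4, semantic aggregation: combine the per-relation embeddings (plain mean).
embeddings = {
    dst: mean([per_relation[rel][dst] for rel in subgraphs])
    for dst in range(3)
}
print(embeddings)
```

Stage 3 is the only stage that touches the graph topology, which fits the later observation that it is where the sparse kernels show up.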

They benchmark three datasets (IMDB, ACM, DBLP) and three models (R-GCN, HAN, MAGNN). Profiling is done on a T4, mainly using the OpenHGNN and MAGNN code.

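None of these datasets is large, but they all come with node features, and the feature dimensions are not necessarily the same across node types; many of them look like one-hot embeddings...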

Because sampling runs on the CPU, the subgraph build step is ignored. The first figure shows that neighbor aggregation takes the most time, followed by feature projection; semantic aggregation takes very little (it is basically a sum, or one self-attention).

The second figure goes a step further and profiles which CUDA kernels account for the time in each stage.

DM-Type: dense matrix multiplication (GEMM)
TB-Type: topology based matrix operation kernel (SpMM, SDDMM)
EW-Type: element-wise compute kernel (unrolled_elementwise_kernel, vectorized_elementwise_kernel, reduce_kernel)
DR-Type: data rearrange kernel (concat), which involves a lot of data movement.
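
For reference, the two TB-Type operations can be sketched in plain Python over a CSR adjacency. This is only a semantic reference under my own toy inputs, not the CUDA kernels the paper profiles; the function names `spmm_csr` and `sddmm_csr` are hypothetical.

```python
def spmm_csr(indptr, indices, data, H):
    """SpMM: out[i] = sum_j A[i, j] * H[j], with A stored in CSR form."""
    dim = len(H[0])
    out = []
    for i in range(len(indptr) - 1):
        acc = [0.0] * dim
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]  # data-dependent row gather: the irregular access
            for d in range(dim):
                acc[d] += data[k] * H[j][d]
        out.append(acc)
    return out

def sddmm_csr(indptr, indices, Q, K):
    """SDDMM: score[k] = dot(Q[i], K[j]), computed only for stored edges (i, j)."""
    scores = []
    for i in range(len(indptr) - 1):
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            scores.append(sum(q * v for q, v in zip(Q[i], K[j])))
    return scores

# 3x3 adjacency with 4 edges: row 0 -> {1, 2}, row 1 -> {0}, row 2 -> {2}.
indptr, indices = [0, 2, 3, 4], [1, 2, 0, 2]
data = [1.0, 1.0, 1.0, 1.0]
H = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

agg = spmm_csr(indptr, indices, data, H)     # neighbor sum per row
scores = sddmm_csr(indptr, indices, H, H)    # one score per edge
print(agg, scores)
```

SpMM is the aggregation itself; SDDMM is what an attention-style model would use to score each edge before aggregating.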

For FP (feature projection), GEMM is of course the most time-consuming; for neighbor aggregation (NA), TB-Type kernels take a lot of time, and EW-Type kernels are also fairly expensive.

All of these kernels together can form a reduction-tree computation graph.

sgemm looks compute bound. SpMM, SDDMM, and the element-wise operations are all memory bound.
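
A back-of-the-envelope roofline check points the same way. The sketch below is my own estimate, not the paper's measurement: it assumes fp32, ideal reuse for GEMM, no cache reuse for the SpMM row gathers, made-up problem sizes, and approximate public T4 spec figures (about 8.1 TFLOPS fp32 and 320 GB/s).

```python
def gemm_intensity(m, n, k, bytes_per_elem=4):
    """FLOP/byte of C = A @ B, each matrix moved to or from memory once."""
    flops = 2 * m * n * k
    traffic = bytes_per_elem * (m * k + k * n + m * n)
    return flops / traffic

def spmm_intensity(rows, nnz, dim, bytes_per_elem=4, index_bytes=4):
    """FLOP/byte of CSR SpMM, assuming every gathered row misses the cache."""
    flops = 2 * nnz * dim
    traffic = (bytes_per_elem * (nnz + nnz * dim + rows * dim)
               + index_bytes * (nnz + rows + 1))
    return flops / traffic

def elementwise_add_intensity(bytes_per_elem=4):
    """One FLOP per element; two reads and one write."""
    return 1 / (3 * bytes_per_elem)

# Assumed T4 figures (approximate spec-sheet values, not measured).
PEAK_FLOPS = 8.1e12
PEAK_BW = 320e9
ridge = PEAK_FLOPS / PEAK_BW  # FLOP/byte needed to become compute bound

kernels = {
    "sgemm": gemm_intensity(m=4096, n=256, k=1024),
    "SpMM": spmm_intensity(rows=4096, nnz=40960, dim=64),
    "elementwise add": elementwise_add_intensity(),
}
for name, ai in kernels.items():
    bound = "compute" if ai > ridge else "memory"
    print(f"{name}: {ai:.2f} FLOP/byte -> {bound} bound (ridge {ridge:.1f})")
```

With these assumed sizes the GEMM lands well above the ridge point, while SpMM and the element-wise add fall far below it.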

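So the conclusions are:

The feature projection stage is mainly sgemm, and compute bound.
The neighbor aggregation stage is mainly SpMM, SDDMM, and element-wise operations; it is mostly memory bound, with an irregular memory access pattern (in Table 3, SpMMCsr has a very low L2 hit rate).
The semantic aggregation stage has little worth discussing; it takes hardly any time to begin with.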

Neighbor aggregation has inter-subgraph parallelism. The challenge is that different subgraphs may take different amounts of time.
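
A small scheduling sketch shows why uneven subgraph times limit the benefit. The per-subgraph times are hypothetical, and the greedy longest-first assignment to streams is my own stand-in for a scheduler, not something taken from the paper.

```python
import heapq

def makespan(times, num_streams):
    """Greedy longest-first assignment of independent tasks to streams."""
    loads = [0.0] * num_streams
    heapq.heapify(loads)
    for t in sorted(times, reverse=True):
        lightest = heapq.heappop(loads)       # stream that finishes earliest
        heapq.heappush(loads, lightest + t)
    return max(loads)

# Hypothetical neighbor-aggregation times (ms) for four subgraphs.
subgraph_ms = [8.0, 1.0, 1.0, 2.0]

sequential = sum(subgraph_ms)
parallel = makespan(subgraph_ms, num_streams=4)
speedup = sequential / parallel
print(sequential, parallel, speedup)
```

Even with one stream per subgraph, the largest subgraph sets the finish time, so the speedup here is 1.5x rather than the ideal 4x.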

They also find that as the metapath length increases, the sparsity of the resulting subgraph decreases. Interesting. But the metapath length is usually fixed, I think.

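A metapath-based adjacency matrix can actually be obtained by directly multiplying the per-relation adjacency matrices. In general it is probably better to focus on models like RGCN, RGAT, and HGT.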
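
Both points (building the metapath adjacency by multiplication, and longer metapaths giving denser subgraphs) can be checked on a toy author-paper graph. The graph and the helper names are illustrative assumptions; the product is a boolean matrix product, so an entry only records whether a metapath instance exists.

```python
def bool_matmul(A, B):
    """Boolean matrix product: out[i][j] = 1 if some k has A[i][k] and B[k][j]."""
    cols = list(zip(*B))
    return [[int(any(a and b for a, b in zip(row, col))) for col in cols]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def density(A):
    """Fraction of non-zero entries."""
    return sum(map(sum, A)) / (len(A) * len(A[0]))

# Toy author-paper relation (3 authors x 3 papers), purely illustrative.
A_ap = [
    [1, 0, 0],  # author 0 wrote paper 0
    [1, 1, 0],  # author 1 wrote papers 0 and 1
    [0, 1, 1],  # author 2 wrote papers 1 and 2
]
A_pa = transpose(A_ap)

apa = bool_matmul(A_ap, A_pa)   # metapath Author-Paper-Author
apapa = bool_matmul(apa, apa)   # metapath Author-Paper-Author-Paper-Author

print(density(apa), density(apapa))
```

Here APA fills 7 of 9 entries and APAPA fills all 9, so the longer metapath yields the denser subgraph.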

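The question now is: for large graphs, is it simply best to convert the HetG into a HomoG? There is a preprocessing cost, but it can be amortized, and the later sampling and neighbor aggregation would both be faster. This question is still not very clear to me; it would be good to end up with a guideline.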

jasperzhong commented 11 months ago

The key question: what is the essential difference between HGNN and GNN workloads?

The paper gives three fairly qualitative points:

heterogeneous input: different node/edge types may have different feature dims, so a feature projection phase is needed
multiple semantics: multiple message-passing passes are needed
two-stage aggregation: there is an extra semantic aggregation step

So it just adds relation parallelism. No particular insights.
