IPDPS '23 workshop | Optimizing Irregular Dense Operators of Heterogeneous GNN Models on GPU

This work fuses the multiple per-relation kernels of RGCN.

Motivation: type-specific kernels are launched separately, one per type, which results in many small kernels. The goal is to improve GPU utilization by fusing the type-specific kernels in RGCN/HGT models.

Two baselines:

high mem: expand the original weight tensor (|R| x D1 x D2) into a per-edge copy (|E| x D1 x D2), which is faster to process but consumes more memory.
low mem: sort the edges to group them by edge type, then process one edge type at a time in a for loop.
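
The two baselines can be sketched in NumPy as follows. This is only an illustration of the idea on the CPU, not the paper's GPU kernels: the function names are hypothetical, `h_src` is assumed to hold the source-node embedding of each edge (|E| x D1), `edge_type` the relation id of each edge, and `weight` the per-relation weights (|R| x D1 x D2).

```python
import numpy as np

def message_high_mem(h_src, edge_type, weight):
    """High-mem baseline: materialize one weight matrix per edge."""
    # Gather: (|R|, D1, D2) -> (|E|, D1, D2). This copy is the memory cost.
    w_per_edge = weight[edge_type]
    # One vector-matrix product per edge, done as a single batched op.
    return np.einsum("ed,edk->ek", h_src, w_per_edge)

def message_low_mem(h_src, edge_type, weight):
    """Low-mem baseline: sort edges by type, then loop over edge types."""
    num_rel, _, d_out = weight.shape
    # Stable sort groups edges of the same type into contiguous segments.
    order = np.argsort(edge_type, kind="stable")
    sorted_types = edge_type[order]
    # bounds[r]:bounds[r + 1] is the segment of edges with type r.
    bounds = np.searchsorted(sorted_types, np.arange(num_rel + 1))
    out = np.empty((h_src.shape[0], d_out), dtype=h_src.dtype)
    for r in range(num_rel):
        idx = order[bounds[r]:bounds[r + 1]]
        if idx.size:
            # One small matmul per relation (one small kernel on a GPU).
            out[idx] = h_src[idx] @ weight[r]
    return out
```

Both functions compute the same per-edge messages. The first trades |E| x D1 x D2 extra memory for a single batched operation, while the second keeps only the |R| x D1 x D2 weights but issues one small operation per relation.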

Which of the two operators is better varies across datasets. There is no one-size-fits-all operator.

Kernel-level optimizations: shared memory for node embeddings, L2 cache for the weight matrices, warp-level vector-matrix multiplication, and accumulation in GPU registers.

Experiments on small datasets (up to 5M edges) show up to 3x speedup for full-graph training and up to 2x for mini-batch.
