quiver-team/torch-quiver
PyTorch Library for Low-Latency, High-Throughput Graph Learning on GPUs.
https://torch-quiver.readthedocs.io/en/latest/
MAG240M distributed training #90
Closed by ZenoTan 2 years ago
ZenoTan commented 3 years ago
Description
Run the ogbn-mag240m dataset in Quiver (a loading sketch follows this list).
Optimize performance with adaptive data allocation.
Enable distributed feature communication.
Prepare the evaluation results for our paper.
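As a sketch for the first item above: the snippet below loads MAG240M through ogb and serves the paper features through quiver.Feature, so hot features live in GPU memory. The root path, cache size, and cache policy are illustrative assumptions, not values taken from this issue.

```python
# Minimal sketch: load MAG240M via ogb and serve paper features with Quiver.
# The root path, cache size, and cache policy below are illustrative only.
import torch
import quiver
from ogb.lsc import MAG240MDataset

dataset = MAG240MDataset(root="/data/mag240m")  # hypothetical root path

# Paper-citation graph as a CSR topology for Quiver's feature cache.
edge_index = torch.from_numpy(dataset.edge_index("paper", "cites", "paper"))
csr_topo = quiver.CSRTopo(edge_index)

# GPU-cached feature store: the hottest rows are kept on device,
# the rest are served from pinned host memory.
x = quiver.Feature(rank=0, device_list=[0],
                   device_cache_size="8G",          # illustrative size
                   cache_policy="device_replicate",
                   csr_topo=csr_topo)

# Note: MAG240M's full paper feature matrix is far too large to hold in
# host RAM on most machines; the allocation work in this issue addresses
# exactly that. Here we assume a host with enough memory.
x.from_cpu_tensor(torch.from_numpy(dataset.paper_feat[:]))
```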
Milestones
[x] NCCL primitives with RDMA support
[x] Feature exchange (see the exchange sketch after this list)
[x] Quiver's feature allocation algorithm
[x] Quiver's 240M example, following PyG's single-device training example (DGL omitted because it is similar)
[x] Quiver's 240M multi-GPU example (run on komodo)
[x] Quiver's 240M multi-node example (run on komodos)
[x] DGL and P3 multi-node baselines (simulation, will add full experiments on the cloud)
[x] Documentation for distributed design and APIs
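For the feature-exchange milestone, here is a minimal sketch of the communication pattern using torch.distributed's NCCL all-to-all. The helpers id_to_rank and local_offset are illustrative assumptions, and Quiver's internal implementation may differ; this only shows the request/reply structure.

```python
# Minimal sketch of distributed feature exchange over NCCL all-to-all.
# Illustrative only: `id_to_rank` maps a global node id to its owner rank,
# `local_offset` is the first global id stored on this rank, and
# `local_feat` is this rank's GPU-resident feature shard.
import torch
import torch.distributed as dist

def exchange_features(local_feat, global_ids, id_to_rank, local_offset):
    """Fetch feature rows for `global_ids`, bucketed by owner rank."""
    world_size = dist.get_world_size()

    # 1. Bucket the requested ids by the rank that owns them.
    owners = id_to_rank(global_ids)
    send_ids = [global_ids[owners == r].cuda() for r in range(world_size)]

    # 2. Tell every rank how many ids we will request from it.
    send_counts = torch.tensor([t.numel() for t in send_ids], device="cuda")
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts)

    # 3. Exchange the requested ids themselves.
    recv_ids = [torch.empty(int(n), dtype=torch.long, device="cuda")
                for n in recv_counts]
    dist.all_to_all(recv_ids, send_ids)

    # 4. Gather the locally owned rows and send each rank its reply.
    reply = [local_feat[ids - local_offset] for ids in recv_ids]
    out = [torch.empty(int(n), local_feat.size(1),
                       dtype=local_feat.dtype, device="cuda")
           for n in send_counts]
    dist.all_to_all(out, reply)

    # out[r] now holds the features for send_ids[r]; the caller scatters
    # them back into the order of the original `global_ids`.
    return send_ids, out
```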
Notes
The feature data is partitioned with a simple algorithm that we can improve (a partition sketch follows these notes)
To run experiments repeatedly, we cache the features on disk, though there is an option to generate them online
There may be a way to fully utilise NCCL's bidirectional bandwidth, especially when the number of nodes is not a power of 2
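As a concrete example of the "simple algorithm" in the first note, here is one degree-based heuristic: replicate the hottest nodes on every device and round-robin the rest. This is an illustrative baseline under the assumption that access frequency correlates with degree, not the allocation algorithm used in the paper.

```python
# Minimal sketch of a simple feature-partition heuristic. Assumes node
# "hotness" is approximated by out-degree; an illustrative baseline, not
# Quiver's actual allocation algorithm.
import torch

def partition_features(csr_indptr, num_devices, replicate_ratio=0.05):
    """Return (hot_ids, cold_parts): hot ids are replicated on every
    device, cold ids are split round-robin across devices."""
    degree = csr_indptr[1:] - csr_indptr[:-1]
    order = torch.argsort(degree, descending=True)

    num_hot = int(replicate_ratio * order.numel())
    hot_ids = order[:num_hot]        # cached on every device
    cold_ids = order[num_hot:]

    # Round-robin so each device sees a similar access load.
    cold_parts = [cold_ids[d::num_devices] for d in range(num_devices)]
    return hot_ids, cold_parts
```

One improvement this leaves on the table is weighting nodes by measured access frequency rather than raw degree.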
luomai commented 2 years ago
Is this PR ready to merge?
ZenoTan commented 2 years ago
> Is this PR ready to merge?
Not yet.