quiver-team/torch-quiver
PyTorch Library for Low-Latency, High-Throughput Graph Learning on GPUs.
https://torch-quiver.readthedocs.io/en/latest/
MAG240M distributed training #90
Closed by ZenoTan 2 years ago
ZenoTan commented 3 years ago
Description
Run the ogbn-mag240m dataset in Quiver (a loading sketch follows this list).
Optimize performance with adaptive data allocation.
Enable distributed feature communication.
Prepare the evaluation results for our paper.
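As a sketch for the first item above: the snippet below loads MAG240M through ogb and serves the paper features through quiver.Feature, so hot features live in GPU memory. The root path, cache size, and cache policy are illustrative assumptions, not values taken from this issue.

```python
# Minimal sketch: load MAG240M via ogb and serve paper features with Quiver.
# The root path, cache size, and cache policy below are illustrative only.
import torch
import quiver
from ogb.lsc import MAG240MDataset

dataset = MAG240MDataset(root="/data/mag240m")  # hypothetical root path

# Paper-citation graph as a CSR topology for Quiver's feature cache.
edge_index = torch.from_numpy(dataset.edge_index("paper", "cites", "paper"))
csr_topo = quiver.CSRTopo(edge_index)

# GPU-cached feature store: the hottest rows are kept on device,
# the rest are served from pinned host memory.
x = quiver.Feature(rank=0, device_list=[0],
                   device_cache_size="8G",          # illustrative size
                   cache_policy="device_replicate",
                   csr_topo=csr_topo)

# Note: MAG240M's full paper feature matrix is far too large to hold in
# host RAM on most machines; the allocation work in this issue addresses
# exactly that. Here we assume a host with enough memory.
x.from_cpu_tensor(torch.from_numpy(dataset.paper_feat[:]))
```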
Milestones
[x] NCCL primitives with RDMA support
[x] Feature exchange (see the exchange sketch after this list)
[x] Quiver's feature allocation algorithm
[x] Quiver's 240M example, following PyG's single-device training example (DGL omitted because it is similar)
[x] Quiver's 240M multi-GPU example (run on komodo)
[x] Quiver's 240M multi-node example (run on komodos)
[x] DGL and P3 multi-node baselines (simulation, will add full experiments on the cloud)
[x] Documentation for distributed design and APIs
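For the feature-exchange milestone, here is a minimal sketch of the communication pattern using torch.distributed's NCCL all-to-all. The helpers id_to_rank and local_offset are illustrative assumptions, and Quiver's internal implementation may differ; this only shows the request/reply structure.

```python
# Minimal sketch of distributed feature exchange over NCCL all-to-all.
# Illustrative only: `id_to_rank` maps a global node id to its owner rank,
# `local_offset` is the first global id stored on this rank, and
# `local_feat` is this rank's GPU-resident feature shard.
import torch
import torch.distributed as dist

def exchange_features(local_feat, global_ids, id_to_rank, local_offset):
    """Fetch feature rows for `global_ids`, bucketed by owner rank."""
    world_size = dist.get_world_size()

    # 1. Bucket the requested ids by the rank that owns them.
    owners = id_to_rank(global_ids)
    send_ids = [global_ids[owners == r].cuda() for r in range(world_size)]

    # 2. Tell every rank how many ids we will request from it.
    send_counts = torch.tensor([t.numel() for t in send_ids], device="cuda")
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts)

    # 3. Exchange the requested ids themselves.
    recv_ids = [torch.empty(int(n), dtype=torch.long, device="cuda")
                for n in recv_counts]
    dist.all_to_all(recv_ids, send_ids)

    # 4. Gather the locally owned rows and send each rank its reply.
    reply = [local_feat[ids - local_offset] for ids in recv_ids]
    out = [torch.empty(int(n), local_feat.size(1),
                       dtype=local_feat.dtype, device="cuda")
           for n in send_counts]
    dist.all_to_all(out, reply)

    # out[r] now holds the features for send_ids[r]; the caller scatters
    # them back into the order of the original `global_ids`.
    return send_ids, out
```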
Notes
The feature data is partitioned with a simple algorithm that we can improve (a partition sketch follows these notes)
To run experiments repeatedly, we cache the features on disk, though there is an option to generate them online
There may be a way to fully utilise NCCL's bidirectional bandwidth, especially when the number of nodes is not a power of 2
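As a concrete example of the "simple algorithm" in the first note, here is one degree-based heuristic: replicate the hottest nodes on every device and round-robin the rest. This is an illustrative baseline under the assumption that access frequency correlates with degree, not the allocation algorithm used in the paper.

```python
# Minimal sketch of a simple feature-partition heuristic. Assumes node
# "hotness" is approximated by out-degree; an illustrative baseline, not
# Quiver's actual allocation algorithm.
import torch

def partition_features(csr_indptr, num_devices, replicate_ratio=0.05):
    """Return (hot_ids, cold_parts): hot ids are replicated on every
    device, cold ids are split round-robin across devices."""
    degree = csr_indptr[1:] - csr_indptr[:-1]
    order = torch.argsort(degree, descending=True)

    num_hot = int(replicate_ratio * order.numel())
    hot_ids = order[:num_hot]        # cached on every device
    cold_ids = order[num_hot:]

    # Round-robin so each device sees a similar access load.
    cold_parts = [cold_ids[d::num_devices] for d in range(num_devices)]
    return hot_ids, cold_parts
```

One improvement this leaves on the table is weighting nodes by measured access frequency rather than raw degree.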
luomai commented 2 years ago
Is this PR ready to merge?
ZenoTan commented 2 years ago
> Is this PR ready to merge?
Not yet.