quiver-team / torch-quiver

PyTorch Library for Low-Latency, High-Throughput Graph Learning on GPUs.
https://torch-quiver.readthedocs.io/en/latest/
Apache License 2.0
293 stars 36 forks

How to run quiver on server with complex GPU topology? #135

Open JIESUN233 opened 2 years ago

JIESUN233 commented 2 years ago

Hi, I want to run quiver's p2p_clique_replicate cache policy on a single server with 4 A100 GPUs. The GPU topology is as follows:

            GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
    GPU0     X      NV12    PXB     PXB     0-25,52-77      0
    GPU1    NV12     X      PXB     PXB     0-25,52-77      0
    GPU2    PXB     PXB      X      NV12    0-25,52-77      0
    GPU3    PXB     PXB     NV12     X      0-25,52-77      0

There are NVLinks between GPUs 0 and 1 and between GPUs 2 and 3.
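To make the expected grouping concrete, here is a minimal sketch that derives the NVLink groups from the matrix above. It is plain Python and does not use quiver at all; the helper name `nvlink_groups` is hypothetical, and it computes connected components over NVLink edges, which happen to coincide with the cliques for this topology. It says nothing about how quiver itself detects cliques.

```python
# GPU-to-GPU columns of the topology matrix, copied from the post.
TOPO = """
GPU0 X NV12 PXB PXB
GPU1 NV12 X PXB PXB
GPU2 PXB PXB X NV12
GPU3 PXB PXB NV12 X
"""


def nvlink_groups(topo_text):
    """Group GPUs that are connected by NVLink (entries starting with 'NV').

    Hypothetical helper for illustration only: it returns connected
    components, which equal the cliques for the topology in this issue.
    """
    rows = [line.split() for line in topo_text.strip().splitlines()]
    n = len(rows)
    # rows[i][0] is the row label, so the link to GPU j is rows[i][j + 1].
    adjacency = {
        i: {j for j in range(n) if rows[i][j + 1].startswith("NV")}
        for i in range(n)
    }

    seen, groups = set(), []
    for start in range(n):
        if start in seen:
            continue
        # Depth-first walk over NVLink edges starting from this GPU.
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(adjacency[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


groups = nvlink_groups(TOPO)
print(groups)  # [[0, 1], [2, 3]]
```

Under this reading, the PXB links between the two pairs are PCIe paths rather than NVLink, which is why two groups are expected.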

According to the documentation, this topology contains two cliques (GPUs 0,1 and GPUs 2,3), so the cache should be replicated across the two cliques. However, the cache appears to be distributed over all 4 GPUs. Here is my code (dist_sampling_ogb_reddit_quiver.py, Reddit dataset, features about 500 MB):

    quiver.init_p2p(device_list=list(range(world_size)))
    quiver_feature = quiver.Feature(rank=0,
                                    device_list=list(range(world_size)),
                                    device_cache_size="0.1G",
                                    cache_policy="p2p_clique_replicate",
                                    csr_topo=csr_topo)

This is what I got:

    [0, 1, 2, 3]
    LOG>>> P2P Access Initilization
    Enable P2P Access Between 0 <---> 1
    Enable P2P Access Between 0 <---> 2
    Enable P2P Access Between 0 <---> 3
    Enable P2P Access Between 1 <---> 2
    Enable P2P Access Between 1 <---> 3
    Enable P2P Access Between 2 <---> 3
    WARNING: You are using p2p_clique_replicate mode, MAKE SURE you have called quiver.init_p2p() to enable p2p access
    LOG>>> 76% data cached
    LOG>>> GPU [0, 1, 2, 3] belong to the same NUMA Domain
    LOG >>> Memory Budge On 0 is 102 MB
    LOG >>> Memory Budge On 1 is 102 MB
    LOG >>> Memory Budge On 2 is 102 MB
    LOG >>> Memory Budge On 3 is 102 MB
    Let's use 4 GPUs!
    WARNING: You are using p2p_clique_replicate mode, MAKE SURE you have called quiver.init_p2p() to enable p2p access
    WARNING: You are using p2p_clique_replicate mode, MAKE SURE you have called quiver.init_p2p() to enable p2p access
    WARNING: You are using p2p_clique_replicate mode, MAKE SURE you have called quiver.init_p2p() to enable p2p access
    WARNING: You are using p2p_clique_replicate mode, MAKE SURE you have called quiver.init_p2p() to enable p2p access
    Epoch: 019, Epoch Time: 0.5197241902351379
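The "76% data cached" line is what suggests a single 4-GPU group. A rough back-of-the-envelope check, assuming the cached fraction is simply the total budget of one clique divided by the feature size (an assumption about the accounting, not something confirmed by quiver's source), and using the approximate 500 MB figure from the post:

```python
def cached_fraction(budget_mb_per_gpu, gpus_per_clique, feature_mb):
    """Fraction of the feature table that fits in one clique's combined cache.

    Assumes each clique partitions its share of the features across its
    GPUs, so the clique's capacity is the sum of the per-GPU budgets.
    """
    return min(1.0, budget_mb_per_gpu * gpus_per_clique / feature_mb)


BUDGET_MB = 102    # "Memory Budge On N is 102 MB" in the log
FEATURE_MB = 500   # approximate size stated in the post
OBSERVED = 0.76    # "76% data cached" in the log

# Hypothesis A: all 4 GPUs form one clique.
one_clique_of_4 = cached_fraction(BUDGET_MB, 4, FEATURE_MB)
# Hypothesis B: two cliques of 2 GPUs, each holding a replica.
two_cliques_of_2 = cached_fraction(BUDGET_MB, 2, FEATURE_MB)

print(f"one clique of 4:  {one_clique_of_4:.1%}")   # 81.6%
print(f"two cliques of 2: {two_cliques_of_2:.1%}")  # 40.8%

# Pick the clique size whose prediction is nearer to the observed value.
closest = min(
    (4, 2),
    key=lambda size: abs(cached_fraction(BUDGET_MB, size, FEATURE_MB) - OBSERVED),
)
print(closest)  # 4
```

With these rough numbers the observed 76% is much closer to the 4-GPU estimate than to the 2-GPU one, which is consistent with the cache being spread over all four GPUs. The gap between 81.6% and 76% is plausibly due to the feature size being only approximately 500 MB.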

So I wonder whether there is a way to get p2p_clique_replicate working on my 4-GPU server. Thanks~