Hi, I want to run Quiver's p2p_clique_replicate cache policy on a single server with 4 A100 GPUs. The GPU topology (output of nvidia-smi topo -m) is as follows:
        GPU0   GPU1   GPU2   GPU3   CPU Affinity   NUMA Affinity
GPU0    X      NV12   PXB    PXB    0-25,52-77     0
GPU1    NV12   X      PXB    PXB    0-25,52-77     0
GPU2    PXB    PXB    X      NV12   0-25,52-77     0
GPU3    PXB    PXB    NV12   X      0-25,52-77     0
There are NVLink connections between GPUs 0 and 1, and between GPUs 2 and 3.
According to the documentation, this should form two cliques (GPU 0,1 and GPU 2,3), and the cache should be replicated across the two cliques. But the cache seems to be distributed over all 4 GPUs instead.
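For context, here is a quick sanity check of which GPU pairs CUDA reports as P2P-capable (my own check, not part of the Quiver example; note that CUDA can report peer access over PCIe (PXB) pairs as well, so a True here does not necessarily mean an NVLink connection):

import torch

# Print CUDA peer-access capability for every ordered GPU pair.
# This may be True for PCIe-connected pairs too, not only NVLink pairs.
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            print(f"GPU {i} -> GPU {j}: {torch.cuda.can_device_access_peer(i, j)}")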
Here is my code (dist_sampling_ogb_reddit_quiver.py, Reddit dataset, ~500 MB feature tensor):
quiver.init_p2p(device_list=list(range(world_size)))
quiver_feature = quiver.Feature(rank=0, device_list=list(range(world_size)),
                                device_cache_size="0.1G",
                                cache_policy="p2p_clique_replicate",
                                csr_topo=csr_topo)
This is the output I got:
[0, 1, 2, 3]
LOG>>> P2P Access Initilization
Enable P2P Access Between 0 <---> 1
Enable P2P Access Between 0 <---> 2
Enable P2P Access Between 0 <---> 3
Enable P2P Access Between 1 <---> 2
Enable P2P Access Between 1 <---> 3
Enable P2P Access Between 2 <---> 3
WARNING: You are using p2p_clique_replicate mode, MAKE SURE you have called quiver.init_p2p() to enable p2p access
LOG>>> 76% data cached
LOG>>> GPU [0, 1, 2, 3] belong to the same NUMA Domain
LOG >>> Memory Budge On 0 is 102 MB
LOG >>> Memory Budge On 1 is 102 MB
LOG >>> Memory Budge On 2 is 102 MB
LOG >>> Memory Budge On 3 is 102 MB
Let's use 4 GPUs!
WARNING: You are using p2p_clique_replicate mode, MAKE SURE you have called quiver.init_p2p() to enable p2p access
WARNING: You are using p2p_clique_replicate mode, MAKE SURE you have called quiver.init_p2p() to enable p2p access
WARNING: You are using p2p_clique_replicate mode, MAKE SURE you have called quiver.init_p2p() to enable p2p access
WARNING: You are using p2p_clique_replicate mode, MAKE SURE you have called quiver.init_p2p() to enable p2p access
Epoch: 019, Epoch Time: 0.5197241902351379
Judging from the log, Quiver seems to treat all four GPUs as a single clique ("GPU [0, 1, 2, 3] belong to the same NUMA Domain"), and the "76% data cached" figure roughly matches the combined budget of all four GPUs (4 × 102 MB ≈ 408 MB) rather than a per-clique replica of about half that size. So the cache appears to be partitioned across the 4 GPUs instead of being replicated over the two NVLink cliques. Is there a way to make p2p_clique_replicate recognize the two cliques on my 4-GPU server?
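For reference, one thing I considered trying (I am not sure this is the intended usage; the device lists here are just my assumption) is to pass Quiver only the GPUs of a single NVLink clique, using the same calls as above, to see whether the clique detection and cache layout change:

# Assumption on my side: restrict Quiver to the first NVLink clique {0, 1}.
quiver.init_p2p(device_list=[0, 1])
quiver_feature = quiver.Feature(rank=0, device_list=[0, 1],
                                device_cache_size="0.1G",
                                cache_policy="p2p_clique_replicate",
                                csr_topo=csr_topo)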
Thanks~