Closed ADAM-CT closed 4 years ago
--network_bandwidth is actually a list, so you can pass in both the intra- and inter-server bandwidth with it.
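As a rough illustration of what "a list" means on the command line, here is a minimal argparse sketch. The parser below is hypothetical and is not taken from optimizer_graph_hierarchical.py; only the flag name is reused from this thread, and the bandwidth values are made up.

```python
import argparse

# Hypothetical parser, for illustration only; the real script may define
# this flag differently.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--network_bandwidth",
    type=float,
    nargs="+",  # accept one value per level of the hierarchy
    help="bandwidth per level, innermost level first",
)

# Two levels: within a server, then between servers (made-up numbers).
args = parser.parse_args(["--network_bandwidth", "100", "10"])
intra_server, inter_server = args.network_bandwidth
print(intra_server, inter_server)
```

With nargs="+", each extra value on the command line simply becomes one more level in the list.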
Thanks for your reply. I noticed that you defined the elements in this list as levels: level 1 is the bandwidth within a node and level 2 is the bandwidth between nodes. What does level 3 or higher mean?
You could imagine a topology where the bandwidth between some GPUs in a given server is higher than between others (think of two sets of 4 GPUs on an 8-GPU server).
Two levels are probably sufficient to model most commonly used topologies.
As you said: suppose I have two servers, each with 8 GPUs. On the first server, the bandwidth among one group of 4 GPUs is B1 and among the other 4 GPUs is B2. On the second server, the bandwidth among one group of 4 GPUs is B3 and among the other 4 is B4. The bandwidth between the two servers is B5. What should --bandwidth be?
In your example, we would assume that B1 = B2 [the bandwidth for the first level]. Then we assume that each group of 4 is connected by some bandwidth [the bandwidth for the second level]. Finally, the bandwidth between servers is B5 in your example [the bandwidth for the third level].
Thank you very much for your reply. I want to make sure that I understand correctly. eg1: assuming B1 = B2 and B3 = B4: --bandwidth B1 B3 B5; eg2: assuming B1 = B2 = B3 = B4: --bandwidth B1 B5
If you have a topology where GPUs within a group of 4 are connected with higher bandwidth than between groups of 4, then you want --bandwidth B1 B3 B5.
If all 8 GPUs within a server are connected with the same bandwidth, you want --bandwidth B1 B5.
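To make the level idea concrete, here is a small sketch of how a per-level bandwidth list could map onto pairs of GPUs. This is not the optimizer's actual code: the function level_bandwidth, the group_sizes encoding, and the GPU numbering are all assumptions made for illustration, and B1, B3, B5 are just the labels used in this thread.

```python
def level_bandwidth(gpu_a, gpu_b, group_sizes, bandwidths):
    """Return the bandwidth of the lowest level at which two GPUs share a group.

    Hypothetical encoding: group_sizes[i] is how many units of the level
    below fit into one unit at level i + 1, and bandwidths[i] is the
    bandwidth used inside a unit at level i + 1. GPUs are numbered
    consecutively, server by server and group by group.
    """
    if gpu_a == gpu_b:
        raise ValueError("expected two different GPUs")
    span = 1
    for size, bandwidth in zip(group_sizes, bandwidths):
        span *= size  # number of GPUs covered by one unit at this level
        if gpu_a // span == gpu_b // span:
            return bandwidth
    raise ValueError("GPUs are not connected at any listed level")


# Three levels: 4 GPUs per group, 2 groups per server, 2 servers.
three_level = ([4, 2, 2], ["B1", "B3", "B5"])
print(level_bandwidth(0, 3, *three_level))  # same group of 4
print(level_bandwidth(0, 5, *three_level))  # same server, different group
print(level_bandwidth(0, 8, *three_level))  # different servers

# Two levels: all 8 GPUs in a server share one bandwidth, 2 servers.
two_level = ([8, 2], ["B1", "B5"])
print(level_bandwidth(0, 7, *two_level))
print(level_bandwidth(0, 8, *two_level))
```

The list is ordered from the innermost level outward, which matches the order of the values in the two commands above.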
Thank you. My problem has been solved.
Great! Going to close this!
optimizer_graph_hierarchical.py: the script's --network_bandwidth parameter is the bandwidth within a machine. How is the bandwidth between machines taken into account?