imbue-ai / cluster-health


[technical discussion] About 3-layer compute network's congestion problem and nvswitch #1

Closed · shh2000 closed this issue 2 months ago

shh2000 commented 2 months ago

Hello,

This is commendable and practical work, and there are two technical aspects I'd like to discuss with you:

  1. You've implemented a three-tier non-blocking network for compute traffic (as distinct from a storage network such as NFS on a distributed storage system), opting for standard InfiniBand (IB) switches instead of custom RDMA over Converged Ethernet (RoCE) switches or Data Processing Units (DPUs). Have you encountered congestion due to hash collisions? If so, do you have any effective solutions for this issue? In my experience, a 3-layer fat tree (core-aggregation-edge) suffers far more severe hash-collision problems than a 2-layer fat tree (aggregation-edge, the so-called "spine-leaf") - see the toy sketch after this list.

  2. Here is a minor query: in your technical blog, you mentioned that PCIe connections were used between GPUs, e.g. "Interconnect, like PCIe\n Avoid the term NVSwitch". However, using PCIe alone can significantly reduce the algorithmic bandwidth achievable by NCCL collectives such as ring or tree all-reduce (the second sketch below puts rough numbers on this). Additionally, you mentioned performing checks and fixes on NV Fabric Manager in your diagnostic tools. I'm curious whether this implies that NVSwitch was enabled in your setup.
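To make the hash-collision concern in question 1 concrete, here is a toy model of ECMP-style static hashing (my own illustration, not tied to any particular switch): when a handful of elephant flows are hashed independently onto a small set of equal-cost uplinks, two of them landing on the same uplink is the common case, and a 3-tier topology adds one more hashing stage where this can happen.

```python
import random
from collections import Counter

def ecmp_collision_probability(num_flows=8, num_uplinks=8, trials=10_000, seed=0):
    """Toy model: each elephant flow is hashed independently onto one of
    `num_uplinks` equal-cost uplinks. Returns the fraction of trials in which
    at least two flows collide on the same uplink (an oversubscribed link)."""
    rng = random.Random(seed)
    collided = 0
    for _ in range(trials):
        picks = Counter(rng.randrange(num_uplinks) for _ in range(num_flows))
        if max(picks.values()) > 1:
            collided += 1
    return collided / trials

# With 8 flows and 8 uplinks the "birthday paradox" makes a collision the
# common case: 1 - 8!/8^8 is roughly 99.8%.
print(f"collision probability: {ecmp_collision_probability():.2%}")
```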
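And to put rough numbers on question 2: for ring all-reduce, each GPU moves roughly 2(N-1)/N times the buffer size over its slowest link, so the achievable bandwidth is capped by that link. The per-direction bandwidths below are nominal figures I am assuming for illustration, not measurements from this cluster.

```python
def ring_allreduce_seconds(buffer_gb: float, num_gpus: int, link_gb_per_s: float) -> float:
    """Bandwidth-only lower bound for ring all-reduce: each rank moves
    2*(N-1)/N of the buffer over its slowest inter-GPU link (latency and
    protocol overheads ignored)."""
    gb_moved = 2 * (num_gpus - 1) / num_gpus * buffer_gb
    return gb_moved / link_gb_per_s

buffer_gb, n = 1.0, 8
# Assumed nominal per-direction bandwidths (illustrative, not measured):
# PCIe 4.0 x16 ~ 25 GB/s effective; NVLink/NVSwitch on an A100 ~ 300 GB/s per GPU.
for name, bw in [("PCIe 4.0 x16", 25.0), ("NVSwitch (A100)", 300.0)]:
    ms = ring_allreduce_seconds(buffer_gb, n, bw) * 1e3
    print(f"{name:16s} {buffer_gb} GB all-reduce across {n} GPUs: ~{ms:.1f} ms")
```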

Looking forward to your insights on these points, and to future technical discussion. Thanks!

bawr commented 2 months ago
  1. We haven't encountered this problem even during our all-network IB stress test - not sure if we're comparing apples to apples here, though, since as far as I understand, IB subnet manager with fat tree adaptive routing doesn't cause the switches to use hashes for most of the routing decisions per packet?

  2. NVSwitch was enabled, and indeed was the primary way GPUs talked to one another / their IB NICs - the note comes from the fact that NVIDIA seems to use the term for at least three related things, so for internal discussions we preferred to be more explicit.
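For what it's worth, a quick way to confirm this on a node is to look at the GPU link topology and the Fabric Manager service. A minimal sketch, assuming `nvidia-smi` is on the PATH and Fabric Manager runs as the usual `nvidia-fabricmanager` systemd unit (names may differ by distribution):

```python
import subprocess

def show_gpu_topology():
    """Print the GPU-to-GPU link matrix: NVSwitch/NVLink paths show up as
    NV# entries, while PIX/PXB/NODE/SYS indicate PCIe-only paths."""
    out = subprocess.run(["nvidia-smi", "topo", "-m"],
                         capture_output=True, text=True)
    print(out.stdout or out.stderr)

def show_fabric_manager_status():
    """Check whether the NV Fabric Manager service is active (the service
    name below is the common default and may differ per distribution)."""
    out = subprocess.run(["systemctl", "is-active", "nvidia-fabricmanager"],
                         capture_output=True, text=True)
    print("fabric manager:", (out.stdout or out.stderr).strip())

if __name__ == "__main__":
    show_gpu_topology()
    show_fabric_manager_status()
```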

bawr commented 2 months ago

The one relevant source for large-scale RoCE deployments I can think of is the MegaScale paper, which might address some of your issues - given their number of GPUs, I think they've also deployed a 3-tier network, but I'm not sure.

shh2000 commented 2 months ago
> 1. We haven't encountered this problem even during our all-network IB stress test - not sure if we're comparing apples to apples here, though, since as far as I understand, IB subnet manager with fat tree adaptive routing doesn't cause the switches to use hashes for most of the routing decisions per packet?
> 2. NVSwitch was enabled, and indeed was the primary way GPUs talked to one another / their IB NICs - the note comes from the fact that NVIDIA seems to use the term for at least three related things, so for internal discussions we preferred to be more explicit.

Aha, I see - perhaps you're referring to the distinction between "NVSwitch" and the "NVLink network switch" (the so-called "NVLink Switch")? These terms do indeed tend to be confusing and easily mixed up.

For question 1, MegaScale claims to use 64-port 400G switches to build a non-oversubscribed cluster of about 12k GPUs. Unless there is some "multi-plane" design involved, like Alibaba's HPN 7.0, that should be a 3-tier network. MegaScale uses DCQCN and similar techniques to reduce the congestion problem. Perhaps handling congestion is an inherent capability of IB (already built into the IB protocol), whereas bare RoCE (plain RoCEv2) requires network-side intervention to optimize it (priority flow control, ECN tuning, and the like).
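For concreteness, here's a heavily simplified toy sketch of the DCQCN reaction-point logic from the original DCQCN paper - multiplicative rate cut when a CNP arrives, recovery toward a target rate otherwise. All constants are illustrative, not tuned values:

```python
class DcqcnSender:
    """Toy DCQCN reaction point (heavily simplified): multiplicative rate cut
    on CNP arrival, exponential moving average of the congestion estimate,
    and recovery toward the target rate otherwise."""

    def __init__(self, line_rate_gbps: float, g: float = 1 / 16,
                 additive_step_gbps: float = 5.0):
        self.rc = line_rate_gbps      # current sending rate
        self.rt = line_rate_gbps      # target rate to recover toward
        self.alpha = 1.0              # congestion estimate
        self.g = g
        self.step = additive_step_gbps
        self.line_rate = line_rate_gbps

    def on_cnp(self):
        """A switch marked ECN and a CNP came back: cut the rate."""
        self.rt = self.rc
        self.rc = self.rc * (1 - self.alpha / 2)
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_timer(self):
        """No CNP in this window: decay alpha and recover toward the target."""
        self.alpha = (1 - self.g) * self.alpha
        self.rt = min(self.line_rate, self.rt + self.step)
        self.rc = (self.rc + self.rt) / 2   # "fast recovery"-style averaging

# Example: one congestion event followed by a quiet period.
s = DcqcnSender(line_rate_gbps=400.0)
s.on_cnp()
print(f"after CNP: rate = {s.rc:.0f} Gbps")
for _ in range(10):
    s.on_timer()
print(f"after recovery: rate = {s.rc:.0f} Gbps")
```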

In fact, I use a RoCE solution when going beyond 2k GPUs (a 2-tier fat tree built from 64-port 400G IB switches maxes out at about 2k GPUs in a single cluster - quick capacity calculation below), which means I have no hands-on experience with 3-tier IB; the above is just my hypothesis.
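The 2k figure comes straight from the standard fat-tree capacity formula: with k-port switches, a 2-tier (leaf-spine) fat tree supports at most k^2/2 hosts and a full 3-tier fat tree k^3/4, so 64-port switches give roughly 2k endpoints at 2 tiers and 64k at 3 tiers (ignoring multi-plane or rail-optimized variants):

```python
def fat_tree_capacity(ports_per_switch: int, tiers: int) -> int:
    """Maximum non-oversubscribed host count of a classic fat tree built
    from k-port switches: k^2/2 at 2 tiers, k^3/4 at 3 tiers."""
    k = ports_per_switch
    if tiers == 2:
        return k * k // 2
    if tiers == 3:
        return k ** 3 // 4
    raise ValueError("only 2- and 3-tier fat trees handled here")

for tiers in (2, 3):
    print(f"{tiers}-tier fat tree with 64-port switches: "
          f"{fat_tree_capacity(64, tiers)} endpoints")
# -> 2048 endpoints at 2 tiers, 65536 at 3 tiers
```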

Thanks for the information!