In-network aggregation has been proposed as a promising way to accelerate this collective operation, and thus distributed training [2, 27, 31, 74, 57, 77, 76, 78]. In-network aggregation performs the “reduce” (i.e., sum) step of all-reduce in a network switch on the fly. This offers higher throughput and lower latency than a parameter server approach, where both the network link and the host-side network stack can become bottlenecks. Compared to ring-based and other distributed all-reduce algorithms, in-network aggregation requires exchanging fewer messages, again reducing latency and network usage.
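To make the in-switch "reduce" step concrete, the following sketch (illustrative only, not modeled on any specific system cited above) shows a switch-like aggregation point summing gradient chunks as they arrive from workers, so each worker sends its chunk once and receives one aggregated result:

```python
# Hypothetical sketch: the "reduce" (element-wise sum) step of all-reduce
# performed at an aggregation point in the network, rather than at a
# parameter server or via a ring of worker-to-worker exchanges.

def switch_aggregate(chunks):
    """Element-wise sum of the gradient chunks arriving from all workers."""
    result = [0.0] * len(chunks[0])
    for chunk in chunks:            # one pass per arriving worker packet
        for i, v in enumerate(chunk):
            result[i] += v          # in-switch accumulation, on the fly
    return result

# Each of three workers contributes one gradient chunk.
worker_chunks = [
    [1.0, 2.0, 3.0],
    [0.5, 0.5, 0.5],
    [2.0, 1.0, 0.0],
]

aggregated = switch_aggregate(worker_chunks)
print(aggregated)  # [3.5, 3.5, 3.5]
```

With n workers, each worker sends and receives one message through the aggregation point, rather than forwarding partial sums around a ring or funneling all traffic through a single server's network stack.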
[27] N. Gebara, P. Costa, and M. Ghobadi. In-network aggregation for shared machine learning clusters. In Proceedings of the 4th MLSys conference (MLSys’21), Virtual Event, Apr. 2021.
[57] B. Klenk, N. Jiang, G. Thorson, and L. Dennison. An in-network architecture for accelerating shared-memory multiprocessor collectives. In Proceedings of the 47th International Symposium on Computer Architecture (ISCA’20), Virtual Event, May 2020.