In data-parallel distributed ML, communication between workers is very intensive; the network becomes the bottleneck for training speed.
idea: use a programmable switch to aggregate model updates in-network.
streaming aggregation: use the limited registers on the switch to aggregate a large model (vector). The basic idea is to reuse the registers for different parts of the vector in a pipelined manner (e.g., aggregate v[0..4], send the result a[0..4] back, then aggregate v[5..9]). "SwitchML instead streams aggregation through the switch: it processes the aggregation function on a limited number of vector elements at once."
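To make the streaming idea concrete, here is a toy Python simulation of my own (not the SwitchML code): a small pool of aggregator slots is reused for successive k-element chunks of the gradient vector, so a bounded amount of switch memory can aggregate an arbitrarily long vector.

```python
# Toy simulation of streaming aggregation; sizes and names are illustrative.
NUM_WORKERS = 4
K = 4            # elements aggregated per packet
POOL_SIZE = 2    # number of aggregator slots on the "switch"

def stream_aggregate(vectors):
    """vectors: one gradient vector per worker, all the same length."""
    length = len(vectors[0])
    result = [0] * length
    # Each chunk of K elements maps to slot (chunk_idx % POOL_SIZE), so the
    # same registers are reused for later parts of the vector.
    for chunk_start in range(0, length, K):
        slot = (chunk_start // K) % POOL_SIZE
        aggregator = [0] * K                      # the slot's registers, cleared
        for w in range(NUM_WORKERS):
            chunk = vectors[w][chunk_start:chunk_start + K]
            for i, v in enumerate(chunk):
                aggregator[i] += v
        # All workers contributed: write the aggregate back ("broadcast" it)
        # and the slot becomes free for the next chunk.
        result[chunk_start:chunk_start + K] = aggregator[:length - chunk_start]
        print(f"slot {slot}: aggregated elements [{chunk_start}..{min(chunk_start + K, length) - 1}]")
    return result

if __name__ == "__main__":
    vecs = [[w + 1] * 10 for w in range(NUM_WORKERS)]   # toy gradients
    print(stream_aggregate(vecs))                        # every element sums to 10
```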
switch side (P4):
use this packet format: https://switchml.readthedocs.io/en/latest/readmes/p4.html#switchml-packet-formats
With UDP, SwitchML packets carry a dedicated header between UDP and the payload. A range of UDP ports [0xBEE0, 0xBEEF] is used as destination/source ports in packets received/sent by the switch. Currently the supported payload is either 256B or 1024B (using recirculation). This is the overall packet format:
Ethernet | IPv4 | UDP | SwitchML | Payload | Ethernet FCS
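A rough sketch of what building such a packet could look like on a worker. The header fields and widths below are assumptions made for illustration only; the authoritative layout is in the linked docs.

```python
# Illustrative only: build a UDP payload that mimics "SwitchML header + integer
# payload". The pool-index/count header here is a guess, not the real format.
import socket
import struct

SWITCHML_DST_PORT = 0xBEE0   # within the [0xBEE0, 0xBEEF] range noted above

def build_packet(pool_index: int, values: list) -> bytes:
    # Hypothetical header: 16-bit pool index + 16-bit element count, followed
    # by the payload of big-endian 32-bit integers (64 ints = 256 B).
    header = struct.pack("!HH", pool_index, len(values))
    payload = struct.pack(f"!{len(values)}i", *values)
    return header + payload

def send_packet(pkt: bytes, switch_ip: str = "10.0.0.1"):
    # Plain UDP socket; the SwitchML header sits between UDP and the payload.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.sendto(pkt, (switch_ip, SWITCHML_DST_PORT))

pkt = build_packet(pool_index=3, values=list(range(64)))   # 64 * 4 B = 256 B payload
```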
Switch Controller:
program the switch at runtime: configure the speed, MAC address, etc. of each port.
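Conceptually the controller does something like the sketch below (class and method names are hypothetical; the real controller talks to the running switch driver, e.g. over a runtime gRPC interface, without reloading the P4 program):

```python
# Hedged sketch of the controller's port-configuration role, not the real code.
from dataclasses import dataclass

@dataclass
class PortConfig:
    port: int          # front-panel port number
    speed_gbps: int    # e.g. 100
    mac: str           # MAC address used for the worker attached to this port

class SwitchController:
    def __init__(self):
        self.ports = {}

    def add_port(self, cfg: PortConfig):
        # In the real controller this issues runtime updates on the switch.
        self.ports[cfg.port] = cfg
        print(f"port {cfg.port}: {cfg.speed_gbps}G, mac {cfg.mac}")

ctrl = SwitchController()
ctrl.add_port(PortConfig(port=1, speed_gbps=100, mac="02:00:00:00:00:01"))
```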
integrate SwitchML into the DNN software stack. e.g., for PyTorch: take advantage of PyTorch's Gloo backend and customize it so that it uses SwitchML instead of Gloo for the operations and data types that SwitchML supports. The patch applies its modifications on top of stock PyTorch.
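The dispatch idea, roughly (a sketch, not the actual patch; `switchml_allreduce` and the supported-dtype set are stand-ins):

```python
# Route allreduce to a SwitchML-capable path only for the ops/dtypes it
# supports; otherwise fall back to the stock Gloo backend. Assumes
# dist.init_process_group(backend="gloo") was called elsewhere.
import torch
import torch.distributed as dist

SUPPORTED_DTYPES = {torch.float32, torch.int32}   # assumption for this sketch

def switchml_allreduce(tensor: torch.Tensor) -> None:
    # Placeholder for the SwitchML client library call (in-network sum).
    raise NotImplementedError

def allreduce(tensor: torch.Tensor, op=dist.ReduceOp.SUM) -> None:
    if op == dist.ReduceOp.SUM and tensor.dtype in SUPPORTED_DTYPES:
        try:
            switchml_allreduce(tensor)
            return
        except NotImplementedError:
            pass                      # no switch available; fall back
    dist.all_reduce(tensor, op=op)    # regular Gloo allreduce
```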
ch5: use RDMA to break large messages into individual packets. use small, multi-packet messages (generally 16 packets per message). "Necessary metadata for the SwitchML protocol is encoded in fields of the RDMA header; the RDMA RKey and Address fields are used to encode the destination slot and the address to write the response to."
Appendix B, RDMA implementation details: use RDMA to implement packetization / flow control / congestion control / etc., so data can be transferred between GPU memory and the NIC directly.
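A toy illustration of that metadata trick (field widths, the slot assignment, and the helper names are assumptions, not the real wire format):

```python
# A large buffer is split into multi-packet messages (16 packets per message),
# and per-message metadata (destination aggregator slot, where to write the
# response) rides in fields an RDMA WRITE already carries (RKey, remote Address).
import struct

PKT_PAYLOAD = 256        # bytes of integers per packet
PKTS_PER_MSG = 16        # "small, multi-packet messages"

def make_messages(buf: bytes, response_base_addr: int):
    msg_bytes = PKT_PAYLOAD * PKTS_PER_MSG
    for msg_idx, off in enumerate(range(0, len(buf), msg_bytes)):
        slot = msg_idx % 2                 # toy pool of 2 aggregator slots
        rkey = slot                        # 32-bit RKey field reused for the slot
        addr = response_base_addr + off    # 64-bit Address: where the result goes
        meta = struct.pack("!IQ", rkey, addr)   # metadata as it sits in the header
        yield meta, buf[off:off + msg_bytes]

buf = bytes(4 * 4096)                      # e.g. 4096 int32 gradients, zeroed
for meta, payload in make_messages(buf, response_base_addr=0x7F0000000000):
    pass                                   # hand each message to the NIC / verbs layer
```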
paper with my notes:
aggregation logic on the switch (ch4.3 and Algorithm 1):
In the packet sent to the switch, "A packet p carries a pool index, identifying the particular aggregator to be used, and contains a vector of k integers to be aggregated." The switch aggregates across the packets from different machines, and "the switch outputs the result – by rewriting the packet's vector with the aggregated value from that particular slot, and sending a copy of the packet to each worker."
also see https://github.com/pentium3/p4app-switchML/blob/mod/dev_root/p4/headers.p4#L160 — the integers to be aggregated are laid out in the header so they can be parsed by the P4 program.
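A Python rendering of the per-slot logic from Algorithm 1, heavily simplified (no packet loss / retransmission handling or slot versioning):

```python
# Each slot accumulates the k-integer vector from every worker; the packet
# from the last worker is rewritten with the aggregate and broadcast back.
NUM_WORKERS = 4
K = 8

class Slot:
    def __init__(self):
        self.values = [0] * K
        self.count = 0

pool = [Slot() for _ in range(4)]   # small aggregator pool

def on_packet(pool_index: int, vector: list):
    slot = pool[pool_index]
    if slot.count == 0:
        slot.values = list(vector)                         # first packet initializes the slot
    else:
        slot.values = [a + b for a, b in zip(slot.values, vector)]
    slot.count += 1
    if slot.count == NUM_WORKERS:
        result = slot.values
        slot.values, slot.count = [0] * K, 0               # free the slot for reuse
        return result                                      # broadcast a copy to every worker
    return None                                            # keep waiting for more workers
```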
MTU varies by path; typically MTU == 1500. A packet longer than the MTU gets split (TCP segmentation for TCP traffic, IP fragmentation for UDP).
https://www.usenix.org/conference/nsdi21/presentation/sapio
https://github.com/p4lang/p4app-switchML/tree/main/dev_root