In data-parallel distributed ML, communication between workers is very intensive; the network becomes the bottleneck for training speed.
idea: use a programmable switch to aggregate model updates in-network.
streaming aggregation: use the limited registers on the switch to aggregate a large model (vector). The basic idea is to reuse the registers for different parts of the vector in a pipelined manner (e.g., aggregate v[0..4], send the result a[0..4] back, then aggregate v[5..9]). "SwitchML instead streams aggregation through the switch: it processes the aggregation function on a limited number of vector elements at once."
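To make the streaming idea concrete, here is a toy Python simulation of my own (not the SwitchML code): a small pool of aggregator slots is reused for successive k-element chunks of the gradient vector, so a bounded amount of switch memory can aggregate an arbitrarily long vector.

```python
# Toy simulation of streaming aggregation; sizes and names are illustrative.
NUM_WORKERS = 4
K = 4            # elements aggregated per packet
POOL_SIZE = 2    # number of aggregator slots on the "switch"

def stream_aggregate(vectors):
    """vectors: one gradient vector per worker, all the same length."""
    length = len(vectors[0])
    result = [0] * length
    # Each chunk of K elements maps to slot (chunk_idx % POOL_SIZE), so the
    # same registers are reused for later parts of the vector.
    for chunk_start in range(0, length, K):
        slot = (chunk_start // K) % POOL_SIZE
        aggregator = [0] * K                      # the slot's registers, cleared
        for w in range(NUM_WORKERS):
            chunk = vectors[w][chunk_start:chunk_start + K]
            for i, v in enumerate(chunk):
                aggregator[i] += v
        # All workers contributed: write the aggregate back ("broadcast" it)
        # and the slot becomes free for the next chunk.
        result[chunk_start:chunk_start + K] = aggregator[:length - chunk_start]
        print(f"slot {slot}: aggregated elements [{chunk_start}..{min(chunk_start + K, length) - 1}]")
    return result

if __name__ == "__main__":
    vecs = [[w + 1] * 10 for w in range(NUM_WORKERS)]   # toy gradients
    print(stream_aggregate(vecs))                        # every element sums to 10
```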
switch side (P4):
use this packet format: https://switchml.readthedocs.io/en/latest/readmes/p4.html#switchml-packet-formats
With UDP, SwitchML packets carry a dedicated header between UDP and the payload. A range of UDP ports [0xBEE0, 0xBEEF] is used as destination/source ports in packets received/sent by the switch. Currently the supported payload is either 256B or 1024B (using recirculation). This is the overall packet format:
Ethernet | IPv4 | UDP | SwitchML | Payload | Ethernet FCS
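A rough sketch of what building such a packet could look like on a worker. The header fields and widths below are assumptions made for illustration only; the authoritative layout is in the linked docs.

```python
# Illustrative only: build a UDP payload that mimics "SwitchML header + integer
# payload". The pool-index/count header here is a guess, not the real format.
import socket
import struct

SWITCHML_DST_PORT = 0xBEE0   # within the [0xBEE0, 0xBEEF] range noted above

def build_packet(pool_index: int, values: list) -> bytes:
    # Hypothetical header: 16-bit pool index + 16-bit element count, followed
    # by the payload of big-endian 32-bit integers (64 ints = 256 B).
    header = struct.pack("!HH", pool_index, len(values))
    payload = struct.pack(f"!{len(values)}i", *values)
    return header + payload

def send_packet(pkt: bytes, switch_ip: str = "10.0.0.1"):
    # Plain UDP socket; the SwitchML header sits between UDP and the payload.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.sendto(pkt, (switch_ip, SWITCHML_DST_PORT))

pkt = build_packet(pool_index=3, values=list(range(64)))   # 64 * 4 B = 256 B payload
```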
Switch Controller:
program the switch at runtime: configure the speed, MAC address, etc. of each port.
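Conceptually the controller does something like the sketch below (class and method names are hypothetical; the real controller talks to the running switch driver, e.g. over a runtime gRPC interface, without reloading the P4 program):

```python
# Hedged sketch of the controller's port-configuration role, not the real code.
from dataclasses import dataclass

@dataclass
class PortConfig:
    port: int          # front-panel port number
    speed_gbps: int    # e.g. 100
    mac: str           # MAC address used for the worker attached to this port

class SwitchController:
    def __init__(self):
        self.ports = {}

    def add_port(self, cfg: PortConfig):
        # In the real controller this issues runtime updates on the switch.
        self.ports[cfg.port] = cfg
        print(f"port {cfg.port}: {cfg.speed_gbps}G, mac {cfg.mac}")

ctrl = SwitchController()
ctrl.add_port(PortConfig(port=1, speed_gbps=100, mac="02:00:00:00:00:01"))
```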
integrate SwitchML into the DNN software stack. e.g., for PyTorch: take advantage of PyTorch's Gloo backend and customize it so that it uses SwitchML instead of Gloo for the operations and data types that SwitchML supports. The patch applies its modifications on top of stock PyTorch.
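The dispatch idea, roughly (a sketch, not the actual patch; `switchml_allreduce` and the supported-dtype set are stand-ins):

```python
# Route allreduce to a SwitchML-capable path only for the ops/dtypes it
# supports; otherwise fall back to the stock Gloo backend. Assumes
# dist.init_process_group(backend="gloo") was called elsewhere.
import torch
import torch.distributed as dist

SUPPORTED_DTYPES = {torch.float32, torch.int32}   # assumption for this sketch

def switchml_allreduce(tensor: torch.Tensor) -> None:
    # Placeholder for the SwitchML client library call (in-network sum).
    raise NotImplementedError

def allreduce(tensor: torch.Tensor, op=dist.ReduceOp.SUM) -> None:
    if op == dist.ReduceOp.SUM and tensor.dtype in SUPPORTED_DTYPES:
        try:
            switchml_allreduce(tensor)
            return
        except NotImplementedError:
            pass                      # no switch available; fall back
    dist.all_reduce(tensor, op=op)    # regular Gloo allreduce
```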
ch5: use RDMA to break large messages into individual packets. use small, multi-packet messages (generally 16 packets per message). "Necessary metadata for the SwitchML protocol is encoded in fields of the RDMA header; the RDMA RKey and Address fields are used to encode the destination slot and the address to write the response to."
Appendix B, RDMA implementation details: use RDMA to implement packetization / flow control / congestion control / etc., so data can be transferred between GPU memory and the NIC directly.
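A toy illustration of that metadata trick (field widths, the slot assignment, and the helper names are assumptions, not the real wire format):

```python
# A large buffer is split into multi-packet messages (16 packets per message),
# and per-message metadata (destination aggregator slot, where to write the
# response) rides in fields an RDMA WRITE already carries (RKey, remote Address).
import struct

PKT_PAYLOAD = 256        # bytes of integers per packet
PKTS_PER_MSG = 16        # "small, multi-packet messages"

def make_messages(buf: bytes, response_base_addr: int):
    msg_bytes = PKT_PAYLOAD * PKTS_PER_MSG
    for msg_idx, off in enumerate(range(0, len(buf), msg_bytes)):
        slot = msg_idx % 2                 # toy pool of 2 aggregator slots
        rkey = slot                        # 32-bit RKey field reused for the slot
        addr = response_base_addr + off    # 64-bit Address: where the result goes
        meta = struct.pack("!IQ", rkey, addr)   # metadata as it sits in the header
        yield meta, buf[off:off + msg_bytes]

buf = bytes(4 * 4096)                      # e.g. 4096 int32 gradients, zeroed
for meta, payload in make_messages(buf, response_base_addr=0x7F0000000000):
    pass                                   # hand each message to the NIC / verbs layer
```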
paper with my notes:
aggregation logic on the switch (ch4.3 and Algorithm 1):
In the packet sent to the switch, "A packet p carries a pool index, identifying the particular aggregator to be used, and contains a vector of k integers to be aggregated." The switch aggregates across the packets from different machines, and "the switch outputs the result – by rewriting the packet's vector with the aggregated value from that particular slot, and sending a copy of the packet to each worker."
also see https://github.com/pentium3/p4app-switchML/blob/mod/dev_root/p4/headers.p4#L160 — the integers to be aggregated are laid out in the header so they can be parsed by the P4 program.
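A Python rendering of the per-slot logic from Algorithm 1, heavily simplified (no packet loss / retransmission handling or slot versioning):

```python
# Each slot accumulates the k-integer vector from every worker; the packet
# from the last worker is rewritten with the aggregate and broadcast back.
NUM_WORKERS = 4
K = 8

class Slot:
    def __init__(self):
        self.values = [0] * K
        self.count = 0

pool = [Slot() for _ in range(4)]   # small aggregator pool

def on_packet(pool_index: int, vector: list):
    slot = pool[pool_index]
    if slot.count == 0:
        slot.values = list(vector)                         # first packet initializes the slot
    else:
        slot.values = [a + b for a, b in zip(slot.values, vector)]
    slot.count += 1
    if slot.count == NUM_WORKERS:
        result = slot.values
        slot.values, slot.count = [0] * K, 0               # free the slot for reuse
        return result                                      # broadcast a copy to every worker
    return None                                            # keep waiting for more workers
```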
MTU varies by path; typically MTU == 1500. A packet longer than the MTU gets split (TCP segmentation for TCP traffic, IP fragmentation for UDP).
https://www.usenix.org/conference/nsdi21/presentation/sapio
https://github.com/p4lang/p4app-switchML/tree/main/dev_root