houminz / paper-reading

Paper Reading: distributed systems, virtualization, networking, machine learning
https://houmin.cc/papers

PipeDevice: A Hardware-Software Co-Design Approach to Intra-Host Container Communication #12

Open houminz opened 1 year ago

houminz commented 1 year ago

paper: https://qiangsu97.github.io/files/conext22-final43.pdf

houminz commented 1 year ago

Background:

  1. Servers in the cloud keep getting bigger; 128-core machines are already available [^1], so hundreds of containers may run on the same server.
  2. Cloud operators and container orchestrators often schedule a tenant's containers onto as few servers as possible to improve resource utilization. This packs more containers onto each server and makes intra-host traffic between containers substantial.
  3. Big-data frameworks such as Spark often co-locate mappers and reducers in containers on the same server for data locality; in this setting the shuffle stage generates a large amount of bulky data transfer.
  4. In distributed machine learning, the model-update phase of parameter-server and allreduce architectures also generates heavy data communication.
  5. In the increasingly popular service mesh pattern, a Pod contains a sidecar proxy container next to the application container, which again produces plenty of intra-host bulky data transfer.

To optimize such bulky data transfers over TCP, solutions generally fall into two categories:

The software approach typically uses shared memory to reduce memory-copy overhead on the data path [^2] [^3]. However, given the memory-isolation requirements of public clouds and the memory overhead of maintaining many connections, sharing memory directly between two containers is impractical. A better approach is to share memory between the container and the hypervisor [^3].
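
To make the shared-memory data path concrete, here is a minimal sketch (not from the paper) using POSIX shared memory; the segment name `/pd_demo` and the fixed size are illustrative assumptions only:

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_NAME "/pd_demo"   /* hypothetical segment name */
#define SHM_SIZE 4096

int main(void) {
    /* Producer side: create and map a shared segment. */
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0600);
    if (fd < 0) return 1;
    if (ftruncate(fd, SHM_SIZE) < 0) return 1;

    char *buf = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) return 1;

    /* Writing here is immediately visible to any peer that maps the same
     * segment -- no per-message copy through the kernel socket path.
     * Isolation, however, now depends entirely on who is allowed to
     * shm_open() this name, which is why direct container-to-container
     * sharing is hard to justify in a public cloud. */
    strcpy(buf, "hello from the sender");

    munmap(buf, SHM_SIZE);
    close(fd);
    shm_unlink(SHM_NAME);
    return 0;
}
```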

[^1]: AMD Zen 4 Epyc CPU. https://www.techradar.com/news/amd-zen-4-epyc-cpu-could-be-an-epic-128-core-256-thread-monster
[^2]: Bojie Li, Tianyi Cui, Zibo Wang, Wei Bai, and Lintao Zhang. SocksDirect: Data-center sockets can be fast and compatible. In Proc. ACM SIGCOMM, 2019.
[^3]: Tianlong Yu, Shadi Abdollahian Noghabi, Shachar Raindel, Hongqiang Liu, Jitu Padhye, and Vyas Sekar. FreeFlow: High performance container networking. In Proc. ACM HotNets, 2016.

houminz commented 1 year ago

The hardware approach is hardware-based RDMA, which delivers high throughput and low latency with no CPU overhead. However, RDMA has poor scalability [^4] [^5] [^6] [^7] [^8] [^9] [^10] [^11].

The reason RDMA does not scale is that the NIC's on-board resources for managing connection state are limited:

The root cause is the contention of limited on-board resources for connection state management

image

In fact, we show that the performance of commodity RDMA NICs (RNICs) drops by ∼50% with 4096 connections

Here is a more detailed explanation of the RDMA scalability problem.

Symptoms:

Hard to support massive concurrent connections

image

Hard to support a large number of operations

image

Section 2 of Scalable RDMA RPC on reliable connection with efficient resource sharing [^4] gives a fairly detailed explanation of the RDMA scalability problem in RC mode.

The figure below shows the message flow of an RDMA write between two servers:

  1. In the initialization phase, a QP is created between the two servers; it contains a send queue and a receive queue that store posted requests. In addition, one or more memory regions are allocated and registered via ibv_reg_mr so that the remote server can access the corresponding memory directly.
  2. In step 1, to issue a write request, the sender's CPU initiates a verb request and sends it to the local NIC with MMIO.
  3. When the NIC receives the verb request, it reads the data directly from memory via DMA read (step 2)
  4. and sends it onto the wire (step 3).
  5. When the receiving NIC gets the message, it writes the data directly into local host memory via DMA write.
  6. The CPU periodically polls the memory region until it finds a new message.

image
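
As a concrete (if simplified) illustration of steps 1-3 above, a sender using the verbs API registers a memory region and posts an RDMA write roughly as follows; QP setup and the out-of-band exchange of `remote_addr`/`rkey` are omitted, and the function and variable names are ours, not the paper's:

```c
#include <infiniband/verbs.h>
#include <stddef.h>
#include <stdint.h>

/* Assumes pd and qp have already been created, the QP has been driven to
 * RTS, and remote_addr/rkey for the peer's registered region were
 * exchanged out of band. */
void rdma_write_example(struct ibv_pd *pd, struct ibv_qp *qp,
                        void *buf, size_t len,
                        uint64_t remote_addr, uint32_t rkey)
{
    /* Init phase: pin the buffer and install its translations (MTT) and
     * permissions (MPT) on the NIC. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
            IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
    if (mr == NULL)
        return;

    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .opcode     = IBV_WR_RDMA_WRITE,
        .sg_list    = &sge,
        .num_sge    = 1,
        .send_flags = IBV_SEND_SIGNALED,
    };
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    /* Step 1: the CPU hands the verb request to the local NIC (doorbell via
     * MMIO); the NIC then DMA-reads buf (step 2) and transmits it (step 3)
     * without further CPU involvement. */
    struct ibv_send_wr *bad_wr = NULL;
    ibv_post_send(qp, &wr, &bad_wr);
}
```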

Although RDMA bypasses the kernel, a resource-contention problem limits its scalability. The contended resources here are the NIC cache and the CPU cache, and the contention falls into two kinds.

In practice, the reliable connection (RC) transport mode is commonly used since it supports the more efficient one-sided operations with reliability. In RC transport mode, commodity RNICs cache most connection states in the on-board SRAM for performance, including memory translation tables (MTTs), memory protection tables (MPTs), working queue elements (WQEs), and queue pair (QP) states. Each connection needs ∼375B for QP states alone, while the expensive on-board cache is only a few megabytes.

References [^12] and [^6] explain the relationship between the RDMA NIC cache and host memory in more detail, and why even relatively recent Mellanox NICs such as the ConnectX-5 (2017) or ConnectX-6 (2019) still suffer from this scalability problem.

The NIC cache holds MTTs, MPTs, WQEs and QP states; QP state alone takes about 375 B per connection. Although all of this state lives in host memory, the NIC also caches it in its on-board SRAM for performance. As the number of connections grows, the NIC cache starts missing and the NIC repeatedly fetches the state from host memory via DMA/PCIe reads. This is why PCIe reads grow sharply with the connection count in the figure above, and this cache contention is what makes performance collapse once there are many connections.

image

To use RDMA, applications register memory regions with the NIC, making them available for remote access. During registration, the NIC driver pins the memory pages and stores their virtual-to-physical address translations in Memory Translation Tables (MTTs). The NIC driver also records the memory region permissions in Memory Protection Tables (MPTs). When serving remote memory requests, the NIC uses MTTs and MPTs to locate the pages and check permissions. The MTTs and MPTs reside in system memory, but the NIC caches them in SRAM. If the MTTs and MPTs overflow the cache, they are accessed from main memory via DMA/PCIe, which incurs overhead. [^12]

ConnectX-5’s connection scalability is, surprisingly, not substantially better than Connect-IB despite the five-year advancement. A simple calculation shows why this is hard to improve: In Mellanox’s implementation, each connection requires ≈375 B of in-NIC connection state, and the NICs have ≈2 MB of SRAM to store connection state as well as other data structures and buffers. 5000 connections require 1.8 MB, so cache misses are unavoidable. [^6]
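
A quick back-of-envelope check ties these numbers together: at roughly 375 B of connection state each, the 4096 connections from the earlier measurement already need about 4096 × 375 B ≈ 1.5 MB, most of the ≈2 MB of on-board SRAM that must also hold MTTs, MPTs, WQEs and other buffers, so the ∼50% throughput drop is consistent with the NIC cache thrashing against host memory over PCIe.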

[^4]: Youmin Chen, Youyou Lu, and Jiwu Shu. Scalable RDMA RPC on reliable connection with efficient resource sharing. In Proc. ACM EuroSys, 2019.
[^5]: Aleksandar Dragojević, Dushyanth Narayanan, Miguel Castro, and Orion Hodson. FaRM: Fast remote memory. In Proc. USENIX NSDI, 2014.
[^6]: Anuj Kalia, Michael Kaminsky, and David Andersen. Datacenter RPCs can be general and fast. In Proc. USENIX NSDI, 2019.
[^7]: Anuj Kalia, Michael Kaminsky, and David G. Andersen. Using RDMA efficiently for key-value services. In Proc. ACM SIGCOMM, 2014.
[^8]: Anuj Kalia, Michael Kaminsky, and David G. Andersen. Design guidelines for high performance RDMA systems. In Proc. USENIX ATC, 2016.
[^9]: Anuj Kalia, Michael Kaminsky, and David G. Andersen. FaSST: Fast, scalable and simple distributed transactions with two-sided (RDMA) datagram RPCs. In Proc. USENIX OSDI, 2016.
[^10]: Shin-Yeh Tsai and Yiying Zhang. LITE kernel RDMA support for datacenter applications. In Proc. ACM SOSP, 2017.
[^11]: Jian Yang, Joseph Izraelevitz, and Steven Swanson. FileMR: Rethinking RDMA networking for scalable persistent memory. In Proc. USENIX NSDI, 2020.
[^12]: Storm: A fast transactional dataplane for remote data structures.

houminz commented 1 year ago

An important characteristic of this paper is its focus on intra-host container communication. The argument is that more and more workloads, including big data and machine learning, now run in containers; server specs keep growing; and patterns such as Service Mesh's sidecar model are widespread. Together these make intra-host container communication, and in particular bulky data transfer, increasingly common. In this setting, a communication scheme that offers memory isolation, high throughput, low latency, low CPU overhead, and connection scalability would greatly improve overall communication performance.

The key features and advantages of the proposed PipeDevice are:

Each socket is allocated dedicated memory out of a hugepage region in the hypervisor, and application data in the socket buffer is directly accessed by the hardware’s DMA engine and copied to the destination memory address.

The hypervisor also manages the connection states, so that data copy is streamlined and performed in a stateless manner by hardware. PipeDevice also features a set of BSD socket-like APIs for memory and communication to port applications easily.

houminz commented 1 year ago

Container Communication

An important motivation for PipeDevice comes from today's intra-host communication path: in a typical container overlay network, data is copied between user space and kernel space twice, on both the sender and the receiver side [^1].

image

[^1]: Slim: OS kernel support for a low-overhead container overlay network

houminz commented 1 year ago

Breakdown of Communication Overhead for Bulky Transfers

image

houminz commented 1 year ago

PipeDevice system architecture

image

houminz commented 1 year ago

Communication and Memory API

image

image

Compared with the standard BSD socket API, PD Lib provides zero-copy APIs. The calls that differ most from BSD semantics are:

pd_recv()
pd_recv_done()

pd_buf_refresh()
pd_send()
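
To show how these calls are meant to fit together, here is a rough send-side sketch. The exact signatures are not given in these notes, so the prototypes below (a buffer pointer returned by pd_buf_refresh(), byte counts, return values) are assumptions, not the paper's API:

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical prototypes -- the real PD Lib signatures may differ. */
void   *pd_buf_refresh(int sockfd, size_t len); /* grab free send-buffer space */
ssize_t pd_send(int sockfd, const void *buf, size_t len);

/* Zero-copy send: the application writes its payload directly into the
 * hugepage-backed socket send buffer handed out by PD Lib, then asks
 * PD Driver / the hardware to copy it into the peer's receive buffer. */
static int send_message(int sockfd, const char *msg, size_t len)
{
    void *slot = pd_buf_refresh(sockfd, len);  /* space inside S0's send buffer */
    if (slot == NULL)
        return -1;                             /* no free slot yet */

    memcpy(slot, msg, len);                    /* produce the data in place */
    return pd_send(sockfd, slot, len) == (ssize_t)len ? 0 : -1;
}
```
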
houminz commented 1 year ago

Memory Management

houminz commented 1 year ago

Data Transmission

image
  1. The server application in C1 runs pd_epoll_wait() on socket S1, and PD Lib creates an epoll request to check whether there is any data ready to receive (a PD_EPOLLIN event). If not, pd_epoll_wait() is suspended. Meanwhile, the client application at C0 invokes a pd_send() call.
  2. PD Lib parses pd_send() and generates a send request to PD Driver. PD Driver determines the free slots in S1's receive buffer according to its RBT, generates a new queue entry E with the destination memory address, and enqueues E into C0's SQ.
  3. PD Stack obtains E via the SQ. Then it copies data from S0's send buffer to S1's receive buffer according to the address carried in E
  4. PD Stack updates E’s req type to PD_GENERAL_COMPLETED and pushes E into C0’s CQ, and raises an interrupt to PD Driver.
  5. In the interrupt context, PD Driver parses E from the CQ, releases memory in S0’s send buffer, and updates S0’s SBT and S1’s RBT. It also generates a PD_EPOLLIN event for S1, wakes up pd_epoll_wait() and copies the event to PD Lib to return to the server application.
  6. Now the data is ready for the server application, which then calls pd_recv(). PD Lib returns the data chunk pointer obtained from S1’s RBT.
  7. After the data is consumed by the server and pd_recv_done() is invoked, PD Driver receives the request from PD Lib, releases the corresponding memory in S1’s receive buffer, and updates S1’s RBT
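
A matching receive-side sketch for steps 1, 6 and 7 (again with assumed prototypes; only the call names and the PD_EPOLLIN event come from the paper) would look roughly like this:

```c
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical types and prototypes -- the real PD Lib API may differ. */
struct pd_event { int sockfd; int events; };
#define PD_EPOLLIN 0x1   /* value assumed */

int     pd_epoll_wait(int epfd, struct pd_event *events, int maxevents, int timeout);
ssize_t pd_recv(int sockfd, void **chunk);     /* returns a pointer into S1's receive buffer */
int     pd_recv_done(int sockfd, void *chunk); /* releases that chunk back to the driver */

static void serve_once(int epfd)
{
    struct pd_event ev;

    /* Step 1: block until PD Driver raises PD_EPOLLIN for some socket. */
    if (pd_epoll_wait(epfd, &ev, 1, -1) <= 0 || !(ev.events & PD_EPOLLIN))
        return;

    /* Step 6: pd_recv() hands back a pointer to the data chunk recorded in
     * S1's RBT -- no copy into an application-owned buffer. */
    void *chunk = NULL;
    ssize_t len = pd_recv(ev.sockfd, &chunk);
    if (len <= 0)
        return;

    /* ... consume the chunk in place ... */

    /* Step 7: tell PD Driver the chunk is consumed so it can release the
     * receive-buffer memory and update S1's RBT. */
    pd_recv_done(ev.sockfd, chunk);
}
```
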
houminz commented 1 year ago

Hardware Implementation

PD Stack DMAs data in the host hugepages and performs up to 4 KB reads and writes. Thus data is split into 4 KB chunks in PD Driver.
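
Since the DMA engine moves at most 4 KB per transfer, PD Driver has to break a larger payload into 4 KB descriptors; a simplified sketch of that splitting (the descriptor layout and field names are ours, not the paper's):

```c
#include <stddef.h>
#include <stdint.h>

#define PD_CHUNK_SIZE 4096  /* hardware DMA limit per read/write */

/* Hypothetical descriptor: one SQ entry per 4 KB (or smaller tail) chunk. */
struct pd_desc {
    uint64_t src_addr;   /* location in the sender's hugepage send buffer */
    uint64_t dst_addr;   /* location in the receiver's hugepage receive buffer */
    uint32_t len;        /* <= PD_CHUNK_SIZE */
};

/* Split a payload of `total` bytes into per-chunk descriptors. */
static size_t pd_split(uint64_t src, uint64_t dst, size_t total,
                       struct pd_desc *out, size_t max_descs)
{
    size_t n = 0;
    while (total > 0 && n < max_descs) {
        uint32_t len = total > PD_CHUNK_SIZE ? PD_CHUNK_SIZE : (uint32_t)total;
        out[n++] = (struct pd_desc){ .src_addr = src, .dst_addr = dst, .len = len };
        src   += len;
        dst   += len;
        total -= len;
    }
    return n;  /* number of descriptors produced */
}
```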

image

Read SQ

  1. The host CPU writes a command queue entry E to the SQ, then updates the corresponding SQ and CQ states (e.g., sq_ptr) through MMIO;
  2. Ptr Calculator computes the access offset of SQ and passes it to PD Controller;
  3. PD Controller generates a DMA read request for E, then sends it to DMA engine;
  4. DMA Engine reads E from SQ and sends it to PD Controller

Read payload from send buffer

  1. PD Controller generates a DMA read request for fetching payload specified in E and sends it to DMA engine.
  2. DMA Engine gets the payload, writes it to Data Buffer, and notifies PD Controller with a read completion message.

Write payload to receive buffer

  1. PD Controller generates a DMA write request for payload and delivers it to DMA Engine;
  2. DMA Engine gets the payload from Data Buffer and writes it to the receive buffer, then sends a write completion message to PD Controller. Now the payload is transmitted.

Write CQ

  1. PD Controller updates E’s req type to PD_GENERAL_COMPLETED, gets CQ’s free slots through Ptr Calculator similar to (2), and generates a DMA write request;
  2. DMA Engine writes the updated E to CQ and generates a write completion message to PD Controller

After the above steps, PD Controller creates a new request and sends it to DMA Engine, which then raises an MSI interrupt to notify the CPU for kernel processing.
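
Putting the four phases together, the host side of this protocol is essentially a pair of producer/consumer rings. The sketch below shows the shape of the SQ doorbell write and the CQ completion check; the struct layout, register name, and constant value are all assumptions, not taken from the paper:

```c
#include <stdbool.h>
#include <stdint.h>

#define PD_GENERAL_COMPLETED 0x2   /* completion req type (value assumed) */

/* Hypothetical command-queue entry E shared between PD Driver and PD Stack. */
struct pd_cqe {
    uint32_t req_type;   /* set to PD_GENERAL_COMPLETED by PD Stack in "Write CQ" */
    uint32_t len;
    uint64_t src_addr;   /* payload location in the send buffer */
    uint64_t dst_addr;   /* destination in the receive buffer (from the RBT) */
};

struct pd_queue {
    struct pd_cqe     *ring;     /* host-memory ring the device DMAs from/to */
    uint32_t           size;
    uint32_t           head;     /* producer index (driver for SQ, device for CQ) */
    uint32_t           tail;     /* consumer index */
    volatile uint32_t *doorbell; /* MMIO register, e.g. sq_ptr on the card */
};

/* "Read SQ", step 1: the driver writes E into the SQ ring, then updates the
 * device-visible pointer through MMIO so PD Stack can DMA-read the entry.
 * Real driver code would also need memory barriers around the doorbell write. */
static void pd_sq_post(struct pd_queue *sq, const struct pd_cqe *e)
{
    sq->ring[sq->head % sq->size] = *e;   /* entry lands in host memory */
    sq->head++;
    *sq->doorbell = sq->head;             /* MMIO doorbell: new sq_ptr */
}

/* "Write CQ": in the MSI interrupt handler the driver consumes the completed
 * entry that the device DMA-wrote back into the CQ ring. */
static bool pd_cq_poll(struct pd_queue *cq, struct pd_cqe *out)
{
    struct pd_cqe *e = &cq->ring[cq->tail % cq->size];
    if (e->req_type != PD_GENERAL_COMPLETED)
        return false;                     /* nothing completed yet */
    *out = *e;
    e->req_type = 0;                      /* hand the slot back to the device */
    cq->tail++;
    return true;
}
```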