Background:
To optimize bulky data transfer over TCP, there are generally two classes of solutions:
The software approach typically uses shared memory to reduce memory copy overhead in the data path [^2] [^3]. However, given the memory isolation requirements of public clouds and the memory overhead of maintaining many connections, sharing memory directly between two containers is impractical. A better approach is to share memory between the container and the hypervisor [^3].
[^1]: AMD Zen 4 Epyc CPU. https://www.techradar.com/news/amd-zen-4-epyc-cpu-could-be-an-epic-128-core-256-thread-monster
[^2]: Bojie Li, Tianyi Cui, Zibo Wang, Wei Bai, and Lintao Zhang. SocksDirect: Datacenter sockets can be fast and compatible. In Proc. ACM SIGCOMM, 2019.
[^3]: Tianlong Yu, Shadi Abdollahian Noghabi, Shachar Raindel, Hongqiang Liu, Jitu Padhye, and Vyas Sekar. FreeFlow: High performance container networking. In Proc. ACM HotNets, 2016.
A hardware solution to this problem is hardware-based RDMA, which provides high throughput and low latency with no CPU overhead. However, RDMA has poor scalability [^4] [^5] [^6] [^7] [^8] [^9] [^10] [^11].
RDMA fails to scale because the NIC has only limited on-board resources for managing connection state. As the paper notes:
The root cause is the contention of limited on-board resources for connection state management
In fact, we show that the performance of commodity RDMA NICs (RNICs) drops by ∼50% with 4096 connections
The rest of this section explains the RDMA scalability problem in more detail.
Symptoms:
- Hard to support massive concurrent connections
- Hard to support a large number of operations
Section 2 of Scalable RDMA RPC on reliable connection with efficient resource sharing [^4] gives a fairly detailed explanation of the RDMA scalability problem under the RC transport mode.
The figure below shows the message flow of an RDMA write between two servers:
The application calls ibv_reg_mr to allocate and register one or more memory regions, so that the remote server can directly access the corresponding memory (see the sketch below). Although RDMA bypasses the kernel, a resource contention problem limits its scalability; the contended resources are the NIC cache and the CPU cache, and the contention comes in two forms.
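For reference, memory-region registration with the standard verbs API looks roughly like this. This is a minimal sketch (single device, abbreviated error handling), not PipeDevice code:

```c
/* Minimal sketch of RDMA memory registration with libibverbs.
 * Assumes a single RDMA device is present; error handling is abbreviated. */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no RDMA device\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);          /* protection domain */

    /* Register a 1 MB buffer: the driver pins the pages and installs the
     * virtual-to-physical translations (MTT) and permissions (MPT) that the
     * NIC later consults when serving remote reads/writes. */
    size_t len = 1 << 20;
    void *buf = malloc(len);
    memset(buf, 0, len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) { perror("ibv_reg_mr"); return 1; }

    /* lkey/rkey identify this region in local and remote work requests. */
    printf("registered %zu bytes, lkey=0x%x rkey=0x%x\n", len, mr->lkey, mr->rkey);

    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```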
In practice, the reliable connection (RC) transport mode is commonly used since it supports the more efficient one-sided operations with reliability. In RC transport mode, commodity RNICs cache most connection states in the on-board SRAM for performance, including memory translation tables (MTTs), memory protection tables (MPTs), working queue elements (WQEs), and queue pair (QP) states. Each connection needs ∼375B for QP states alone, while the expensive on-board cache is only a few megabytes.
References [^12] and [^6] explain in more detail the relationship between the RDMA NIC cache and host memory, and why even relatively recent Mellanox NICs such as ConnectX-5 (2017) or ConnectX-6 (2019) still suffer from this scalability problem.
The NIC cache holds MTTs, MPTs, WQEs, and QP states; the QP states alone require ~375 B per connection. Although these states are kept in host memory, the NIC also caches them in its on-board SRAM for performance. As the number of connections grows, the NIC cache starts missing and the NIC repeatedly fetches the states from host memory via DMA/PCIe reads. This is why PCIe reads rise sharply at higher connection counts in the figure above; this cache contention is what causes the steep performance drop once the connection count grows.
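A quick back-of-envelope check with the numbers quoted below (~375 B of connection state per QP, ~2 MB of on-board SRAM) shows why a few thousand connections already overflow the cache:

```c
/* Back-of-envelope: per-connection QP state vs. NIC SRAM capacity,
 * using the ~375 B/connection and ~2 MB SRAM figures quoted below. */
#include <stdio.h>

int main(void) {
    const double state_per_conn = 375.0;   /* bytes of QP state per connection */
    const double nic_sram       = 2e6;     /* ~2 MB of on-board SRAM           */

    printf("4096 connections need %.2f MB of state\n", 4096 * state_per_conn / 1e6);
    printf("5000 connections need %.2f MB of state\n", 5000 * state_per_conn / 1e6);
    printf("SRAM holds at most ~%.0f connections\n", nic_sram / state_per_conn);
    return 0;
}
```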
To use RDMA, applications register memory regions with the NIC, making them available for remote access. During registration, the NIC driver pins the memory pages and stores their virtual-to-physical address translations in Memory Translation Tables (MTTs). The NIC driver also records the memory region permissions in Memory Protection Tables (MPTs). When serving remote memory requests, the NIC uses MTTs and MPTs to locate the pages and check permissions. The MTTs and MPTs reside in system memory, but the NIC caches them in SRAM. If the MTTs and MPTs overflow the cache, they are accessed from main memory via DMA/PCIe, which incurs overhead. [^12]
ConnectX-5’s connection scalability is, surprisingly, not substantially better than Connect-IB despite the five-year advancement. A simple calculation shows why this is hard to improve: In Mellanox’s implementation, each connection requires ≈375 B of in-NIC connection state, and the NICs have ≈2 MB of SRAM to store connection state as well as other data structures and buffers. 5000 connections require 1.8 MB, so cache misses are unavoidable. [^6]
[^4]: Youmin Chen, Youyou Lu, and Jiwu Shu. Scalable RDMA RPC on reliable connection with efficient resource sharing. In Proc. ACM EuroSys, 2019.
[^5]: Aleksandar Dragojević, Dushyanth Narayanan, Miguel Castro, and Orion Hodson. FaRM: Fast remote memory. In Proc. USENIX NSDI, 2014.
[^6]: Anuj Kalia, Michael Kaminsky, and David G. Andersen. Datacenter RPCs can be general and fast. In Proc. USENIX NSDI, 2019.
[^7]: Anuj Kalia, Michael Kaminsky, and David G. Andersen. Using RDMA efficiently for key-value services. In Proc. ACM SIGCOMM, 2014.
[^8]: Anuj Kalia, Michael Kaminsky, and David G. Andersen. Design guidelines for high performance RDMA systems. In Proc. USENIX ATC, 2016.
[^9]: Anuj Kalia, Michael Kaminsky, and David G. Andersen. FaSST: Fast, scalable and simple distributed transactions with two-sided (RDMA) datagram RPCs. In Proc. USENIX OSDI, 2016.
[^10]: Shin-Yeh Tsai and Yiying Zhang. LITE kernel RDMA support for datacenter applications. In Proc. ACM SOSP, 2017.
[^11]: Jian Yang, Joseph Izraelevitz, and Steven Swanson. FileMR: Rethinking RDMA networking for scalable persistent memory. In Proc. USENIX NSDI, 2020.
[^12]: Storm: A fast transactional dataplane for remote data structures.
An important aspect of this paper is its focus on the intra-host container communication scenario. The argument is that more and more workloads now run in containers, including big data and machine learning; server specifications keep growing; and sidecar deployments such as service mesh add further traffic. As a result, intra-host container communication keeps increasing, and much of it consists of bulky data transfers. In this setting, a communication scheme that offers memory isolation, high throughput, low latency, low CPU overhead, and connection scalability would greatly improve overall communication performance.
The features and advantages of the proposed PipeDevice are:
Each socket is allocated dedicated memory out of a hugepage region in the hypervisor, and application data in the socket buffer is directly accessed by the hardware’s DMA engine and copied to the destination memory address.
The hypervisor also manages the connection states, so that data copy is streamlined and performed in a stateless manner by hardware. PipeDevice also features a set of BSD socket-like APIs for memory and communication to port applications easily.
Container Communication
An important motivation for PipeDevice is the cost of today's intra-host communication: in a typical container overlay network, data is copied between user space and kernel space twice, on both the sender and the receiver side [^1].
[^1]: Slim: OS kernel support for a low-overhead container overlay network
Breakdown of Communication Overhead for Bulky Transfers
PipeDevice system architecture
Communication and Memory API
Compared with the standard BSD socket API, PD Lib provides a zero-copy API. The distinctive calls are the following (see the sketch after this list):
- pd_recv()
- pd_recv_done()
- pd_buf_refresh()
- pd_send()
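The special part is the receive path: the application consumes data in place from the socket's receive buffer and then hands the chunk back, instead of copying it out. Below is a minimal usage sketch; the prototypes are assumptions made for illustration only (the real PipeDevice signatures are not reproduced in this note), and the flow mirrors the data-transmission walkthrough later in this post:

```c
/* Sketch of the zero-copy usage pattern. All prototypes below are ASSUMED
 * for illustration; the actual PipeDevice API signatures are not shown here. */
#include <stddef.h>

struct pd_event { int sock; int events; };   /* hypothetical event record */
#define PD_EPOLLIN 0x1                        /* "data ready" event, per the walkthrough */

int   pd_send(int sock, const void *buf, size_t len);                          /* assumed */
void *pd_recv(int sock, size_t *len);                                          /* assumed */
int   pd_recv_done(int sock, void *chunk);                                     /* assumed */
int   pd_epoll_wait(int epfd, struct pd_event *evs, int max, int timeout_ms);  /* assumed */

/* Client side (socket S0): hand data to PD Driver; the hardware copies it. */
void client_send(int s0, const void *payload, size_t len) {
    pd_send(s0, payload, len);
}

/* Server side (socket S1): wait for PD_EPOLLIN, consume the chunk in place,
 * then return it so PD Driver can reclaim the receive-buffer memory. */
void server_recv(int epfd, int s1) {
    struct pd_event ev;
    pd_epoll_wait(epfd, &ev, 1, -1);   /* suspended until data is ready */

    size_t len;
    void *chunk = pd_recv(s1, &len);   /* pointer into S1's receive buffer (via the RBT) */
    /* ... process `chunk` in place, no extra copy ... */
    pd_recv_done(s1, chunk);           /* release the chunk in the receive buffer */
}
```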
Memory Management
Data Transmission
The server application first calls pd_epoll_wait() on socket S1, and PD Lib creates an epoll request to check whether there is any ready data to receive (a PD_EPOLLIN event). If not, pd_epoll_wait() is suspended. On the client side (C0), when the client application invokes a pd_send() call, PD Lib generates a send request to PD Driver. PD Driver determines the free slots in S1's receive buffer according to its RBT, generates a new queue entry E with the destination memory address, and enqueues E to C0's SQ. The hardware (PD Stack) then copies the data via DMA, marks E as PD_GENERAL_COMPLETED, pushes E into C0's CQ, and raises an interrupt to PD Driver. PD Driver dequeues E from the CQ, releases memory in S0's send buffer (S0 being the client-side socket), and updates S0's SBT and S1's RBT. It also generates a PD_EPOLLIN event for S1, wakes up pd_epoll_wait(), and copies the event to PD Lib to return to the server application. The server application then calls pd_recv(), and PD Lib returns the data chunk pointer obtained from S1's RBT. Finally, once pd_recv_done() is invoked, PD Driver receives the request from PD Lib, releases the corresponding memory in S1's receive buffer, and updates S1's RBT.
Hardware Implementation
PD Stack DMAs data in the host hugepages and performs up to 4 KB reads and writes. Thus data is split into 4 KB chunks in PD Driver.
PD Controller then marks the entry's status as PD_GENERAL_COMPLETED, gets the CQ's free slots through the Ptr Calculator (similar to step (2)), and generates a DMA write request. After the above steps, PD Controller creates a new request and sends it to the DMA Engine, which raises an MSI interrupt to notify the CPU for kernel processing.
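To make the queue handling above more concrete, here is a purely hypothetical sketch of how PD Driver might split a send request into 4 KB chunks, describing each chunk with a queue entry that carries a destination address and a status field; the real descriptor layout is not given in this note and all names below are assumptions:

```c
/* Purely hypothetical sketch of the 4 KB chunking and SQ-entry layout
 * implied by the description above; field names and sizes are assumptions. */
#include <stdint.h>
#include <stddef.h>

#define PD_CHUNK_SIZE 4096            /* PD Stack reads/writes at most 4 KB per operation */

enum pd_status {
    PD_PENDING = 0,
    PD_GENERAL_COMPLETED = 1,         /* set by PD Controller once the DMA copy is done */
};

struct pd_sq_entry {                  /* one entry per 4 KB chunk */
    uint64_t src_addr;                /* location in the sender's hugepage send buffer     */
    uint64_t dst_addr;                /* free slot in the receiver's buffer, from the RBT  */
    uint32_t len;                     /* <= PD_CHUNK_SIZE */
    uint32_t status;                  /* enum pd_status */
};

/* Split a send request into 4 KB chunks, one SQ entry each. */
static size_t pd_build_entries(uint64_t src, uint64_t dst, size_t len,
                               struct pd_sq_entry *sq, size_t sq_cap) {
    size_t n = 0;
    while (len > 0 && n < sq_cap) {
        uint32_t chunk = len > PD_CHUNK_SIZE ? PD_CHUNK_SIZE : (uint32_t)len;
        sq[n] = (struct pd_sq_entry){ src, dst, chunk, PD_PENDING };
        src += chunk; dst += chunk; len -= chunk; n++;
    }
    return n;                         /* number of entries built */
}
```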
paper: https://qiangsu97.github.io/files/conext22-final43.pdf