yairgott opened 4 days ago
I think zero-copy is pretty promising. As you mentioned, zero-copy isn't a "free lunch" though: there is some overhead in tracking references, registering for new events in the event loop for the SO_EE_ORIGIN_ZEROCOPY messages, and pinning pages in kernel space. The documentation recommends against it for small writes, but then again, on a single-threaded architecture there may be less contention for the page pinning, so it may do better than they expect.
I think it makes sense to proceed with a prototype and get some benchmark information on how this changes PSYNC/replication streaming performance. It could be the case that it doesn't improve much, or it could be a big improvement.
I have something in the works - will add a comment when I have some data to share
So I have a prototype where I enable zero copy on outgoing replication links. I can post a draft PR soon.
I did some local testing on my development machine. Test setup is as follows:
a. Primary: `valkey-server --port 6379 --save "" --client-output-buffer-limit "replica 0 0 0"`
b. Replica: `valkey-server --replicaof localhost 6379 --port 6380 --save "" --repl-diskless-load swapdb`
c. I use `client-output-buffer-limit "replica 0 0 0"` to prevent the replica from being disconnected when I send a lot of write traffic.
d. Load generation:
```
memtier_benchmark --protocol=redis --server=localhost --port=6379 \
  --key-maximum=100000 --data-size=409600 --pipeline=64 --randomize \
  --clients=6 --threads=6 --requests=allkeys --ratio=1:0 --key-pattern=P:P
```
e. I track the replica's catch-up with `info replication`.
What I found is the following:

| Setting | Time to write keys to primary | Time for replica to catch up | Total time |
|---|---|---|---|
| Zero Copy Off | 48.0846 sec | 14.1706 sec | 62.2552 sec |
| Zero Copy On | 49.0943 sec | 1.1838 sec | 50.2781 sec |
| Delta | +2.1% | -91.6% | -19.24% |
I want to test this on a network interface that isn't loopback next. I am guessing things may look a bit different if we are actually going over the wire.
Problem Statement
In the current design, the primary maintains a replication buffer to record mutation commands for syncing the replicas. This replication buffer is implemented as a linked list of chunked buffers. The primary periodically transmits these recorded commands to each replica by issuing socket writes on the replica connections, which involve copying data from the user-space buffer to the kernel. The transmission is performed by the writeToReplica function, which uses connWrite to send data over the socket.
This user-space-to-kernel buffer copy consumes CPU cycles and increases the memory footprint. The overhead becomes more noticeable when a replica lags significantly behind the primary, as psync triggers a transmission burst. Such a burst may temporarily reduce the primary's responsiveness, with excessive copying and potential TCP write buffer exhaustion being major contributing factors.
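For contrast with the proposal below, here is a simplified sketch of this conventional copying path; this is generic illustration code, not Valkey's actual writeToReplica:

```c
#include <sys/types.h>
#include <sys/socket.h>

/* Simplified stand-in for the chunked replication buffer linked list. */
typedef struct repl_chunk {
    struct repl_chunk *next;
    size_t used;                /* bytes of valid data in this chunk */
    char data[16 * 1024];
} repl_chunk;

/* Conventional transmission: every send() copies chunk data from user
 * space into the kernel socket buffer before it reaches the wire. */
static void write_chunks_to_replica(int fd, repl_chunk *head) {
    for (repl_chunk *c = head; c != NULL; c = c->next) {
        ssize_t n = send(fd, c->data, c->used, 0);
        if (n < 0)
            return;             /* e.g. EAGAIN: socket buffer full, retry later */
        /* Partial-write handling elided; real code must resume mid-chunk. */
    }
}
```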
Proposal
Modern Linux systems support zero-copy transmission (the MSG_ZEROCOPY facility), which operates by:

- pinning the user-space pages that back the send buffer instead of copying their contents into kernel buffers,
- transmitting directly from those pinned pages, and
- notifying user space through the socket error queue (SO_EE_ORIGIN_ZEROCOPY messages) once the pages can safely be reused or freed.
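To make the mechanism concrete, below is a minimal sketch of this flow, assuming a connected TCP socket on Linux 4.14+; the helper names (`enable_zerocopy`, `send_zerocopy`, `drain_completions`) are illustrative, not from the prototype:

```c
#include <string.h>
#include <sys/socket.h>
#include <linux/errqueue.h>   /* struct sock_extended_err, SO_EE_ORIGIN_ZEROCOPY */

/* Opt the socket in once; sends still copy unless MSG_ZEROCOPY is passed. */
static int enable_zerocopy(int fd) {
    int one = 1;
    return setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
}

/* The pages backing buf are pinned until the kernel reports completion,
 * so buf must stay valid and unmodified until then. */
static ssize_t send_zerocopy(int fd, const void *buf, size_t len) {
    return send(fd, buf, len, MSG_ZEROCOPY);
}

/* Drain completion notifications from the socket error queue; in Valkey this
 * would be driven by the event loop. Reading the error queue never blocks:
 * recvmsg() fails with EAGAIN when it is empty. Each notification covers a
 * range of zero-copy sends, numbered per socket in call order. */
static void drain_completions(int fd) {
    char control[64];
    for (;;) {
        struct msghdr msg = {0};
        msg.msg_control = control;
        msg.msg_controllen = sizeof(control);
        if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0)
            break;                                /* queue is empty */
        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
        if (!cmsg)
            continue;
        struct sock_extended_err *serr =
            (struct sock_extended_err *)CMSG_DATA(cmsg);
        if (serr->ee_errno == 0 && serr->ee_origin == SO_EE_ORIGIN_ZEROCOPY) {
            /* Sends numbered ee_info..ee_data have completed; their
             * buffers may now be freed or reused. */
        }
    }
}
```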
The primary downside of zero-copy is the need for user space to manage the send buffer until the kernel signals completion. However, this limitation is much less of a concern for the psync use case, as Valkey already manages the psync replication buffers.
It’s important to note that using zero-copy for psync requires careful adjustments to the replica clients' write-buffer management logic; specifically, the logic that ensures the total accumulated replication write buffer size, across all replica connections, stays within the value of `client-output-buffer-limit replica`. A sketch of the required accounting follows.
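As a rough illustration (hypothetical names, not the prototype's code), the per-replica accounting would likely need to treat pinned, in-flight bytes as still occupying the output buffer, since the primary cannot reclaim them until the kernel's completion notification arrives:

```c
#include <stddef.h>

/* Hypothetical per-replica buffer accounting. */
typedef struct {
    size_t queued_bytes;    /* still sitting in the replication buffer list */
    size_t inflight_bytes;  /* handed to MSG_ZEROCOPY, pages still pinned */
} repl_buf_accounting;

/* A replica's memory is only reclaimable once completions arrive, so both
 * queued and in-flight bytes must count against the configured limit. */
static int over_output_limit(const repl_buf_accounting *a, size_t hard_limit) {
    return hard_limit != 0 && a->queued_bytes + a->inflight_bytes > hard_limit;
}

static void on_zerocopy_send(repl_buf_accounting *a, size_t len) {
    a->queued_bytes -= len;     /* left the user-visible queue... */
    a->inflight_bytes += len;   /* ...but its memory is not reclaimable yet */
}

/* Called when an SO_EE_ORIGIN_ZEROCOPY notification covers this send. */
static void on_zerocopy_complete(repl_buf_accounting *a, size_t len) {
    a->inflight_bytes -= len;   /* pages released; chunk may be freed */
}
```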
Further reading on zero-copy can be found here. Note that this article states that zero-copy is most effective for large payloads, and experimentation is necessary to determine the minimum payload size. For Memorystore vector search cluster communication, enabling zero-copy in gRPC improved QPS by approximately 8.6%.
Zero-Copy Beyond Psync
Zero-copy can also optimize transmission to clients. In the current implementation, dictionary entries are first copied into the client object's write buffer and then copied again during transmission to the client socket, resulting in two memory copies. Using zero-copy eliminates the client socket copy. Similarly to the psync use case, implementing zero-copy for client transmission requires careful adjustments to the client’s write buffer management logic. The following considerations, while not exhaustive, outline key aspects to address: