Sonic is a Go library for network and I/O programming that provides developers with a consistent asynchronous model, with a focus on achieving the lowest possible latency and jitter in Go.
MIT License · 676 stars · 16 forks
Introduce a circular buffer that always returns continuous chunks #105
note: the same syscalls that make the mirrored buffer possible also make aeron possible. There's no other magic involved. Both aeron and sonic strive to use /dev/shm.
note: work in progress. I'm planning to test-drive this on edx in about a week, on a machine with plenty of RAM to spare.
Why do we need this?
To avoid memory allocations and copies in TCP codecs, such as WebSocket, HTTP or any other exchange-specific protocol. What we mean by a TCP codec is best understood through an example.
Say a computer wants to communicate with us reliably. This computer sends us bytes through TCP. Now TCP only deals with reading/writing bytes, so we need to agree on a protocol with the computer to interpret those bytes. Moreover, TCP is a stream transport. A single TCP read might return 1 or 1000 bytes. We don't know ahead of time. This is in contrast with packet-based transports such as UDP, where each network read returns a single packet. A packet is at most 64KB, so we know ahead of time how much memory to allocate to accommodate any packet.
The protocol we agree on, on top of TCP, is simple:
- Each message has a variable length. It can be 1 byte or 1GB.
- Since the length of each message is variable, we need some information on how big the message is.
- This information is encoded in a fixed-size header of 4 bytes.

In short, each message is preceded by 4 bytes that carry its size: |header|variable payload|
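A minimal Go sketch of this framing may help make it concrete. The byte order of the header is not stated in the description, so big-endian is an assumption here, and the names `encodeFrame` and `payloadLen` are hypothetical helpers, not part of sonic.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// headerLen is the fixed header size from the text: 4 bytes carrying the payload length.
const headerLen = 4

// encodeFrame prepends a 4-byte length header to payload.
// Assumption: big-endian byte order (the text does not specify one).
func encodeFrame(payload []byte) []byte {
	frame := make([]byte, headerLen+len(payload))
	binary.BigEndian.PutUint32(frame[:headerLen], uint32(len(payload)))
	copy(frame[headerLen:], payload)
	return frame
}

// payloadLen reads the payload length from the start of b.
// It reports false if fewer than headerLen bytes are available.
func payloadLen(b []byte) (int, bool) {
	if len(b) < headerLen {
		return 0, false
	}
	return int(binary.BigEndian.Uint32(b[:headerLen])), true
}

func main() {
	frame := encodeFrame([]byte("hi"))
	n, ok := payloadLen(frame)
	fmt.Println(frame, n, ok) // [0 0 0 2 104 105] 2 true
}
```

The sketch allocates a new slice per frame for clarity only; avoiding exactly that kind of allocation on the read path is what the rest of this description is about.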
Given the above, a single read from the network could give us the following bytes:
|2|00|4|0000|8|00001111|7|00
In the above example we have 3 complete messages of lengths 2, 4 and 8 and an incomplete message of length 7 (we only read the first 2 bytes of that message). A further TCP read call will probably return the leftover 5 bytes of the 4th message as well as some more (possibly incomplete) messages.
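The following sketch shows how a codec might split the bytes of one read into complete messages and a leftover tail. It assumes a big-endian 4-byte header, and `parseFrames` and `appendFrame` are hypothetical names used only for illustration. Unlike the simplified diagram, the byte counts here include the 4-byte headers.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const headerLen = 4

// parseFrames returns the payloads of all complete messages in buf and the
// number of bytes they occupy. Bytes in buf[consumed:] belong to a message
// that has not fully arrived yet.
func parseFrames(buf []byte) (payloads [][]byte, consumed int) {
	for {
		rest := buf[consumed:]
		if len(rest) < headerLen {
			return payloads, consumed
		}
		// Assumption: big-endian header.
		n := int(binary.BigEndian.Uint32(rest[:headerLen]))
		if len(rest)-headerLen < n {
			return payloads, consumed
		}
		payloads = append(payloads, rest[headerLen:headerLen+n])
		consumed += headerLen + n
	}
}

// appendFrame appends a header declaring n payload bytes, followed by only
// the first `arrived` (zero-valued) payload bytes.
func appendFrame(dst []byte, n, arrived int) []byte {
	var header [headerLen]byte
	binary.BigEndian.PutUint32(header[:], uint32(n))
	dst = append(dst, header[:]...)
	return append(dst, make([]byte, arrived)...)
}

func main() {
	var stream []byte
	stream = appendFrame(stream, 2, 2)
	stream = appendFrame(stream, 4, 4)
	stream = appendFrame(stream, 8, 8)
	stream = appendFrame(stream, 7, 2) // incomplete: only 2 of 7 bytes arrived

	payloads, consumed := parseFrames(stream)
	// complete messages, bytes consumed, leftover bytes
	fmt.Println(len(payloads), consumed, len(stream)-consumed) // 3 26 6
}
```

The returned payloads are sub-slices of the input, so nothing is copied; the open question is what to do with the leftover bytes before the next read.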
Now we look at how to interpret these messages. In the above example:
- We can process the first 3 messages of lengths 2, 4 and 8.
- We can't process the 4th message yet as it is incomplete. We need to read from the network again.
- The read syscall expects a slice of bytes. Say we initially allocated a slice of 16 bytes. So far we have used 2+4+8+2 (incomplete message) = 16 bytes. That means we have no space left in the current buffer. We can:
  - allocate a bigger buffer and copy only the 2 bytes of the 4th message into it
  - copy the 2 bytes of the 4th message to the beginning of the current buffer, overwriting what's there. This leaves us with 14 bytes to read into.
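The two options can be sketched in Go as follows. `grow` and `compact` are hypothetical helper names, and doubling the buffer in `grow` is an arbitrary choice for the example.

```go
package main

import "fmt"

// grow allocates a buffer twice as large and copies only the unread bytes
// buf[start:end] into it. It returns the new buffer and the new end offset.
func grow(buf []byte, start, end int) ([]byte, int) {
	bigger := make([]byte, 2*len(buf))
	n := copy(bigger, buf[start:end])
	return bigger, n
}

// compact moves the unread bytes buf[start:end] to the front of buf and
// returns the new end offset. buf[newEnd:] is then free for the next read.
func compact(buf []byte, start, end int) int {
	return copy(buf, buf[start:end])
}

func main() {
	buf := make([]byte, 16)
	buf[14], buf[15] = 0xAA, 0xBB // first 2 bytes of the incomplete 4th message

	bigger, end := grow(buf, 14, 16)
	fmt.Println(len(bigger), end) // 32 2

	end = compact(buf, 14, 16)
	fmt.Println(end, len(buf)-end) // 2 14
}
```

Both helpers copy the leftover bytes, and `grow` also allocates.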
But we don't want to allocate. That's expensive and unpredictable. We also don't want to copy. That's again expensive, although a bit more predictable. What if we could use a circular buffer instead?
Now, we can't use a normal circular buffer because each network call expects a contiguous slice of bytes. A circular buffer might wrap, hence returning us two slices to read into, which is incompatible with the read/write syscalls. We also can't use a bip_buffer as TCP is stream-based, not packet-based.
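To illustrate the wrap problem, here is a small sketch of a plain (non-mirrored) circular buffer in Go. The `ring` type and its `free` method are hypothetical and exist only for this example.

```go
package main

import "fmt"

// ring is a plain circular buffer over a single allocation.
type ring struct {
	data []byte
	head int // index of the next byte to write, always < len(data)
	size int // number of unread bytes currently stored
}

// free returns the writable regions. When the free space wraps around the
// end of data, it comes back as two separate slices.
func (r *ring) free() (first, second []byte) {
	n := len(r.data) - r.size // total free bytes
	if r.head+n <= len(r.data) {
		return r.data[r.head : r.head+n], nil
	}
	first = r.data[r.head:]
	second = r.data[:n-len(first)]
	return first, second
}

func main() {
	// 16-byte ring with 4 unread bytes at offsets 8..11; the writer is at 12.
	r := &ring{data: make([]byte, 16), head: 12, size: 4}
	a, b := r.free()
	fmt.Println(len(a), len(b)) // 4 8
}
```

Although 12 bytes are free, a single read into one slice can fill at most the 4 bytes before the wrap point.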
Given the above, we introduce a mirrored_buffer: a circular buffer that can always return a contiguous slice of bytes. This fully avoids memory allocations and copies for TCP based codecs.
Besides allocating and copying, we can take a third, extremely inefficient and mostly unpredictable route: invoke the read syscall for each header + message. For the above example, this results in 8 syscalls:
- for each message, read the 4 header bytes, parse them into an integer n and then read the payload of n bytes.

Syscalls are expensive and should be minimized, if not totally avoided, in latency-critical software such as trading systems. That's why stuff like io_uring or direct-memory-access into network cards exists.
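The read-per-message route can be sketched as below. An in-memory `bytes.Buffer` stands in for the socket, and each `Read` call on the hypothetical `countingReader` stands in for one read syscall; the header is assumed to be big-endian. All four messages are complete here so that the count is deterministic.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// countingReader counts Read calls made against the underlying reader.
type countingReader struct {
	r     io.Reader
	calls int
}

func (c *countingReader) Read(p []byte) (int, error) {
	c.calls++
	return c.r.Read(p)
}

// readMessage reads one 4-byte header and then exactly one payload.
// Note that it also allocates a fresh slice per message.
func readMessage(r io.Reader) ([]byte, error) {
	var header [4]byte
	if _, err := io.ReadFull(r, header[:]); err != nil {
		return nil, err
	}
	payload := make([]byte, binary.BigEndian.Uint32(header[:]))
	if _, err := io.ReadFull(r, payload); err != nil {
		return nil, err
	}
	return payload, nil
}

// countReadCalls writes one message per entry in lengths into a buffer,
// reads them back one by one and returns the number of Read calls made.
func countReadCalls(lengths []int) (int, error) {
	var stream bytes.Buffer
	for _, n := range lengths {
		var header [4]byte
		binary.BigEndian.PutUint32(header[:], uint32(n))
		stream.Write(header[:])
		stream.Write(make([]byte, n))
	}
	cr := &countingReader{r: &stream}
	for range lengths {
		if _, err := readMessage(cr); err != nil {
			return cr.calls, err
		}
	}
	return cr.calls, nil
}

func main() {
	calls, err := countReadCalls([]int{2, 4, 8, 7})
	if err != nil {
		panic(err)
	}
	fmt.Println(calls) // 8
}
```

On a real socket the count can be higher, since a read may return fewer bytes than requested and `io.ReadFull` then calls `Read` again.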
DIY
```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

const char* name = "/mirrored_buffer_test";

// Report the failing call and stop; continuing after a failed step would
// only produce confusing follow-up errors.
static void die(const char* what) {
    perror(what);
    exit(EXIT_FAILURE);
}

int main(void) {
    // The buffer size must be a multiple of the page size. Use one page.
    long page_size = sysconf(_SC_PAGE_SIZE);
    if (page_size == -1) {
        die("sysconf");
    }
    size_t size = (size_t)page_size;
    printf("page_size=%zu\n", size);

    // Reserve 2 * size bytes of contiguous address space. Nothing in it is
    // readable or writable yet (PROT_NONE).
    void* base_addr = mmap(NULL, 2 * size, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base_addr == MAP_FAILED) {
        die("mmap");
    }
    printf("base_addr=%p\n", base_addr);

    // Create the shared memory object that backs both halves.
    int fd = shm_open(name, O_RDWR | O_CREAT, S_IRUSR | S_IWUSR);
    if (fd < 0) {
        die("shm_open");
    }
    // Unlink the name right away; the memory stays alive while it is mapped.
    if (shm_unlink(name)) {
        die("shm_unlink");
    }
    if (ftruncate(fd, (off_t)size)) {
        die("ftruncate");
    }

    char* first_addr = (char*)base_addr;
    char* second_addr = first_addr + size;
    void* addr;
    printf("first_addr=%p\n", (void*)first_addr);
    printf("second_addr=%p\n", (void*)second_addr);

    // Map the same object twice, back to back, over the reserved range.
    addr = mmap(first_addr, size, PROT_READ | PROT_WRITE, MAP_FIXED | MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) {
        die("first mmap");
    }
    if ((char*)addr != first_addr) {
        exit(EXIT_FAILURE);
    }

    addr = mmap(second_addr, size, PROT_READ | PROT_WRITE, MAP_FIXED | MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) {
        die("second mmap");
    }
    if ((char*)addr != second_addr) {
        exit(EXIT_FAILURE);
    }

    // The mappings keep the memory alive; the descriptor is no longer needed.
    if (close(fd)) {
        die("close");
    }

    // Write some bytes in the first half of the first mapping.
    // All these writes will be seen in the second mapping.
    char* p = first_addr;
    for (size_t i = 0; i < size / 2; i++) {
        *p = 1;
        p++;
    }

    // Dump the first mapping, then the second; both print the same bytes.
    p = first_addr;
    for (size_t i = 0; i < size; i++) {
        printf("%d", *p);
        p++;
    }
    printf("\n\n");

    p = second_addr;
    for (size_t i = 0; i < size; i++) {
        printf("%d", *p);
        p++;
    }
    printf("\n");

    // Keep the process alive so its mappings can be inspected, e.g. with pmap.
    for (;;) {
        pause();
    }
}
```
Output:
The program prints the page size and the three addresses, then the contents of both mappings. The bytes written through the first mapping are also visible through the second one.
Inspecting the running process with sudo pmap -x shows the mappings. If we just do the first mapping (MAP_ANONYMOUS | MAP_PRIVATE with PROT_NONE), the reserved range appears as a single anonymous mapping. Once both fd mappings are in place, we can see that the two fd mappings of the shm handle replace the single anonymous one.
Benchmarks
See BenchmarkMirroredBuffer.