Snabb Switch: kernel-bypass networking illustrated

lukego commented 8 years ago

Snabb Switch is a networking application that runs on Linux. However, it does not typically use Linux's networking functionality. Instead it negotiates with the kernel to take control of whole PCI network devices and perform I/O directly without using the kernel as a middle-man. This is kernel-bypass networking.

Sounds abstract? Let us illustrate what that really means.

We will use strace to review the system calls that Snabb Switch makes when it runs an application that accesses the PCI network device with address 0000:01:00.0.

Here we go!

pci device access

First we use sysfs to discover what kind of PCI device 0000:01:00.0 is:

open("/sys/bus/pci/devices/0000:01:00.0/vendor", O_RDONLY) = 4
read(4, "0x8086\n", 4096)               = 7
open("/sys/bus/pci/devices/0000:01:00.0/device", O_RDONLY) = 4
read(4, "0x10fb\n", 4096)               = 7

Good: it's an Intel 82599 10G NIC (Vendor = 0x8086, Device = 0x10fb). We happen to have a driver for this device built into Snabb Switch.
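
For readers who want to try the same lookup, here is a minimal Python sketch of it. This is not Snabb Switch's code (that lives in Lua and C); the function names and the `sysfs_root` parameter are made up for illustration, and only the sysfs paths and the ID values come from the trace above.

```python
from pathlib import Path

# Default location of PCI devices in sysfs on Linux.
SYSFS_PCI_DEVICES = Path("/sys/bus/pci/devices")


def read_pci_id(address, sysfs_root=SYSFS_PCI_DEVICES):
    """Return (vendor, device) as integers for a PCI address like '0000:01:00.0'.

    `sysfs_root` is a hypothetical parameter so the function can be pointed
    at a fake directory tree for testing.
    """
    device_dir = Path(sysfs_root) / address
    # Each file holds a hex string such as "0x8086\n".
    vendor = int((device_dir / "vendor").read_text().strip(), 16)
    device = int((device_dir / "device").read_text().strip(), 16)
    return vendor, device


def is_intel_82599(vendor, device):
    """True for the vendor/device pair seen in the trace above."""
    return (vendor, device) == (0x8086, 0x10FB)
```

Calling `read_pci_id("0000:01:00.0")` on the machine from the trace would return the pair `(0x8086, 0x10fb)`.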

We ask the kernel to please unbind this PCI device from its kernel driver so that it will be available to us:

open("/sys/bus/pci/devices/0000:01:00.0/driver/unbind", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 5
write(5, "0000:01:00.0", 12)            = 12
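
The unbind step is just a file write, so it can be sketched in a few lines of Python. The function name and the `sysfs_root` parameter are hypothetical; the path and the written string are the ones in the trace. On a real system this needs root.

```python
from pathlib import Path


def unbind_from_kernel_driver(address, sysfs_root="/sys/bus/pci/devices"):
    """Ask the kernel driver currently bound to `address` to release the device.

    Returns False if no driver is bound (there is no 'driver' entry in sysfs).
    `sysfs_root` is a hypothetical parameter that allows testing on a fake tree.
    """
    driver_dir = Path(sysfs_root) / address / "driver"
    if not driver_dir.exists():
        return False
    # Writing the PCI address to driver/unbind detaches the device.
    with open(driver_dir / "unbind", "w") as f:
        f.write(address)
    return True
```

Opening the file in `"w"` mode produces the same `O_WRONLY|O_CREAT|O_TRUNC` flags that show up in the strace output.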

We ask the kernel to map the device's configuration registers into our process's virtual address space:

open("/sys/bus/pci/devices/0000:01:00.0/resource0", O_RDWR|O_SYNC) = 5
mmap(NULL, 131072, PROT_READ|PROT_WRITE, MAP_SHARED, 5, 0) = 0x7fcc1f63b000

Now any time we access the 128KB memory area starting at address 0x7fcc1f63b000 the memory access will automatically be implemented as a callback into the NIC. This is memory-mapped I/O ("MMIO"). Each 32-bit value within this memory region maps onto a configuration register in the PCI device. Intel have a big PDF file (82599 data sheet) explaining what registers exist and what their values mean. We wrote our driver by reading that document and poking the right values into the right register addresses.

This MMIO register access is implemented directly by the CPU and is invisible to the kernel. (We won't see any register access here in the strace log because the kernel does not even know it is happening.)
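
To make the idea concrete, here is a small Python sketch of 32-bit register access through a shared mapping. It is an illustration only: the helper names are invented, and Python gives no guarantee that each access becomes exactly one 32-bit load or store, which real MMIO needs (a driver would typically go through a volatile 32-bit pointer in C or an FFI). The sketch works on any file of the right size, so it can be tried on an ordinary file instead of a real `resource0`.

```python
import mmap
import os
import struct

BAR0_SIZE = 131072  # 128KB, the mmap length seen in the trace


def map_registers(path, size=BAR0_SIZE):
    """Map `size` bytes of `path` read/write and shared, like the mmap call above."""
    fd = os.open(path, os.O_RDWR | os.O_SYNC)
    try:
        return mmap.mmap(fd, size, flags=mmap.MAP_SHARED,
                         prot=mmap.PROT_READ | mmap.PROT_WRITE)
    finally:
        os.close(fd)  # the mapping stays valid after the descriptor is closed


def read_reg(regs, offset):
    """Read the 32-bit little-endian value at byte `offset`."""
    return struct.unpack_from("<I", regs, offset)[0]


def write_reg(regs, offset, value):
    """Write a 32-bit little-endian value at byte `offset`."""
    struct.pack_into("<I", regs, offset, value)
```

With a real device the path would be `/sys/bus/pci/devices/0000:01:00.0/resource0`, and the register offsets would come from the data sheet.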

dma memory

Now we want a memory area in our process that the NIC can read packets from and write packets to using Direct Memory Access (DMA). The NIC will directly read and write to the RAM that belongs to our process. This allows us to transfer packets without any involvement from the kernel.

Really we want three memory areas:

Receive Descriptor Ring where we write the addresses of buffers where we want packets to be stored.
Transmit Descriptor Ring where we write the addresses of buffers that we want to be transmitted.
Packet memory that these addresses refer to.

Here is how we set that up.

First we allocate a huge page of memory. This is a block of memory (2MB or 1GB on x86) that is physically contiguous. This is important because the NIC deals in physical addresses and the descriptor rings are too large to fit on an ordinary 4KB page. (Alternatively we could use the CPU IOMMU feature to share our virtual memory map with the PCI device but we don't consider this hardware mature enough to depend on yet.)

There are several ways to obtain a hugetlb page on Linux. We use the System V shared memory API.

shmget(IPC_PRIVATE, 2097152, IPC_CREAT|SHM_HUGETLB|0600) = 7995392
shmat(7995392, 0, 0)                    = 0x7fcc1e200000

Now we have a chunk of memory in our address space. To make this suitable for DMA we need to "lock" this memory to its current physical address and resolve what that physical address is so that we can tell the NIC.

mlock(0x7fcc1e200000, 2097152)          = 0
open("/proc/self/pagemap", O_RDONLY)    = 6
pread(6, "\0r\366\0\0\0\0\206", 8, 274442686464) = 8
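
The numbers in that pread call can be checked by hand. /proc/self/pagemap holds one 8-byte entry per virtual page; bits 0-54 of an entry are the page frame number and bit 63 says whether the page is present. The Python sketch below redoes the arithmetic with the values from the trace. The function names are invented for illustration, and it assumes the 4096-byte base page size.

```python
import struct

PAGE_SIZE = 4096            # base page size used to index /proc/self/pagemap
PFN_MASK = (1 << 55) - 1    # bits 0-54 of an entry hold the page frame number
PRESENT_BIT = 1 << 63       # bit 63: page is present in RAM


def pagemap_offset(virtual_address):
    """Byte offset in /proc/self/pagemap of the 8-byte entry for an address."""
    return (virtual_address // PAGE_SIZE) * 8


def decode_pagemap_entry(raw):
    """Return (present, physical_address) from an 8-byte pagemap entry."""
    (entry,) = struct.unpack("<Q", raw)
    present = bool(entry & PRESENT_BIT)
    physical = (entry & PFN_MASK) * PAGE_SIZE
    return present, physical


# Values copied from the strace output above.
offset = pagemap_offset(0x7FCC1E200000)
present, physical = decode_pagemap_entry(b"\0r\366\0\0\0\0\206")
print(offset, present, hex(physical))
```

The offset comes out as 274442686464, matching the pread call, and the entry decodes to a present page at physical address 0xf67200000. Note that newer kernels report the page frame number as zero unless the process has CAP_SYS_ADMIN.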

Now for a small flourish: we remap the virtual address in our process to be the same as the physical address but with some high tag bits added. This is convenient for two reasons. First, it makes it very simple and efficient to translate virtual addresses into physical addresses: just mask off the tag bits. Second, it means that when multiple Snabb Switch processes map the same DMA memory they will all map it to the same address. This means that pointers into DMA memory are valid in any Snabb Switch process, which is handy when they cooperate to process packets.

shmat(7995392, 0x500f67200000, 0)       = 0x500f67200000
mlock(0x500f67200000, 2097152)          = 0
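
The translation can be sketched in Python as follows. The pagemap entry in the trace decodes to physical address 0xf67200000 (page frame number 0xf67200 times the 4096-byte page size), and shmat was asked for 0x500f67200000, so the tag appears to be 0x500000000000. That tag value is inferred from this one trace, the function names are invented, and the sketch assumes physical addresses never reach the tag bits.

```python
# Tag inferred from the trace: virtual 0x500f67200000 for physical 0xf67200000.
DMA_TAG = 0x500000000000


def physical_to_virtual(physical):
    """Virtual address at which DMA memory with this physical address is mapped."""
    return physical | DMA_TAG


def virtual_to_physical(virtual):
    """Recover the physical address by masking off the tag bits."""
    return virtual & ~DMA_TAG
```

Because the mapping depends only on the physical address, every process that attaches the same segment this way ends up with the same pointer value.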

and..

That is it!

The real action is still to come, of course, but that is a topic for another time. We wanted to illustrate the interactions between Snabb Switch and the kernel and that is complete. The rest of the story does not involve the kernel and can't be seen with strace.

lukego commented 8 years ago

Here are some references into the code for the curious:

memory.c allocates huge pages and resolves their physical addresses. (This should probably be rewritten in Lua with ljsyscall.)
memory.lua is the layer above where Lua code gets memory for DMA.
pci.lua is where we find and take over PCI devices. (There is some helper code in pci.c that should also probably be rewritten in Lua with ljsyscall.)
The production Intel device driver is intel10g.lua which has helper code in register.lua.
Now that we have more experience writing drivers I wonder if we could reduce the code somewhat to look more like the prototype driver in SnabbCo/snabbswitch#561.

This is some of the first code that was written in Snabb Switch and back then we had not realized how wonderful ljsyscall is so we were still writing C code when we needed system call glue. It would be wonderful to replace that C code because, actually, C is a lousy glue language.

cpach commented 8 years ago

Very fascinating method!

RichardBarrell commented 8 years ago

One snag I don't understand: how do you ensure that the address that you're going to map the DMA memory onto isn't already mapped by something else in your process, such as part of the heap or a library's text segment?

lukego commented 8 years ago

@RichardBarrell I believe this part of the address space is not used on Linux/x86-64 but this is a bit unscientific. It would be good to have a strong reference for this.

Good catch :).

RichardBarrell commented 8 years ago

If I'm reading this right, Linux's man page for shmat(2) implies that it'll only override an existing mapping if you pass the SHM_REMAP flag. Since memory.c passes 0 in the shmflg parameter, it seems that this is safe (an attempt to map an address that's in use will not clobber the existing mapping) but might lead to spurious failures (the attempt will fail at the shmat() call).

proofit404 commented 8 years ago

@lukego thank you for this post.

I can't understand one thing. If we work directly with a network device (something like a network card or a modem for mobile networks), we receive all packets sent to it. It looks like we are listening on all ports (or sockets) on it. What do we need to do if we receive a packet destined for another application?

Sorry if I am asking a stupid question.

lukego commented 8 years ago

Good question @proofit404 :)

Our main application area is high-capacity networking applications. These usually do not involve passing packets through the Linux kernel on the host where Snabb Switch is running. Rather we are usually forwarding packets between network interfaces with some transformation/filtering/logging/etc that the user wants to achieve.

Imagine the data center of an Internet Service Provider. Typically you will find a bunch of network equipment from vendors like Ericsson, Cisco, Juniper, Huawei, etc. The equipment will implement applications with names like Traffic Policy Enforcement, Border Gateway, GGSN, Charging Gateway, RNC, Firewall, Carrier Grade NAT, and so on. This equipment is usually based on proprietary traffic processing hardware and software. The Linux kernel is seldom used to process packets in this environment (though it can be for certain applications and particularly in smaller networks).

Our primary mission here is to introduce open source software running on x86 servers to these high-capacity networks. The Linux kernel has failed in this market for a reason - it doesn't have the performance characteristics that these users are looking for - so our typical applications don't interface with it. Instead we interface directly with the physical network.

Having said that: we can always forward packets to the kernel using any I/O interface that it supports e.g. tap device accelerated with /dev/vhost-net. Just have to write apps for them.

proofit404 commented 8 years ago

@lukego Thanks for such a detailed answer. I'm really impressed. It looks like rocket science from the point of view of a typical web developer :smiley:

pavel-odintsov commented 8 years ago

Fantastic! Luke, if you wrote a book about Snabb internals I would be very glad to be the first buyer :)
