neocturne / fastd

Fast and Secure Tunnelling Daemon

Handle Package Delivery Related Syscalls with Batch Processing #7

Open lemoer opened 4 years ago

lemoer commented 4 years ago

(Originally raised in https://github.com/freifunk-gluon/gluon/issues/2019)

Currently fastd needs one syscall to obtain or deliver every packet from or to the kernel. The idea is to avoid this overhead by obtaining and delivering multiple packets per syscall.

@NeoRaider wrote:

With Kernel 5.4, we have everything we need in io_uring, making any sendmmsg/recvmmsg-based solution inferior (and we don't need additional kernel patches). So if we make the effort to rework the way fastd handles packets, it should be based on io_uring.

A preliminary test using recvmmsg/sendmmsg showed a performance gain of approximately 30% on a small MIPS-based router with a batch size of 64 (see the original thread for details).

CodeFetch commented 4 years ago

I remember having looked at this with lemoer. There was still a performance disadvantage compared to an in-kernel tunnel like WireGuard, because of the copying of packets between user and kernel space.

Is there a possibility to introduce something comparable to MSG_ZEROCOPY with io_uring? I've always wondered why neither Jason Donenfeld nor the OpenVPN or tinc teams worked on exposing the virtualization TAP sockets...

CodeFetch commented 4 years ago

Hm... I just had a look at the current state of the code. SOCK_ZEROCOPY has already been implemented: https://elixir.bootlin.com/linux/v5.8-rc3/source/drivers/net/tap.c#L693

So the only big performance difference between an in-kernel tunnel and a userland one is the additional copy from an skb to an iov. We can't get rid of that one, because skbs can't be forced to live in the user memory region. Actually, I expected the impact of the copying to be much lower...

By the way @lemoer, we've already talked to NeoRaider about io_uring last year on IRC...

CodeFetch commented 4 years ago

Here's a proof-of-concept patched fastd branch which lemoer and I created: https://github.com/CodeFetch/fastd/tree/final

Indeed, the syscall overhead can be reduced with io_uring. Unfortunately, a kernel version >5.7 is required for poll-retry/fastpoll, which is crucial for the performance gain.

Furthermore, besides a number of minor bugs, some race conditions seem to occur unless the operations on an individual socket are hardlinked. This patch works around the issue, which introduces a slight performance penalty; it might have been fixed upstream already and needs further testing. I'll open a pull request once NeoRaider has reworked the buffer management to reduce the allocation overhead.

CodeFetch commented 3 years ago

@NeoRaider are you done with the buffer pool? I've got a commit somewhere where I started to implement a dynamic buffer pool (one that grows under high demand and shrinks when buffers are no longer needed). It looks like your changes are compatible. A dynamic buffer pool is needed to get good performance out of io_uring while keeping a low memory footprint.

BTW... what about introducing shared memory to implement threading support? Have you given it a thought already? I guess that with io_uring, crypto performance will become the bottleneck. Is it possible to do the crypto with packets out of order? Otherwise I'd at least hope to make use of more cores on the servers with multiple worker processes.

neocturne commented 3 years ago

The new buffer implementation is finished.

I don't understand the question about shared memory - threads always share their memory? Doing packet processing in threads should be fine on multi-core systems (but it will require some careful locking and/or barriers to ensure that no state is changed when the worker threads do not expect it).

I think packet processing for each peer should be serialized to avoid introducing additional reordering (fastd can handle packets reordered by up to 64 sequence numbers, but the transported network protocols may not), but as multi-core systems usually play a central role in a network and are connected to many peers, this could still provide some speedup.

CodeFetch commented 3 years ago

@NeoRaider Sorry, I meant subprocesses, not pthreads - shared memory between processes. Pthreads wouldn't bring a performance increase, I guess, would they? Indeed I'm aiming to make use of multiple cores, which isn't possible with pthreads alone, is it?

neocturne commented 3 years ago

Using multiple cores is the main use case of threads. In fact, the Linux kernel does not really distinguish between processes and threads - a thread is just a process that shares its PID, memory, file descriptors, and a few other things with its parent.

Using multiple processes as workers only makes sense when you need to isolate them from each other, for example to contain crashes or security issues. For fastd, multithreading is the way to go: It should be easier to implement for our use case and uses fewer resources (as almost all memory is shared).