gsliepen / tinc

a VPN daemon
http://tinc-vpn.org/

Performance improvements via TSO/GRO and UDP_SEGMENT #439

Open byo-books opened 11 months ago

byo-books commented 11 months ago

This blog post by Tailscale sounds promising. It points out that the Linux TUN device supports TSO/GRO offloading.
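For reference, here is a rough, untested sketch of how those offloads get enabled on the TUN fd (the function name and device name are just placeholders, not anything tinc has today). Once `IFF_VNET_HDR` is set, every read/write on the fd is prefixed with a `struct virtio_net_hdr` and can carry a coalesced segment much larger than the MTU:

```c
/* Hedged sketch: enable TSO-style offloads on a Linux TUN device.
 * Uses the long-standing IFF_VNET_HDR / TUNSETOFFLOAD ioctls. */
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/if.h>
#include <linux/if_tun.h>
#include <linux/virtio_net.h>

int open_tun_with_offloads(const char *name) {
    int fd = open("/dev/net/tun", O_RDWR);
    if (fd < 0)
        return -1;

    struct ifreq ifr;
    memset(&ifr, 0, sizeof(ifr));
    /* IFF_VNET_HDR makes every read/write carry a struct virtio_net_hdr,
     * which describes the GSO/checksum state of the packet. */
    ifr.ifr_flags = IFF_TUN | IFF_NO_PI | IFF_VNET_HDR;
    strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);
    if (ioctl(fd, TUNSETIFF, &ifr) < 0)
        goto fail;

    int hdr_len = sizeof(struct virtio_net_hdr);
    if (ioctl(fd, TUNSETVNETHDRSZ, &hdr_len) < 0)
        goto fail;

    /* Ask the kernel to hand us (and accept from us) large TSO'd segments,
     * leaving checksumming to the "hardware", i.e. to us. */
    unsigned offloads = TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6;
    if (ioctl(fd, TUNSETOFFLOAD, offloads) < 0)
        goto fail;

    return fd;
fail:
    close(fd);
    return -1;
}
```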

Also, there is another post about using GSO (Generic Segmentation Offload) to send multiple UDP packets from a single large buffer.
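Concretely, that is the `UDP_SEGMENT` socket option (Linux 4.18+, exposed via `<netinet/udp.h>` on recent glibc, otherwise `<linux/udp.h>`). An untested sketch of a single large send being split into `gso_size`-byte datagrams by the kernel, assuming a connected UDP socket:

```c
/* Hedged sketch: UDP GSO via UDP_SEGMENT. One sendmsg() of a large buffer
 * is segmented by the kernel into gso_size-byte datagrams, so userspace
 * traverses the stack once per batch instead of once per packet. */
#define _GNU_SOURCE
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/udp.h>   /* SOL_UDP, UDP_SEGMENT */

ssize_t send_udp_gso(int fd, const void *buf, size_t len, uint16_t gso_size) {
    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };

    /* Per-call control message carrying the segment size; alternatively
     * setsockopt(fd, SOL_UDP, UDP_SEGMENT, ...) sets it for the whole socket. */
    char cbuf[CMSG_SPACE(sizeof(gso_size))] = {0};
    struct msghdr msg = {
        .msg_iov = &iov,
        .msg_iovlen = 1,
        .msg_control = cbuf,
        .msg_controllen = sizeof(cbuf),
    };

    struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
    cm->cmsg_level = SOL_UDP;
    cm->cmsg_type = UDP_SEGMENT;
    cm->cmsg_len = CMSG_LEN(sizeof(gso_size));
    memcpy(CMSG_DATA(cm), &gso_size, sizeof(gso_size));

    return sendmsg(fd, &msg, 0);
}
```

On kernels without support, the sendmsg() fails and the caller could fall back to one datagram per call.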

Both techniques reduce network stack traversals. Unfortunately, these features do not seem to be well documented.

splitice commented 3 months ago

If you look at benchmarks of tinc, you will quickly find that for many real-world workloads the largest user of CPU time is TUN/TAP I/O.

I did some work on sendmmsg in the past but primarily ran into architectural issues: tinc was never built to handle a queue of packets (but this can change!).
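For anyone picking this up later, a minimal untested sketch of what flushing a hypothetical outgoing-packet queue with sendmmsg() could look like (the queue structure does not exist in tinc today):

```c
/* Hedged sketch: flush a queue of outgoing UDP packets with one sendmmsg()
 * call instead of one sendto() per packet. */
#define _GNU_SOURCE
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

#define BATCH 64

struct out_packet {             /* hypothetical queued packet */
    struct sockaddr_in dst;
    unsigned char data[1500];
    size_t len;
};

int flush_queue(int fd, struct out_packet *q, unsigned n) {
    struct mmsghdr msgs[BATCH];
    struct iovec iovs[BATCH];

    if (n > BATCH)
        n = BATCH;

    for (unsigned i = 0; i < n; i++) {
        iovs[i].iov_base = q[i].data;
        iovs[i].iov_len = q[i].len;
        memset(&msgs[i], 0, sizeof(msgs[i]));
        msgs[i].msg_hdr.msg_iov = &iovs[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
        msgs[i].msg_hdr.msg_name = &q[i].dst;
        msgs[i].msg_hdr.msg_namelen = sizeof(q[i].dst);
    }

    /* One syscall; returns how many messages were actually sent. */
    return sendmmsg(fd, msgs, n, 0);
}
```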

If you really want performance for tinc, build ktincd (a Linux kernel implementation of tinc). I've debated it numerous times.

It was originally going to be one of my next experiments after the AES protocol changes were merged (but they never were).

The networking side of it wouldn't be too hard; tinc is structured well enough that adapting it to a Linux netdev would not be too difficult. Configuration, though, is potentially a real nightmare.

gsliepen commented 3 months ago

Another option might be investigating if io_uring can be used, and what performance improvements that can give.

splitice commented 3 months ago

I don't know if io_uring is really worth the effort, to be honest. I don't have strong data to back this up, however.

Packet mmap appears to be the fastest way to read from and send to the TAP device.

See for example https://github.com/google/gvisor/blob/master/pkg/tcpip/link/fdbased/mmap_unsafe.go#L50
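Very roughly (and untested), the RX side of such a ring with TPACKET_V1 looks something like the following. Note this only works on an AF_PACKET socket bound to the tap interface, not on the tap fd itself; the gvisor code linked above does a more careful version of the same thing:

```c
/* Hedged sketch: a TPACKET_V1 RX ring on an AF_PACKET socket bound to the
 * TAP interface. Frames are read from shared memory without a syscall per
 * packet. Error handling and cleanup trimmed. */
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <net/if.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>

#define FRAME_SIZE  2048
#define FRAME_COUNT 512

int open_rx_ring(const char *ifname, char **ring_out) {
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0)
        return -1;

    struct tpacket_req req = {
        .tp_block_size = FRAME_SIZE * FRAME_COUNT,  /* one big block */
        .tp_block_nr   = 1,
        .tp_frame_size = FRAME_SIZE,
        .tp_frame_nr   = FRAME_COUNT,
    };
    if (setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req)) < 0)
        return -1;

    char *ring = mmap(NULL, (size_t)req.tp_block_size * req.tp_block_nr,
                      PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (ring == MAP_FAILED)
        return -1;

    struct sockaddr_ll sll = {
        .sll_family   = AF_PACKET,
        .sll_protocol = htons(ETH_P_ALL),
        .sll_ifindex  = (int)if_nametoindex(ifname),
    };
    if (bind(fd, (struct sockaddr *)&sll, sizeof(sll)) < 0)
        return -1;

    *ring_out = ring;
    return fd;
}

/* Walk the ring: a frame is ours when TP_STATUS_USER is set; hand it back
 * to the kernel by resetting the status to TP_STATUS_KERNEL. */
void drain_ring(char *ring, void (*handle)(const void *pkt, unsigned len)) {
    for (unsigned i = 0; i < FRAME_COUNT; i++) {
        struct tpacket_hdr *hdr = (struct tpacket_hdr *)(ring + i * FRAME_SIZE);
        if (!(hdr->tp_status & TP_STATUS_USER))
            continue;
        handle((char *)hdr + hdr->tp_mac, hdr->tp_snaplen);
        hdr->tp_status = TP_STATUS_KERNEL;
    }
}
```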

However, tinc doesn't have the architecture in place for batching on the TAP side, and that's what holds me back. I'm not certain I want to make that level of change without guidance.

gsliepen commented 3 months ago

The advantage of io_uring is that you don't have to batch things at all in the application. You can still do single packet read()/write()/send()/recv() calls, but instead of them being system calls, you enqueue them on the io_uring. You can also have buffers shared between userspace and kernelspace, so you can theoretically avoid copies being made. However, I don't know how well that works compared to packet mmap.
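To illustrate, here's an untested liburing-based sketch (handle_packet is a placeholder, not a tinc function): per-packet reads on the tun fd become SQEs, and a batch of them is submitted with a single syscall. The shared-buffer part would be io_uring_register_buffers() plus io_uring_prep_read_fixed(), which this sketch skips for simplicity.

```c
/* Hedged sketch using liburing: per-packet reads on the tun fd are queued
 * as SQEs, so a batch of them costs one io_uring_submit() instead of one
 * read() each. Error handling trimmed. */
#include <liburing.h>
#include <stdlib.h>

#define QUEUE_DEPTH 256
#define PKT_BUF     2048

void handle_packet(void *pkt, int len);   /* hypothetical tinc hand-off */

void tun_read_loop(int tun_fd) {
    struct io_uring ring;
    io_uring_queue_init(QUEUE_DEPTH, &ring, 0);

    /* Keep a handful of reads in flight; each completion is one packet. */
    for (int i = 0; i < 64; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        void *buf = malloc(PKT_BUF);
        io_uring_prep_read(sqe, tun_fd, buf, PKT_BUF, 0);
        io_uring_sqe_set_data(sqe, buf);
    }
    io_uring_submit(&ring);

    for (;;) {
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        void *buf = io_uring_cqe_get_data(cqe);
        int len = cqe->res;                 /* bytes read, or -errno */
        io_uring_cqe_seen(&ring, cqe);

        if (len > 0)
            handle_packet(buf, len);

        /* Re-arm a read with the same buffer. */
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, tun_fd, buf, PKT_BUF, 0);
        io_uring_sqe_set_data(sqe, buf);
        io_uring_submit(&ring);
    }
}
```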