nanovms / nanos

A kernel designed to run one and only one application in a virtualized environment
https://nanos.org
Apache License 2.0

investigate zero-copy socket writes #1299

Open · wjhun opened this issue 4 years ago

wjhun commented 4 years ago

From a cursory look it appears that we could potentially implement zero-copy on socket writes by eliminating the TCP_WRITE_FLAG_COPY flag on calls to tcp_write when SO_ZEROCOPY / MSG_ZEROCOPY is specified. User pages not under the domain of the pagecache are, in a sense, pinned already, and pages within the pagecache could be pinned by taking an extra refcount on the pagecache_page. Implementation of socket error queues would also be necessary to allow completion notification to the application.
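For context, a minimal sketch of the flag selection described above, assuming the zero-copy request is passed down to the TCP send path as a simple flag (the function and its signature are illustrative, not existing Nanos code):

```c
#include "lwip/tcp.h"

/* zerocopy: nonzero when SO_ZEROCOPY was set on the socket and MSG_ZEROCOPY
 * was passed to the send call. The caller must keep the buffer pinned (e.g.
 * by taking an extra refcount on the pagecache_page) until the data has been
 * ACKed and a completion has been queued on the socket error queue. */
static err_t send_tcp_data(struct tcp_pcb *pcb, const void *buf, u16_t len,
                           int zerocopy, int more)
{
    u8_t flags = more ? TCP_WRITE_FLAG_MORE : 0;

    if (!zerocopy)
        flags |= TCP_WRITE_FLAG_COPY;  /* conventional path: lwIP copies into its own pbufs */

    /* Without TCP_WRITE_FLAG_COPY, lwIP keeps a reference to buf instead of
     * copying it, so the user data must remain valid and unmodified until
     * the corresponding TCP data is acknowledged. */
    return tcp_write(pcb, buf, len, flags);
}
```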

This could potentially yield a significant performance benefit in cases such as large static page loads when the service supports zero copy (which requires that user buffers remain unmodified until after sent TCP data is acknowledged), but some further exploration might be necessary to verify that in fact the zero copy path - from lwIP through our existing PV nic drivers - will work as expected. Furthermore, note that SO_ZEROCOPY is a hint to the kernel to use zero-copy if available - with a guarantee that completion notifications will be returned - and not a guarantee that copying will be avoided (so a non-compliant driver could result in use of TCP_WRITE_FLAG_COPY with completion notifications).

https://www.kernel.org/doc/html/v4.15/networking/msg_zerocopy.html
https://blogs.oracle.com/linux/zero-copy-networking-in-uek6
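For reference, the application-side contract described in those documents looks roughly like the sketch below (standard Linux API, IPv4/TCP case; the MSG_ERRQUEUE handling is what the socket error queue support mentioned above would have to provide):

```c
#include <linux/errqueue.h>
#include <netinet/in.h>
#include <poll.h>
#include <stdio.h>
#include <sys/socket.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

/* Send one buffer with MSG_ZEROCOPY and read its completion notification.
 * buf must not be modified or freed until the notification has arrived. */
int send_zerocopy(int fd, const void *buf, size_t len)
{
    int one = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) < 0)
        return -1;              /* zero-copy not supported for this socket */

    if (send(fd, buf, len, MSG_ZEROCOPY) < 0)
        return -1;

    /* Completion notifications arrive on the socket error queue (POLLERR). */
    struct pollfd pfd = { .fd = fd, .events = 0 };
    if (poll(&pfd, 1, -1) < 0 || !(pfd.revents & POLLERR))
        return -1;

    char control[128];
    struct msghdr msg = { .msg_control = control, .msg_controllen = sizeof(control) };
    if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0)
        return -1;

    for (struct cmsghdr *cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
        if (cm->cmsg_level == SOL_IP && cm->cmsg_type == IP_RECVERR) {
            struct sock_extended_err *serr = (struct sock_extended_err *)CMSG_DATA(cm);
            if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
                continue;
            /* [ee_info, ee_data] is the range of completed zero-copy sends. */
            if (serr->ee_code & SO_EE_CODE_ZEROCOPY_COPIED)
                printf("kernel fell back to copying for sends %u..%u\n",
                       serr->ee_info, serr->ee_data);
        }
    }
    return 0;
}
```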

francescolavra commented 3 years ago

Did some testing without the TCP_WRITE_FLAG_COPY flag on calls to tcp_write. The zero-copy path from lwIP through the nic driver does work as expected (at least with the virtio-net driver), in the sense that data from the user buffer is correctly sent to the nic, but it doesn't result in an overall performance gain; in fact, I saw a slight degradation (on the order of a few percentage points), and that is without the socket error queue messaging, which would add further overhead when implemented.

The reason the zero-copy path brings no benefit is that the savings from avoiding the memory copy are outweighed by the overhead of handling an additional buffer in each network packet: when data is copied, each packet can be sent as a single contiguous buffer, whereas when it is not copied, the packet headers must be allocated in a separate buffer and chained to the user data buffer. At the nic driver level, the physical address then needs to be retrieved for each buffer of a packet, and this is relatively expensive.
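Roughly, the difference between the two cases looks like the following sketch, which uses lwIP's pbuf API directly for illustration (this is not the actual netsock/tcp_write code path):

```c
#include "lwip/pbuf.h"

/* Copy path: one contiguous PBUF_RAM allocation holds headers and payload,
 * so the driver deals with a single buffer (and a single physical address). */
struct pbuf *build_packet_copy(const void *data, u16_t len)
{
    struct pbuf *p = pbuf_alloc(PBUF_TRANSPORT, len, PBUF_RAM);
    if (p)
        pbuf_take(p, data, len);        /* memcpy of the user data into the pbuf */
    return p;
}

/* Zero-copy path: twice the allocations per packet, and the driver has to
 * resolve a physical address for each element of the resulting chain. */
struct pbuf *build_packet_zerocopy(const void *data, u16_t len)
{
    struct pbuf *hdr = pbuf_alloc(PBUF_TRANSPORT, 0, PBUF_RAM);  /* headers only */
    struct pbuf *payload = pbuf_alloc(PBUF_RAW, len, PBUF_REF);  /* no data copy */
    if (!hdr || !payload) {
        if (hdr)
            pbuf_free(hdr);
        if (payload)
            pbuf_free(payload);
        return NULL;
    }
    payload->payload = (void *)data;    /* point at the (pinned) user buffer */
    pbuf_cat(hdr, payload);             /* chain: headers -> user data */
    return hdr;
}
```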

Below is an annotated ftrace plot obtained when streaming TCP data from Nanos to the host machine, without zero-copy:

[annotated ftrace plot: non-zero-copy TCP streaming]

The time spent in runtime_memcpy is non-negligible (24.6k) but overall doesn't contribute much.

Below is the ftrace plot obtained with zero-copy:

[ftrace plot: zero-copy TCP streaming]

The time spent in runtime_memcpy is considerably reduced (9.5k), but other functions involved in translating virtual to physical addresses (such as kern_pointer_from_pteaddr, physical_from_virtual_locked and table_find) take more time.

When trying zero-copy transmission on the loopback interface (where no physical addresses are needed), I did see a performance increase (on the order of 10%), but in Linux zero-copy does not apply to the loopback interface.

francescolavra commented 3 years ago

I did some more testing with current master, and it is still the case that copying user data to kernel buffers on a socket send is more efficient than the zero-copy approach (with Nanos running on qemu and sending TCP data to the host). Even though retrieving the physical address for a kernel virtual address is now almost free, this doesn't apply to user buffers, so with zero-copy every network packet sent incurs a physical_from_virtual() call that involves a table lookup. Looking at the ftrace output (after applying #1551):

| function | non-zero-copy | zero-copy |
| --- | --- | --- |
| total time | 2M | 2.1M |
| mcache_alloc | 139k | 141k |
| mcache_dealloc | 161k | 184k |
| heaplock_alloc | 116k | 111k |
| heaplock_dealloc | 75k | 75k |
| low_level_output | 70k | 84k |
| objcache_alloc | 81k | 76k |
| objcache_from_object | 50k | 58k |
| objcache_dealloc | 43k | 44k |
| physical_from_virtual | 8k | 20k |
| runtime_memcpy | 33k | 17k |

The reduction in runtime_memcpy time is comparable to the increase in physical_from_virtual time, and on top of that there is a significant increase in the time taken by the mcache and objcache alloc/dealloc functions, which stems from the fact that sending socket data with zero-copy requires calling pbuf_alloc() (allocating from lwip_heap) twice as many times as the conventional approach. There might be other factors at play that cannot easily be seen with ftrace: with ftrace enabled, even zero-copy on the loopback interface is slightly slower than non-zero-copy, whereas without ftrace zero-copy on the loopback interface is around 10% faster. In any case, on the virtio-net interface zero-copy shows a slight performance degradation (something like 3-5%) both with and without ftrace.
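To make the physical_from_virtual() cost concrete, the transmit path of a nic driver has to do roughly the following for a chained pbuf (fill_tx_descriptor() is hypothetical and stands in for the driver's descriptor setup; physical_from_virtual() is the routine referred to above):

```c
#include "lwip/pbuf.h"

/* Hypothetical per-packet work in the driver: one address translation per
 * chain element. For kernel buffers this is now almost free; for a zero-copy
 * user buffer it means a table lookup on every packet sent. */
static void enqueue_pbuf_chain(struct pbuf *p)
{
    for (struct pbuf *q = p; q != NULL; q = q->next) {
        u64 phys = physical_from_virtual(q->payload);
        fill_tx_descriptor(phys, q->len);   /* hypothetical descriptor setup */
    }
}
```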