google / gvisor

Application Kernel for Containers
https://gvisor.dev
Apache License 2.0

runsc: PacketMMAP dispatch mode is buggy. #210

Open hbhasker opened 5 years ago

hbhasker commented 5 years ago

TPACKET_V1/V2 PACKET_RX|TX_RING buffers have a kernel bug where multiple kernel threads can incorrectly assume control of a ring buffer entry, leaving the ring buffer out of sync with user space. This seems to manifest in our dispatcher as getting stuck in a poll loop: poll returns readable, but the frame at our ring offset does not have the user bit set.

This completely stalls the network. It seems to be easier to trigger when there are lots of connections to an external destination streaming large amounts of data in.
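For illustration, here is a minimal sketch (in Go, against golang.org/x/sys/unix) of the kind of RX-ring poll loop where this shows up; the names (rxLoop, tpStatusUser) and the 64-bit little-endian layout assumption are mine, not the actual gVisor dispatcher:

```go
package ringdemo

import (
	"encoding/binary"
	"fmt"

	"golang.org/x/sys/unix"
)

// tpStatusUser mirrors TP_STATUS_USER from <linux/if_packet.h>: the kernel
// sets this bit in tpacket_hdr.tp_status when it hands a frame to user space.
const tpStatusUser = 1

// rxLoop walks a mapped PACKET_RX_RING of numFrames frames of frameSize
// bytes each. tp_status is the first field of tpacket_hdr (an unsigned
// long, read here as a 64-bit little-endian value).
func rxLoop(fd int, ring []byte, frameSize, numFrames int) error {
	offset := 0
	for {
		status := binary.LittleEndian.Uint64(ring[offset*frameSize:])
		if status&tpStatusUser == 0 {
			// No frame owned by user space at this slot; wait for the kernel.
			pfd := []unix.PollFd{{Fd: int32(fd), Events: unix.POLLIN}}
			if _, err := unix.Poll(pfd, -1); err != nil {
				return err
			}
			// With the TPACKET_V1/V2 bug, poll can keep reporting readable
			// while this slot's user bit never gets set: the ring and user
			// space are out of sync and the loop spins here forever.
			continue
		}
		fmt.Printf("frame at slot %d is ready\n", offset)
		// ... process the frame, then hand the slot back to the kernel by
		// clearing tp_status (TP_STATUS_KERNEL == 0).
		binary.LittleEndian.PutUint64(ring[offset*frameSize:], 0)
		offset = (offset + 1) % numFrames
	}
}
```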

The bug is described in this netdev thread:

https://www.mail-archive.com/netdev@vger.kernel.org/msg237222.html

This is not fixed in the kernel. There are ways to sidestep it by increasing the number of frames, which reduces the likelihood of hitting the bug. But from my experiments, using smaller frames is not worth it: it requires GRO to be disabled, and in a VM GRO/TSO is almost impossible to disable completely. See http://patchwork.ozlabs.org/patch/1016373/
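For reference, this is roughly where those knobs live; a hedged sketch (not the gVisor code) of a TPACKET_V2 ring setup, assuming x/sys/unix's TpacketReq helpers and sizes that satisfy the kernel's alignment rules:

```go
package ringdemo

import "golang.org/x/sys/unix"

// htons converts to network byte order for the AF_PACKET protocol argument.
func htons(v uint16) uint16 { return v<<8 | v>>8 }

// setupRXRing opens an AF_PACKET socket and maps a TPACKET_V2 RX ring.
// frameSize and frameCount are the knobs mentioned above: more frames make
// the v1/v2 ownership race less likely to bite, while smaller frames only
// help if GRO/TSO can actually be disabled. Block size must be a multiple
// of the page size and frame size a multiple of TPACKET_ALIGNMENT.
func setupRXRing(frameSize, frameCount uint32) (int, []byte, error) {
	fd, err := unix.Socket(unix.AF_PACKET, unix.SOCK_RAW, int(htons(unix.ETH_P_ALL)))
	if err != nil {
		return -1, nil, err
	}
	if err := unix.SetsockoptInt(fd, unix.SOL_PACKET, unix.PACKET_VERSION, unix.TPACKET_V2); err != nil {
		unix.Close(fd)
		return -1, nil, err
	}
	req := unix.TpacketReq{
		Block_size: frameSize * frameCount, // a single block holding every frame
		Block_nr:   1,
		Frame_size: frameSize,
		Frame_nr:   frameCount,
	}
	if err := unix.SetsockoptTpacketReq(fd, unix.SOL_PACKET, unix.PACKET_RX_RING, &req); err != nil {
		unix.Close(fd)
		return -1, nil, err
	}
	ring, err := unix.Mmap(fd, 0, int(req.Block_size*req.Block_nr),
		unix.PROT_READ|unix.PROT_WRITE, unix.MAP_SHARED)
	if err != nil {
		unix.Close(fd)
		return -1, nil, err
	}
	return fd, ring, nil
}
```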

For now we should revert to the RecvMMsg dispatch mode for runsc while we change the ring code to use TPACKET_V3 instead of V1.
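For completeness, a sketch of what the RecvMMsg dispatch amounts to: batching up to N packet reads into a single recvmmsg syscall. x/sys/unix has no recvmmsg wrapper, so this goes through the raw syscall number; the struct layout assumes linux/amd64 and the helper names are illustrative, not the real dispatcher's:

```go
package dispatch

import (
	"unsafe"

	"golang.org/x/sys/unix"
)

// mmsghdr mirrors the kernel's struct mmsghdr on linux/amd64.
type mmsghdr struct {
	hdr unix.Msghdr
	len uint32 // bytes received for this message, filled in by the kernel
	_   [4]byte
}

// recvMMsgDispatch reads up to len(bufs) packets from fd with a single
// recvmmsg call and returns how many packets arrived and their lengths.
func recvMMsgDispatch(fd int, bufs [][]byte) (int, []int, error) {
	iovs := make([]unix.Iovec, len(bufs))
	hdrs := make([]mmsghdr, len(bufs))
	for i := range bufs {
		iovs[i].Base = &bufs[i][0]
		iovs[i].SetLen(len(bufs[i]))
		hdrs[i].hdr.Iov = &iovs[i]
		hdrs[i].hdr.Iovlen = 1
	}
	n, _, errno := unix.Syscall6(unix.SYS_RECVMMSG, uintptr(fd),
		uintptr(unsafe.Pointer(&hdrs[0])), uintptr(len(hdrs)), 0, 0, 0)
	if errno != 0 {
		return 0, nil, errno
	}
	sizes := make([]int, n)
	for i := 0; i < int(n); i++ {
		sizes[i] = int(hdrs[i].len)
	}
	return int(n), sizes, nil
}
```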

hbhasker commented 5 years ago

Just an update.

I implemented a TPACKET_V3-based dispatcher. While it does not suffer from the bug in TPACKET_V1/V2 and gets pretty good throughput on a loaded network, it does suffer from latency issues when the network is not busy.

This is because of the way it works: each block is filled one at a time by the kernel, and skb->data_ready() is only invoked when the block is full or the block timeout expires. The lowest resolution for the block timeout is 1 ms, which is far too coarse for high-speed networking and introduces a large amount of delay.
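Roughly, the V3 ring configuration looks like the sketch below (illustrative, not the actual dispatcher code); the point is that tp_retire_blk_tov is expressed in milliseconds, so 1 ms is the floor:

```go
package dispatch

import "golang.org/x/sys/unix"

// configureV3Ring sets up a TPACKET_V3 RX ring on fd (an AF_PACKET socket
// already switched to TPACKET_V3 via PACKET_VERSION). The relevant field is
// Retire_blk_tov: the kernel only wakes the reader when a block fills up or
// this timeout fires.
func configureV3Ring(fd int, blockSize, blockCount, frameSize uint32) error {
	req := unix.TpacketReq3{
		Block_size:     blockSize,
		Block_nr:       blockCount,
		Frame_size:     frameSize,
		Frame_nr:       (blockSize / frameSize) * blockCount,
		Retire_blk_tov: 1, // minimum granularity: 1 millisecond
	}
	return unix.SetsockoptTpacketReq3(fd, unix.SOL_PACKET, unix.PACKET_RX_RING, &req)
}
```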

A simple ping of the host from gVisor can see latencies go from 0.4 ms to 7 ms. TPACKET_V3 is not usable unless a finer-resolution timer is added.

The only path forward for PACKET_RX_RING will be to wait for a fix to the kernel bug.

I believe in the short term we should stick with recvmmsg. I am going to downgrade the priority of this bug as we don't really have a path forward at the moment to use PACKET_RX_RING.

majek commented 1 year ago

@hbhasker I keep coming back to this.

Fix PACKET_RX_RING bug for versions TPACKET_V1 and TPACKET_V2 which causes the ring to get corrupted by allowing multiple kernel threads to claim ownership of the same ring entry

Do you happen to know when exactly this problem might happen with PACKET_MMAP v1/2? Is it on RX or TX?

hbhasker commented 1 year ago

The bug itself is fixed in the latest kernels. I never implemented the TX part, so this happens on RX. That said, @kevinGC has already implemented AF_XDP-based RX/TX and I would just use that instead of this. See https://github.com/google/gvisor/blob/master/pkg/tcpip/link/xdp/endpoint.go

majek commented 1 year ago

Interesting. I'm currently investigating vhost-net, and this might be an option as well. It allows for very fast tap device reads/writes; however, it undeniably exposes a larger host attack surface.

hbhasker commented 1 year ago

Vhost-net would be an interesting alternative to AF_XDP, as AF_XDP, if I remember correctly, can't support frames larger than 4k, so you lose host GRO/GSO support. Though @kevinGC has added GRO support to Netstack, which, combined with AF_XDP, might solve a large part of that.

Another alternative would be to use the sharedmem link endpoint. It is similar to vhost-net and allows setting up shared memory queues with another process running outside the sandbox, which can then handle packets and send them to the host or redirect them.

You would need to add support in docker/containers for a CNI plugin that creates a NIC backed by an FD connected to the external process.

Currently the server endpoint waits on an eventfd to be notified when packets are available. But if a single process were to serve many sandboxes, then maybe you could just burn a single core to spin across all sandboxes and process packets, reducing delays due to wakeup latencies etc.
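As a rough sketch of the two wakeup strategies (the queue callbacks here are hypothetical placeholders, not the sharedmem endpoint's real API):

```go
package dispatch

import "golang.org/x/sys/unix"

// waitEventfd blocks until the sandbox signals the eventfd, which is how the
// server endpoint is woken today; each wakeup costs a syscall plus scheduler
// latency.
func waitEventfd(efd int) error {
	var buf [8]byte // eventfd reads always transfer an 8-byte counter
	_, err := unix.Read(efd, buf[:])
	return err
}

// spinQueues is the burn-a-core alternative: one dedicated loop walks every
// sandbox's queue and processes whatever is ready, avoiding wakeup latency.
// hasPackets and process are hypothetical callbacks standing in for the real
// shared-memory queue operations.
func spinQueues(numSandboxes int, hasPackets func(sandbox int) bool, process func(sandbox int)) {
	for {
		for i := 0; i < numSandboxes; i++ {
			if hasPackets(i) {
				process(i)
			}
		}
	}
}
```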

kevinGC commented 1 year ago

Yes, AF_XDP is limited to 4K frames. Something to watch out for: supported frame size depends on the driver.

Note that our AF_XDP support is not "battle-hardened" like PACKET_MMAP is. If you do use it we'd definitely appreciate feedback and bug reports. Ideally it becomes default someday.

GRO can be enabled via a flag, e.g. --gvisor-gro=200000ns. It too should be on by default in time.

github-actions[bot] commented 1 year ago

A friendly reminder that this issue had no activity for 120 days.

github-actions[bot] commented 9 months ago

This issue has been closed due to lack of activity.

manninglucas commented 5 months ago

I tried my own hand at getting more speed with TPACKET_V3 and ran into a similar latency/throughput tradeoff. Even with the kernel fix @hbhasker mentioned, PACKET_MMAP v1/2 is still slower than recvmmsg: my testing shows it pulls 1-3 packets before blocking, while recvmmsg always gets 8.

github-actions[bot] commented 1 month ago

A friendly reminder that this issue had no activity for 120 days.