google / netstack

IPv4 and IPv6 userland network stack
Apache License 2.0

Netstack vs Linux network stack #25

Open Anjali05 opened 5 years ago

Anjali05 commented 5 years ago

Can someone give an overview of how netstack differs from the Linux network stack? I know it's a user-space application, but how do things like scheduling and packet transfer work in netstack, and why is performance affected so much?

hbhasker commented 5 years ago

Netstack is a user-space TCP/IP stack written in Go. When you say scheduling, what exactly do you mean? Mostly we don't do any explicit scheduling but rely on the Go runtime; that said, we do a few things to reduce scheduler interactions on the packet processing path to reduce latency.

As for performance vs Linux, it's a complex answer, but:

a) The Linux stack is more mature.
b) Linux implements many more TCP optimizations than currently exist in Netstack (that said, we are making progress on closing the gap here).
c) Netstack runs in user space and relies on the Go runtime, which can sometimes add latency/overhead in packet processing (we are working on improving this).

That said, the architecture of Netstack is a bit too big a topic to discuss on a bug. We intend to share more about Netstack in the gVisor community meeting.

As for performance, are there any specific use cases you are looking at? If you could share more about how you are comparing Netstack vs Linux, we can see if/what changes can be done to close the gap.

Anjali05 commented 5 years ago

@hbhasker I am trying to understand the netstack architecture to reason about low network performance in gVisor. By scheduling, I mean: how are packets scheduled for transmission, and how are packets actually passed to the network interface controller to go out on the physical layer? Also, does a packet at any point pass through the Linux stack, or is it dropped directly onto the physical layer, and how?

hbhasker commented 5 years ago

Netstack egresses packets through the link-layer endpoints defined in tcpip/link. A typical deployment uses an fdbased endpoint (link/fdbased/endpoint.go), where the fd supplied to the endpoint is some sort of host FD: a unix domain socket, an AF_PACKET socket, or an fd to a tap/tun device.
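
To make the wiring concrete, here is a rough sketch (mine, not from this thread) of attaching an fdbased endpoint to a stack. Package paths and option names follow a recent gVisor tree and have changed across versions, so treat the exact signatures as assumptions rather than a definitive recipe:

```go
// Sketch: wiring a host fd (e.g. a TAP device or AF_PACKET socket) into
// netstack via an fdbased link endpoint. Names follow a recent gVisor tree
// and may differ in older versions of the code.
package main

import (
	"log"

	"golang.org/x/sys/unix"
	"gvisor.dev/gvisor/pkg/tcpip/link/fdbased"
	"gvisor.dev/gvisor/pkg/tcpip/network/ipv4"
	"gvisor.dev/gvisor/pkg/tcpip/stack"
	"gvisor.dev/gvisor/pkg/tcpip/transport/tcp"
)

func main() {
	// Open a TUN/TAP fd. In practice the device also needs a TUNSETIFF ioctl
	// (and host-side addresses/routes); that setup is omitted here.
	fd, err := unix.Open("/dev/net/tun", unix.O_RDWR, 0)
	if err != nil {
		log.Fatal(err)
	}

	// The fdbased endpoint reads and writes raw frames on the supplied fd.
	linkEP, err := fdbased.New(&fdbased.Options{
		FDs:            []int{fd},
		MTU:            1500,
		EthernetHeader: true, // TAP carries Ethernet frames; false for TUN
	})
	if err != nil {
		log.Fatal(err)
	}

	// Build a stack with IPv4 + TCP and attach the link endpoint as NIC 1.
	s := stack.New(stack.Options{
		NetworkProtocols:   []stack.NetworkProtocolFactory{ipv4.NewProtocol},
		TransportProtocols: []stack.TransportProtocolFactory{tcp.NewProtocol},
	})
	if tcpipErr := s.CreateNIC(1, linkEP); tcpipErr != nil {
		log.Fatalf("CreateNIC failed: %v", tcpipErr)
	}
	log.Println("NIC 1 created; addresses and routes would be added next")
}
```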

Today netstack makes a host system call to write out packets to the provided FD (writev currently, with sendmmsg coming). That said, netstack itself does not care how packets are egressed; you are free to provide an endpoint that bypasses the Linux host stack entirely, for example one backed by AF_XDP sockets. We will eventually support that, but at the moment it's not something we have had the time to implement.
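
As an illustration of that egress path (my own sketch, not netstack code): a frame's header and payload can be handed to the kernel in a single vectored write on the host fd, which is the same writev idea described above.

```go
// Sketch of writev-based egress: gather header and payload into one vectored
// write so the whole frame leaves in a single host system call. Illustrative
// only; netstack's real egress path lives in link/fdbased.
package main

import (
	"log"

	"golang.org/x/sys/unix"
)

// writeFrame writes header+payload to fd with one writev(2) call.
func writeFrame(fd int, header, payload []byte) error {
	_, err := unix.Writev(fd, [][]byte{header, payload})
	return err
}

func main() {
	// In a real deployment fd would be the TAP/AF_PACKET/unix-socket fd given
	// to the link endpoint; a pipe stands in so the sketch runs anywhere.
	p := make([]int, 2)
	if err := unix.Pipe(p); err != nil {
		log.Fatal(err)
	}
	if err := writeFrame(p[1], []byte{0x45, 0x00}, []byte("payload")); err != nil {
		log.Fatal(err)
	}
	log.Println("frame written with a single writev call")
}
```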

Similarly, for incoming packets the link endpoint runs what we call a dispatch loop, which uses host system calls to read incoming packets (readv, recvmmsg) or a ring buffer shared with the host kernel (packet_rx_ring). The latter is not recommended yet: there is a kernel bug that can cause the ring buffer to get out of sync between the host and Netstack, which causes issues when the frame sizes used are large and the number of slots to hold packets is low.
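
A stripped-down version of that dispatch-loop idea (again a sketch, not netstack's actual dispatcher, which batches reads with recvmmsg or shares a packet_rx_ring with the kernel):

```go
// Sketch of a dispatch loop: block in readv(2) on the host fd, then hand each
// frame to the stack. Real dispatchers add batching (recvmmsg) or a shared
// ring buffer (packet_rx_ring) to amortize syscall cost.
package main

import (
	"log"

	"golang.org/x/sys/unix"
)

// dispatchLoop reads frames from fd one at a time and passes them to deliver.
func dispatchLoop(fd int, mtu int, deliver func(frame []byte)) error {
	buf := make([]byte, mtu+14) // leave room for an Ethernet header
	for {
		n, err := unix.Readv(fd, [][]byte{buf})
		if err != nil {
			return err
		}
		if n == 0 {
			return nil // EOF on the fd
		}
		deliver(buf[:n])
	}
}

func main() {
	// A pipe stands in for the TAP/AF_PACKET fd so the sketch runs anywhere.
	p := make([]int, 2)
	if err := unix.Pipe(p); err != nil {
		log.Fatal(err)
	}
	go func() {
		unix.Write(p[1], []byte("pretend this is a frame"))
		unix.Close(p[1])
	}()
	if err := dispatchLoop(p[0], 1500, func(frame []byte) {
		log.Printf("dispatched %d-byte frame", len(frame))
	}); err != nil {
		log.Fatal(err)
	}
}
```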

Netstack can absolutely work with something that writes packets directly to the physical layer and reads them back from it. For example, Fuchsia, which uses netstack, has its own Ethernet drivers that deliver incoming packets to Netstack, and read outbound packets from Netstack and write them out.

Anjali05 commented 5 years ago

@hbhasker So in the case of gVisor, I am assuming the fd is a tap/tun device and the write system call is implemented inside the sentry to minimize interaction with the host? I want to clarify how netstack is used to minimize host interaction in gVisor, since from what you mentioned there are a few system calls that still need to be made to the host.

hbhasker commented 5 years ago

In the case of gVisor, the FD is usually an AF_PACKET socket when running as a Docker runtime. The application's Write() is handled by epsocket.go in gVisor, which in turn calls endpoint.Read() in Netstack, which eventually results in packets being written out via the link endpoint using writev().

All socket calls made by the application are handled in the gVisor sentry and serviced by Netstack-backed sockets, unless --network=host is used, in which case sockets are backed by host FDs and socket calls are serviced by the host kernel.

Anjali05 commented 5 years ago

@hbhasker It would be really helpful if there were some docs about netstack, such as an architecture overview. Thanks!

ymaxgit commented 4 years ago

A userspace TCP stack is a good idea, but the underlying 'link' layer of netstack is basically too heavy, since it's based on AF_PACKET/AF_XDP (transmit side) and system calls/packet_rx_ring (receive side). Currently there's no benefit to netstack, as data is still copied between user space and kernel space, and a TCP/IP stack in userspace doesn't offer more than the Linux kernel TCP/IP stack does.

jtollet commented 4 years ago

Maybe using VPP as a backend would make sense. It would provide super fast l2/l3 connectivity to netstack.

Bubbachuck commented 4 years ago

Another option is using NFF-GO as the base framework, which provides "link" layer support.

amscanne commented 4 years ago

Isn't NFF-GO also based on AF_XDP? Containers are almost universally using some software-defined NIC, so things like DPDK don't make much sense except in the general case.