google / gvisor

Application Kernel for Containers
https://gvisor.dev
Apache License 2.0

Netstack silly-window-syndrome mitigation is too restrictive #1398

Closed majek closed 4 years ago

majek commented 4 years ago

Good morning!

I'm running a tuntap interface inside the runsc-sandbox namespace, and slirp4netns (a userspace network stack) on the host side. All networking therefore goes from the sandbox, over the tuntap interface and slirp4netns, to the host network.

Setting this up is not that hard; take a look at the prepared OCI-gvisor reproducer script: https://gist.github.com/majek/bede99c29bd49060a20128ccf720af21

This setup doesn't work well in --network=sandbox mode; in --network=host mode, though, it's totally fine. For testing I'm running a simple HTTP server on the host side, on the host loopback.

In --network=host mode, downloading a large asset works fine:

[*] running curl
Serving HTTP on 0.0.0.0 port 8080 ...
127.0.0.1 - - [16/Dec/2019 09:26:24] "GET /file.bin HTTP/1.1" 200 -
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1024M  100 1024M    0     0  84.3M      0  0:00:12  0:00:12 --:--:-- 98.3M

Notice that curl with --network=host, going over slirp4netns, reports ~100MiB/s. Good.

Now, when we switch to netstack with --network=sandbox, the performance drops off a cliff:

[*] running curl
127.0.0.1 - - [16/Dec/2019 09:26:57] "GET /file.bin HTTP/1.1" 200 -
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 16.0M  100 16.0M    0     0  1534k      0  0:00:10  0:00:10 --:--:--  334k

The speed drops to a meager ~330KiB/s, and the sandbox is consuming no CPU. I suspect a congestion-control issue, or maybe just a transmission-timeout bug in netstack.

When debugging this, I tried:

Passing "--keepalive-time 1" to curl does somehow alleviate the problem - I get ~5MiB/s rate.

I suspect that, for some reason, netstack gets stuck when talking to the slirp implementation. As far as I can tell, slirp is behaving well, politely waiting for new packets with epoll.

Netstack, though, I think gets stuck in the particular case of high-speed connectivity with slirp. This seems to be confirmed by the fact that slowing slirp down (by stracing it) makes the problem disappear. Cheers to the heisenbugs.

Please let me know how to debug this issue, and how to handle netstack performance / transmission problems.

hbhasker commented 4 years ago

Thanks for the report. I will take a look today. By the way, PacketMMap mode is buggy and should not be used. Do you have a clear set of steps to reproduce?

hbhasker commented 4 years ago

Ah, never mind, I see the script. Will try it out.

majek commented 4 years ago

OK, I have some progress. The finger is now pointed at slirp. This is tcpdump (the real one) on the tuntap interface in the gvisor-sandbox namespace:

$ nsenter -n -t `pidof runsc-sandbox` tcpdump -ni eth0 -B 4096 -w x.pcap
$ tcpdump -n -r x.pcap  -ttt
...
 00:00:00.000047 IP 10.0.2.2.8080 > 10.0.2.100.38203: Flags [P.], seq 15057103:15057106, ack 86, win 65535, length 3: HTTP
 00:00:00.000144 IP 10.0.2.100.38203 > 10.0.2.2.8080: Flags [.], ack 15057106, win 65535, length 0
 00:00:00.000288 IP 10.0.2.2.8080 > 10.0.2.100.38203: Flags [.], seq 15057106:15089874, ack 86, win 65535, length 32768: HTTP
 00:00:00.000052 IP 10.0.2.2.8080 > 10.0.2.100.38203: Flags [.], seq 15089874:15122641, ack 86, win 65535, length 32767: HTTP
 00:00:00.000136 IP 10.0.2.100.38203 > 10.0.2.2.8080: Flags [.], ack 15122641, win 65535, length 0
 00:00:00.000054 IP 10.0.2.2.8080 > 10.0.2.100.38203: Flags [.], seq 15122641:15155409, ack 86, win 65535, length 32768: HTTP
 00:00:00.000051 IP 10.0.2.2.8080 > 10.0.2.100.38203: Flags [.], seq 15155409:15188176, ack 86, win 65535, length 32767: HTTP
 00:00:00.000141 IP 10.0.2.100.38203 > 10.0.2.2.8080: Flags [.], ack 15188176, win 65535, length 0
 00:00:00.000051 IP 10.0.2.2.8080 > 10.0.2.100.38203: Flags [.], seq 15188176:15220944, ack 86, win 65535, length 32768: HTTP
 00:00:00.000049 IP 10.0.2.2.8080 > 10.0.2.100.38203: Flags [.], seq 15220944:15253711, ack 86, win 65535, length 32767: HTTP
 00:00:00.000129 IP 10.0.2.100.38203 > 10.0.2.2.8080: Flags [.], ack 15253711, win 8268, length 0
 00:00:04.995986 IP 10.0.2.2.8080 > 10.0.2.100.38203: Flags [.], seq 15253711:15261979, ack 86, win 65535, length 8268: HTTP
 00:00:00.000258 IP 10.0.2.100.38203 > 10.0.2.2.8080: Flags [.], ack 15261979, win 65535, length 0
 00:00:00.000085 IP 10.0.2.2.8080 > 10.0.2.100.38203: Flags [.], seq 15261979:15294747, ack 86, win 65535, length 32768: HTTP
 00:00:00.000092 IP 10.0.2.2.8080 > 10.0.2.100.38203: Flags [.], seq 15294747:15327514, ack 86, win 65535, length 32767: HTTP
 00:00:00.000203 IP 10.0.2.100.38203 > 10.0.2.2.8080: Flags [.], ack 15327514, win 65535, length 0
 00:00:00.000100 IP 10.0.2.2.8080 > 10.0.2.100.38203: Flags [.], seq 15327514:15360282, ack 86, win 65535, length 32768: HTTP
 00:00:00.000083 IP 10.0.2.2.8080 > 10.0.2.100.38203: Flags [.], seq 15360282:15393049, ack 86, win 65535, length 32767: HTTP
 00:00:00.000188 IP 10.0.2.100.38203 > 10.0.2.2.8080: Flags [.], ack 15393049, win 65535, length 0
 00:00:00.000322 IP 10.0.2.2.8080 > 10.0.2.100.38203: Flags [.], seq 15393049:15425817, ack 86, win 65535, length 32768: HTTP
 00:00:00.000078 IP 10.0.2.2.8080 > 10.0.2.100.38203: Flags [.], seq 15425817:15458584, ack 86, win 65535, length 32767: HTTP
 00:00:00.000193 IP 10.0.2.100.38203 > 10.0.2.2.8080: Flags [.], ack 15458584, win 65535, length 0
 00:00:00.000081 IP 10.0.2.2.8080 > 10.0.2.100.38203: Flags [.], seq 15458584:15491352, ack 86, win 65535, length 32768: HTTP
 00:00:00.000073 IP 10.0.2.2.8080 > 10.0.2.100.38203: Flags [P.], seq 15491352:15515858, ack 86, win 65535, length 24506: HTTP

Notice the 4.99-second (~5s) delay after the ACK before the next packet from port 8080.

majek commented 4 years ago

OK, here's what I think is happening:

  1. Slirp is transmitting data and getting the relevant ACKs.
  2. At some point the receive window shrinks to a small, unaligned value, like ~3 bytes.
  3. Slirp's silly-window avoidance decides not to send a 3-byte packet. Why should it? It would be a waste. It correctly waits for a window of more than one MTU.
  4. Netstack sees that the window it advertised was non-zero (3 bytes), so it never sends an ACK announcing a bigger window.
  5. Slirp times out after 5s and sends the meager 3-byte packet anyway.
  6. The connection uncorks.

Regarding (3): this if statement decides not to "send": https://github.com/rootless-containers/slirp4netns/blob/master/vendor/libslirp/src/tcp_output.c#L165-L177
Regarding (5): the force flag is set after the 5s timeout: https://github.com/rootless-containers/slirp4netns/blob/master/vendor/libslirp/src/tcp_output.c#L171-L172
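
For reference, a rough Go paraphrase of that BSD-style sender-side silly-window check (my own sketch, with illustrative names, not libslirp's actual code): the sender holds data back unless it can fill a whole segment, flush its send buffer, or is forced by a timer.

    // shouldSend paraphrases the BSD-style sender-side silly-window
    // avoidance from tcp_output.c; all names are illustrative.
    func shouldSend(length, offset, sendBufLen, mss, window, maxSendWnd int,
        idle, noDelay, force, retransmitting bool) bool {
        if length == 0 {
            return false
        }
        if length == mss {
            return true // a full-sized segment is always worth sending
        }
        if (idle || noDelay) && offset+length >= sendBufLen {
            return true // we can flush everything we have buffered
        }
        if force {
            // The persist timer fired; this is the ~5s timeout visible in
            // the trace, which finally pushes the tiny segment out.
            return true
        }
        if window >= maxSendWnd/2 {
            return true // window is at least half the largest ever advertised
        }
        if retransmitting {
            return true
        }
        // Otherwise wait for a bigger window: a 3-byte window never passes
        // any of the tests above until force is set.
        return false
    }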

I think netstack suffers from the classic silly window syndrome problem: https://en.wikipedia.org/wiki/Silly_window_syndrome

Let me rephrase: netstack's silly-window-syndrome protection only kicks in when the window size is exactly 0, which is too restrictive. Netstack should be able to send ACKs announcing a larger window even when the window is not zero. It should send a window-update ACK to the sender when the window grows to something like X*MSS; this doesn't seem to happen now. Netstack seems to defer these ACKs until the window is completely exhausted.
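
The receiver-side counterpart being asked for here is essentially the RFC 1122 (section 4.2.3.3) rule: after advertising a small window, only announce an update once the usable space has grown by at least min(MSS, half the receive buffer). A minimal sketch of that rule (my own illustration, not netstack code):

    // windowUpdateDue reports whether a window-update ACK is worth sending,
    // per the RFC 1122 receiver-side silly-window avoidance rule. available
    // is the space we could advertise now; advertised is what the peer
    // currently believes the window to be.
    func windowUpdateDue(available, advertised, mss, rcvBufSize int) bool {
        threshold := mss
        if half := rcvBufSize / 2; half < threshold {
            threshold = half
        }
        return available-advertised >= threshold
    }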

hbhasker commented 4 years ago

Thanks for the detailed analysis. I believe we look at the scaled window being zero, that is, the actual window size >> wndScale; unless window scaling is disabled, that should be at least a few KB. That said, I believe we do not implement silly window syndrome avoidance fully and could make it better. I will revisit the zero-window update logic, or, if you have cycles, we would appreciate a pull request.
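
To make the scaled-window point concrete, here is a toy version of the check being discussed (mine, not netstack's actual code). With slirp the negotiated scale is 0, so this only fires when the buffer is completely full:

    // scaledWindowIsZero mirrors the shape of the current check: "zero
    // window" means the free receive-buffer space, shifted down by the
    // window scale, is 0.
    func scaledWindowIsZero(rcvBufSize, rcvBufUsed int, scale uint8) bool {
        return (rcvBufSize-rcvBufUsed)>>scale == 0
    }

    // scaledWindowIsZero(65535, 65532, 0) == false: a 3-byte window with
    // scaling disabled still counts as non-zero.
    // scaledWindowIsZero(1<<30, (1<<30)-20000, 14) == false: ~20 KB free at
    // scale 14 is also non-zero, yet well below a 65 KB MSS.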

hbhasker commented 4 years ago

I believe the part we are missing is the case where the window is much smaller than the MSS but its scaled value is not zero. Could you test something: set so_rcvbuf really high, like 16MB, and see if that gets rid of the stall? I think our non-zero window update logic does not take the MSS into account.
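
If it helps, a throwaway Go client along these lines could exercise that inside the sandbox (my own sketch; the server address is the one from the trace above). The buffer is raised before the handshake, since the window scale is negotiated in the SYN:

    package main

    import (
        "io"
        "log"
        "net"
        "syscall"
    )

    func main() {
        // Bump SO_RCVBUF to 16 MiB before connecting, so the window scale is
        // negotiated against the larger buffer.
        d := net.Dialer{
            Control: func(network, address string, c syscall.RawConn) error {
                var serr error
                if err := c.Control(func(fd uintptr) {
                    serr = syscall.SetsockoptInt(int(fd), syscall.SOL_SOCKET,
                        syscall.SO_RCVBUF, 16<<20)
                }); err != nil {
                    return err
                }
                return serr
            },
        }
        conn, err := d.Dial("tcp", "10.0.2.2:8080")
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()
        // Fetch the large asset and drain it as fast as possible.
        if _, err := conn.Write([]byte("GET /file.bin HTTP/1.0\r\nHost: 10.0.2.2\r\n\r\n")); err != nil {
            log.Fatal(err)
        }
        n, _ := io.Copy(io.Discard, conn)
        log.Printf("read %d bytes", n)
    }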

hbhasker commented 4 years ago

Also, looking at the slirp4netns code, I believe it's using a copy of the FreeBSD stack. I think we have seen similar behaviour when the other endpoint is a macOS device. It doesn't happen when the other end is Linux, because I think Linux has some workarounds for smaller-than-MSS windows which FreeBSD lacks.

hbhasker commented 4 years ago

Actually, increasing so_rcvbuf alone might not work, as the maximum window scale is 14 and the MTU in use is 65KB. You may need to lower the MTU as well as increase so_rcvbuf.
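
Back-of-the-envelope numbers for that (my arithmetic, assuming a large tuntap MTU): at the maximum scale of 14, the scaled-zero check only trips once free space drops below 2^14 = 16384 bytes, while a ~65KB MTU gives an MSS of roughly 65480 bytes, so any free space in between is advertised as a non-zero but sub-MSS window no matter how big so_rcvbuf is. Lowering the MTU below 16KB closes that gap.

    const (
        maxWndScale   = 14
        zeroThreshold = 1 << maxWndScale // 16384 bytes: below this, "zero window" at scale 14
        tunMTU        = 65520            // assumed large tuntap MTU
        mss           = tunMTU - 40      // 65480 bytes with plain IPv4+TCP headers
        // Free space in [zeroThreshold, mss) is a non-zero but sub-MSS window.
    )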

majek commented 4 years ago

(For the record, the "server" in this test is slirp, which does not support window scaling, so we are limited to a maximum window of 65535.)

hbhasker commented 4 years ago

I think that explains the 3-byte window. I believe the fix would be to make this respect the MSS and consider anything smaller than the MSS a zero window: https://github.com/google/gvisor/blob/6fc9f0aefd89ce42ef2c38ea7853f9ba7c4bee04/pkg/tcpip/transport/tcp/endpoint.go#L1140 and probably this as well: https://github.com/google/gvisor/blob/6fc9f0aefd89ce42ef2c38ea7853f9ba7c4bee04/pkg/tcpip/transport/tcp/rcv.go#L100

majek commented 4 years ago

Just an update: I tried running netstack-to-netstack (via tuntap internally, but that shouldn't matter), and it seems the same bug happens, i.e. the connection gets stuck. (I failed to reproduce this reliably; I will report back with a test case if I succeed.)

majek commented 4 years ago

Here is a patch that works for me:

diff --git a/pkg/tcpip/transport/tcp/endpoint.go b/pkg/tcpip/transport/tcp/endpoint.go
index fe629aa4..8c956f57 100644
--- a/pkg/tcpip/transport/tcp/endpoint.go
+++ b/pkg/tcpip/transport/tcp/endpoint.go
@@ -1142,7 +1142,7 @@ func (e *endpoint) zeroReceiveWindow(scale uint8) bool {
                return true
        }

-       return ((e.rcvBufSize - e.rcvBufUsed) >> scale) == 0
+       return ((e.rcvBufSize - e.rcvBufUsed) >> scale) == 0 || (e.rcvBufSize - e.rcvBufUsed) < int(e.route.MTU())
 }

 // SetSockOptInt sets a socket option.
@@ -2229,7 +2229,7 @@ func (e *endpoint) readyToRead(s *segment) {
                // we set the zero window before we deliver the segment to ensure
                // that a subsequent read of the segment will correctly trigger
                // a non-zero notification.
-               if avail := e.receiveBufferAvailableLocked(); avail>>e.rcv.rcvWndScale == 0 {
+               if avail := e.receiveBufferAvailableLocked(); avail>>e.rcv.rcvWndScale == 0 || avail < int(e.route.MTU()) {
                        e.stats.ReceiveErrors.ZeroRcvWindowState.Increment()
                        e.zeroWindow = true
                }
diff --git a/pkg/tcpip/transport/tcp/rcv.go b/pkg/tcpip/transport/tcp/rcv.go
index 0a553495..05c8488f 100644
--- a/pkg/tcpip/transport/tcp/rcv.go
+++ b/pkg/tcpip/transport/tcp/rcv.go
@@ -98,12 +98,6 @@ func (r *receiver) getSendParams() (rcvNxt seqnum.Value, rcvWnd seqnum.Size) {
 // in such cases we may need to send an ack to indicate to our peer that it can
 // resume sending data.
 func (r *receiver) nonZeroWindow() {
-       if (r.rcvAcc-r.rcvNxt)>>r.rcvWndScale != 0 {
-               // We never got around to announcing a zero window size, so we
-               // don't need to immediately announce a nonzero one.
-               return
-       }
-
        // Immediately send an ack.
        r.ep.snd.sendAck()
 }

So, three changes, in reverse order: (1) On the notifyNonZeroReceiveWindow flag, unconditionally send the ACK in nonZeroWindow(). I can't imagine a case where this would not be wanted; I don't understand how notifyNonZeroReceiveWindow can be set when the window is not zero. The check in nonZeroWindow() looks spurious, or I don't get it.

(2) In readyToRead(), treat receive buffer space smaller than the MTU as a zero window.

(3) In zeroReceiveWindow(), treat receive buffer space smaller than the MTU as a zero window.

Food for thought: is it possible to have a receive buffer smaller than the MTU? Is it possible for the path MTU to change while we are in the zero-window state? And how should this be tested?
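
One way to unit-test the predicate itself might be a table-driven test against an extracted helper (a hypothetical sketch; effectivelyZeroWindow is my own extraction of the condition the patch adds, not an existing netstack function):

    package tcp_test

    import "testing"

    // effectivelyZeroWindow: treat a window smaller than one MTU as zero.
    func effectivelyZeroWindow(free int, scale uint8, mtu uint32) bool {
        return (free>>scale) == 0 || free < int(mtu)
    }

    func TestEffectivelyZeroWindow(t *testing.T) {
        tests := []struct {
            free  int
            scale uint8
            mtu   uint32
            want  bool
        }{
            {0, 0, 1500, true},       // literally zero window
            {3, 0, 65520, true},      // the 3-byte window from the analysis above
            {8268, 0, 65520, true},   // non-zero but still below one MTU
            {70000, 0, 65520, false}, // at least one full MTU available
        }
        for _, tc := range tests {
            if got := effectivelyZeroWindow(tc.free, tc.scale, tc.mtu); got != tc.want {
                t.Errorf("effectivelyZeroWindow(%d, %d, %d) = %v, want %v",
                    tc.free, tc.scale, tc.mtu, got, tc.want)
            }
        }
    }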

tamird commented 4 years ago

Fixed by #1534.