Thanks for the report. I will take a look today. Btw, PacketMMap mode is buggy and should not be used. Do you have a clear set of steps for a repro?
Ah, never mind, I see the script. Will try it out.
Ok, I have some progress. The finger now points at slirp. This is tcpdump (the real one) on the tuntap in the gvisor-sandbox namespace:
$ nsenter -n -t `pidof runsc-sandbox` tcpdump -ni eth0 -B 4096 -w x.pcap
$ tcpdump -n -r x.pcap -ttt
...
00:00:00.000047 IP 10.0.2.2.8080 > 10.0.2.100.38203: Flags [P.], seq 15057103:15057106, ack 86, win 65535, length 3: HTTP
00:00:00.000144 IP 10.0.2.100.38203 > 10.0.2.2.8080: Flags [.], ack 15057106, win 65535, length 0
00:00:00.000288 IP 10.0.2.2.8080 > 10.0.2.100.38203: Flags [.], seq 15057106:15089874, ack 86, win 65535, length 32768: HTTP
00:00:00.000052 IP 10.0.2.2.8080 > 10.0.2.100.38203: Flags [.], seq 15089874:15122641, ack 86, win 65535, length 32767: HTTP
00:00:00.000136 IP 10.0.2.100.38203 > 10.0.2.2.8080: Flags [.], ack 15122641, win 65535, length 0
00:00:00.000054 IP 10.0.2.2.8080 > 10.0.2.100.38203: Flags [.], seq 15122641:15155409, ack 86, win 65535, length 32768: HTTP
00:00:00.000051 IP 10.0.2.2.8080 > 10.0.2.100.38203: Flags [.], seq 15155409:15188176, ack 86, win 65535, length 32767: HTTP
00:00:00.000141 IP 10.0.2.100.38203 > 10.0.2.2.8080: Flags [.], ack 15188176, win 65535, length 0
00:00:00.000051 IP 10.0.2.2.8080 > 10.0.2.100.38203: Flags [.], seq 15188176:15220944, ack 86, win 65535, length 32768: HTTP
00:00:00.000049 IP 10.0.2.2.8080 > 10.0.2.100.38203: Flags [.], seq 15220944:15253711, ack 86, win 65535, length 32767: HTTP
00:00:00.000129 IP 10.0.2.100.38203 > 10.0.2.2.8080: Flags [.], ack 15253711, win 8268, length 0
00:00:04.995986 IP 10.0.2.2.8080 > 10.0.2.100.38203: Flags [.], seq 15253711:15261979, ack 86, win 65535, length 8268: HTTP
00:00:00.000258 IP 10.0.2.100.38203 > 10.0.2.2.8080: Flags [.], ack 15261979, win 65535, length 0
00:00:00.000085 IP 10.0.2.2.8080 > 10.0.2.100.38203: Flags [.], seq 15261979:15294747, ack 86, win 65535, length 32768: HTTP
00:00:00.000092 IP 10.0.2.2.8080 > 10.0.2.100.38203: Flags [.], seq 15294747:15327514, ack 86, win 65535, length 32767: HTTP
00:00:00.000203 IP 10.0.2.100.38203 > 10.0.2.2.8080: Flags [.], ack 15327514, win 65535, length 0
00:00:00.000100 IP 10.0.2.2.8080 > 10.0.2.100.38203: Flags [.], seq 15327514:15360282, ack 86, win 65535, length 32768: HTTP
00:00:00.000083 IP 10.0.2.2.8080 > 10.0.2.100.38203: Flags [.], seq 15360282:15393049, ack 86, win 65535, length 32767: HTTP
00:00:00.000188 IP 10.0.2.100.38203 > 10.0.2.2.8080: Flags [.], ack 15393049, win 65535, length 0
00:00:00.000322 IP 10.0.2.2.8080 > 10.0.2.100.38203: Flags [.], seq 15393049:15425817, ack 86, win 65535, length 32768: HTTP
00:00:00.000078 IP 10.0.2.2.8080 > 10.0.2.100.38203: Flags [.], seq 15425817:15458584, ack 86, win 65535, length 32767: HTTP
00:00:00.000193 IP 10.0.2.100.38203 > 10.0.2.2.8080: Flags [.], ack 15458584, win 65535, length 0
00:00:00.000081 IP 10.0.2.2.8080 > 10.0.2.100.38203: Flags [.], seq 15458584:15491352, ack 86, win 65535, length 32768: HTTP
00:00:00.000073 IP 10.0.2.2.8080 > 10.0.2.100.38203: Flags [P.], seq 15491352:15515858, ack 86, win 65535, length 24506: HTTP
Notice the 4.99-second (~5s) delay after the ACK before the next packet from port 8080.
Ok, here's what's happening, in my opinion:
On 3., this if is not going to "send":
https://github.com/rootless-containers/slirp4netns/blob/master/vendor/libslirp/src/tcp_output.c#L165-L177
On 5., the force flag is set after the 5s timeout:
https://github.com/rootless-containers/slirp4netns/blob/master/vendor/libslirp/src/tcp_output.c#L171-L172
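For illustration, here is a rough Go-flavored paraphrase of that send decision (my reading of the BSD-derived C linked above, with simplified names and several conditions omitted; not the actual libslirp source):

package main

import "fmt"

// shouldSendNow paraphrases the send decision from the linked tcp_output.c
// lines. Names are simplified and several conditions are omitted.
func shouldSendNow(length, maxSeg, usableWindow, maxSndWnd int, force bool) bool {
	switch {
	case length >= maxSeg:
		return true // a full-sized segment fits in the offered window
	case force:
		return true // the persist timer fired and forced out a probe
	case usableWindow >= maxSndWnd/2:
		return true // the window is at least half the largest ever offered
	default:
		return false // otherwise sit and wait for a better window
	}
}

func main() {
	// With the 8268-byte window from the capture and ~32 KB segments, none of
	// the conditions hold, so slirp waits ~5s for the persist timer (force).
	fmt.Println(shouldSendNow(8268, 32768, 8268, 65535, false)) // false: stall
	fmt.Println(shouldSendNow(8268, 32768, 8268, 65535, true))  // true: forced send
}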
I think that netstack suffers from the classical silly window syndrome problem. https://en.wikipedia.org/wiki/Silly_window_syndrome
Let me rephrase: netstack's silly-window-syndrome protection only kicks in when the window size is exactly 0, which is too restrictive. Netstack should be able to send ACKs to announce a larger sliding window even when the window is not zero. It should send an ACK to the sender when the window grows by something like X*MSS. This doesn't seem to happen now; netstack seems to hold back window-update ACKs until the window is exhausted.
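Something along the lines of RFC 1122 §4.2.3.3 receiver-side avoidance, sketched below with made-up names (this is not netstack code, just the behaviour I'm describing):

package main

import "fmt"

// receiverWindow is a hypothetical, simplified model of receiver-side
// silly-window-syndrome avoidance: only announce a bigger window once it has
// grown by at least min(MSS, rcvBufSize/2), instead of waiting for zero.
type receiverWindow struct {
	rcvBufSize    int // total receive buffer size
	rcvBufUsed    int // bytes queued but not yet read by the application
	advertisedWnd int // window most recently advertised to the peer
	mss           int
}

// onRead is called after the application drains the receive buffer; it
// reports whether an immediate window-update ACK should go out.
func (w *receiverWindow) onRead() bool {
	free := w.rcvBufSize - w.rcvBufUsed
	threshold := w.mss
	if half := w.rcvBufSize / 2; half < threshold {
		threshold = half
	}
	if free-w.advertisedWnd >= threshold {
		w.advertisedWnd = free
		return true // announce the larger window right away
	}
	return false // suppress a tiny, not-yet-useful window update
}

func main() {
	// Buffer fully drained after having advertised only 8268 bytes.
	w := &receiverWindow{rcvBufSize: 65535, advertisedWnd: 8268, mss: 32768}
	fmt.Println(w.onRead()) // true: the window grew by more than the threshold
}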
Thanks for the detailed analysis. I believe we look at the scaled window being zero, that is, the actual window size >> wndScale, and unless window scaling is disabled that should be at least a few KB. That said, I believe we do not implement silly window syndrome avoidance fully and could make it better. I will revisit the zero-window update logic, or if you have cycles we would appreciate a pull request.
I believe the part we are missing is the case where the scaled window is much smaller than the MSS but not zero. Could you test something: just set so_rcvbuf really high, like 16 MB or so, and see if that gets rid of the stall. I think our non-zero window update logic does not take the MSS into account.
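One way to try that (a minimal sketch, assuming a small Go test client in place of curl; 10.0.2.2:8080 is the test server from the capture above, and netstack may clamp the value to its configured maximum):

package main

import (
	"log"
	"net"
	"syscall"
)

func main() {
	d := net.Dialer{
		// Bump SO_RCVBUF before the connect so the window scale is negotiated
		// against the larger buffer.
		Control: func(network, address string, c syscall.RawConn) error {
			var serr error
			if err := c.Control(func(fd uintptr) {
				serr = syscall.SetsockoptInt(int(fd), syscall.SOL_SOCKET, syscall.SO_RCVBUF, 16<<20)
			}); err != nil {
				return err
			}
			return serr
		},
	}
	conn, err := d.Dial("tcp", "10.0.2.2:8080") // the test server from the capture
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	// ...send the HTTP request and measure throughput as with curl.
}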
Also, looking at the slirp4netns code, I believe it's using a copy of the FreeBSD stack. I think we have seen similar behaviour when the other endpoint is a macOS device. This doesn't happen when the other end is Linux, because I think Linux has some workarounds for smaller-than-MSS windows which FreeBSD lacks.
Actually, increasing so_rcvbuf might not work, as the maximum window scale is 14 and the MTU in use is 65 KB. You may need to lower the MTU and increase so_rcvbuf.
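To put numbers on that (a small worked example, assuming wndScale = 14 and an MTU of 65535 bytes as noted above): the scaled-window check only reports zero below 2^14 bytes, which is far less than one full segment.

package main

import "fmt"

const (
	wndScale = 14    // maximum window scale, as noted above
	mtu      = 65535 // MTU in use
)

// looksLikeZeroWindow mirrors the existing scaled-window check: it only
// reports zero when free space drops below 2^14 = 16384 bytes.
func looksLikeZeroWindow(free int) bool {
	return (free >> wndScale) == 0
}

func main() {
	free := 20000                          // e.g. 20 KB of receive buffer left
	fmt.Println(looksLikeZeroWindow(free)) // false: not treated as a zero window
	fmt.Println(free < mtu)                // true: yet too small for one full segment
}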
(For the record, my "server" in this test is slirp, which does not support window scaling, so we are limited to a max window of 65535.)
I think that explains the 3-byte window. I believe the fix would be to make https://github.com/google/gvisor/blob/6fc9f0aefd89ce42ef2c38ea7853f9ba7c4bee04/pkg/tcpip/transport/tcp/endpoint.go#L1140 respect the MSS and consider anything < MSS a zero window, and probably this as well: https://github.com/google/gvisor/blob/6fc9f0aefd89ce42ef2c38ea7853f9ba7c4bee04/pkg/tcpip/transport/tcp/rcv.go#L100
Just an update: I tried running netstack-to-netstack (via tuntap internally, but it shouldn't matter), and it seems the same bug happens, that is, the connection gets stuck. (I failed to reproduce this reliably; I will report back with a test case if I succeed.)
here is a patch that works for me:
diff --git a/pkg/tcpip/transport/tcp/endpoint.go b/pkg/tcpip/transport/tcp/endpoint.go
index fe629aa4..8c956f57 100644
--- a/pkg/tcpip/transport/tcp/endpoint.go
+++ b/pkg/tcpip/transport/tcp/endpoint.go
@@ -1142,7 +1142,7 @@ func (e *endpoint) zeroReceiveWindow(scale uint8) bool {
return true
}
- return ((e.rcvBufSize - e.rcvBufUsed) >> scale) == 0
+ return ((e.rcvBufSize - e.rcvBufUsed) >> scale) == 0 || (e.rcvBufSize - e.rcvBufUsed) < int(e.route.MTU())
}
// SetSockOptInt sets a socket option.
@@ -2229,7 +2229,7 @@ func (e *endpoint) readyToRead(s *segment) {
// we set the zero window before we deliver the segment to ensure
// that a subsequent read of the segment will correctly trigger
// a non-zero notification.
- if avail := e.receiveBufferAvailableLocked(); avail>>e.rcv.rcvWndScale == 0 {
+ if avail := e.receiveBufferAvailableLocked(); avail>>e.rcv.rcvWndScale == 0 || avail < int(e.route.MTU()) {
e.stats.ReceiveErrors.ZeroRcvWindowState.Increment()
e.zeroWindow = true
}
diff --git a/pkg/tcpip/transport/tcp/rcv.go b/pkg/tcpip/transport/tcp/rcv.go
index 0a553495..05c8488f 100644
--- a/pkg/tcpip/transport/tcp/rcv.go
+++ b/pkg/tcpip/transport/tcp/rcv.go
@@ -98,12 +98,6 @@ func (r *receiver) getSendParams() (rcvNxt seqnum.Value, rcvWnd seqnum.Size) {
// in such cases we may need to send an ack to indicate to our peer that it can
// resume sending data.
func (r *receiver) nonZeroWindow() {
- if (r.rcvAcc-r.rcvNxt)>>r.rcvWndScale != 0 {
- // We never got around to announcing a zero window size, so we
- // don't need to immediately announce a nonzero one.
- return
- }
-
// Immediately send an ack.
r.ep.snd.sendAck()
}
So, three changes, in reverse order:
(1) On the notifyNonZeroReceiveWindow flag, unconditionally send the ACK in nonZeroWindow(). I can't imagine a case when this would not be true; I don't understand how notifyNonZeroReceiveWindow can be set when the window is not zero... The check in nonZeroWindow() looks spurious, or I don't get it.
(2) In readyToRead, treat receive buffer space < MTU as a zero window.
(3) In zeroReceiveWindow, treat receive buffer space < MTU as a zero window.
Food for thought: is it possible to have a receive buffer size < MTU? Is it possible for the path MTU to change while we are in the zero-window state? How do we test this?
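On the first question, one possible guard (a hypothetical helper, not part of the merged fix) would be to cap the threshold at half the receive buffer, so an endpoint whose buffer is smaller than one MTU does not get stuck reporting a permanent zero window:

package main

import "fmt"

// zeroWindowThreshold is a hypothetical refinement of the "< MTU counts as
// zero" check in the patch above: cap the threshold at half the receive
// buffer so a buffer smaller than one MTU can still open the window.
func zeroWindowThreshold(mtu, rcvBufSize int) int {
	threshold := mtu
	if half := rcvBufSize / 2; half < threshold {
		threshold = half
	}
	return threshold
}

func main() {
	// e.g. a 4 KB receive buffer with a 65535-byte MTU: the threshold drops
	// to 2 KB instead of treating the whole buffer as "always zero window".
	fmt.Println(zeroWindowThreshold(65535, 4096)) // 2048
}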
Fixed by #1534.
Good morning!
I'm running a tuntap interface inside the runsc-sandbox namespace, and slirp4netns (a userspace network stack) on the host side. Therefore all networking goes through tuntap and slirp4netns.
Setting this up is not that hard; take a look at the prepared OCI-gvisor reproducer script: https://gist.github.com/majek/bede99c29bd49060a20128ccf720af21
This setup doesn't work well in --network=sandbox mode. In --network=host mode, though, it's totally fine. For testing I'm running a simple HTTP server on the host side, on the host loopback.
In --network=host mode, downloading a large asset works fine: curl, going over slirp4netns, reports ~100 MiB/s. Good.
Now, when we switch to netstack with --network=sandbox, the performance drops off a cliff, with speed reaching a meager ~330 KiB/s at best. The framework is consuming no CPU. I suspect some congestion control issue, or maybe just a transmission timeout bug in netstack.
When debugging this, I tried the following: passing "--keepalive-time 1" to curl does somehow alleviate the problem; I get a ~5 MiB/s rate.
I suspect the problem is that, for some reason, netstack gets stuck when talking to the slirp implementation. As far as I can tell, slirp is behaving well, politely waiting for new packets with epoll.
Netstack, though, seems to get stuck in this particular case of high-speed connectivity with slirp. This seems to be confirmed by the fact that slowing slirp down (by stracing it) makes the problem disappear. Cheers to the heisenbugs.
Please let me know how to debug this issue, and how to handle netstack performance / transmission problems.