google / gvisor

Application Kernel for Containers
https://gvisor.dev
Apache License 2.0

TCP connections can stall when in-flight data exceeds 25% of receive buffer #9153

Closed spikecurtis closed 1 year ago

spikecurtis commented 1 year ago

Description

When processing received segments, it appears that only 25% of the receive buffer is available for out-of-order segments.

The default receive buffer is 1 MiB, so this translates to 256 KiB for out-of-order segments. In our testing with WireGuard, the amount of "in-flight" data in the network can easily exceed 256 KiB. So, if a segment is dropped, we fill up the space allowed for out-of-order segments before a retransmit can reach the receiver, and the initial loss triggers a new, larger loss.
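To make the mismatch concrete, here is a small back-of-the-envelope calculation in Go; the 100 Mbit/s and 30 ms figures are illustrative assumptions, not measurements from our capture.

package main

import "fmt"

func main() {
	// Default receive buffer described above.
	const rcvBufSize = 1 << 20 // 1 MiB

	// Out-of-order budget as currently enforced: one quarter of the buffer.
	oooBudget := rcvBufSize >> 2 // 256 KiB

	// Illustrative bandwidth-delay product for a hypothetical path
	// (assumed numbers, not from the capture).
	bitsPerSec := 100e6 // assumed link rate, bits per second
	rttSec := 0.030     // assumed round-trip time, seconds
	inFlight := int(bitsPerSec / 8 * rttSec)

	fmt.Printf("out-of-order budget: %d bytes\n", oooBudget)
	fmt.Printf("illustrative in-flight data: %d bytes\n", inFlight)
	fmt.Printf("in-flight exceeds budget: %v\n", inFlight > oooBudget)
}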

Furthermore, the discarded segments are naturally at the end of the receive window, so the sender's pipe() or capacity calculation will never consider those segments "lost." This, coupled with the halving of ssthresh and cwnd, can cause the TCP connection to stall during or immediately after recovery: the sender does not detect that the dropped segments have left the network, so it stops sending new data.
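For context on why the sender never declares those trailing bytes lost, here is a simplified sketch of a SACK-based pipe estimate in the spirit of RFC 6675 (an illustration, not gvisor's actual implementation): bytes that are neither SACKed nor marked lost keep counting as in-flight, and with nothing SACKed above the discarded tail, that tail is never marked lost.

package sketch

// sentSegment is a hypothetical bookkeeping record for one
// unacknowledged segment on the sender side.
type sentSegment struct {
	length int
	sacked bool // covered by a SACK block from the receiver
	lost   bool // declared lost, e.g. by RFC 6675's DupThresh/IsLost rules
}

// pipe is a simplified estimate of bytes still in the network: anything
// neither SACKed nor declared lost is assumed to be in flight. (RFC 6675
// also counts retransmitted bytes; omitted here for brevity.) Segments
// the receiver silently discarded at the tail of the window are never
// SACKed and never marked lost, so they inflate this estimate
// indefinitely and cwnd blocks new transmissions.
func pipe(segments []sentSegment) int {
	inFlight := 0
	for _, s := range segments {
		if !s.sacked && !s.lost {
			inFlight += s.length
		}
	}
	return inFlight
}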

The comment above the if-test for whether to store an out-of-order segment says that 75% of the buffer should be available, but the code checks against 25% (rcvBufSize>>2). Which is right? Notably, the receiver sets its window size to a max of 50% of the receive buffer, so if the out-of-order buffer were allowed to be 75%, then the sender would stop transmitting before it puts enough data into the network to overwhelm it.
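For reference, a minimal sketch of the admission check as I read it (my paraphrase, not the actual gvisor code), assuming a counter of pending out-of-order bytes:

package sketch

// canQueueOutOfOrder paraphrases the check described above: the comment
// promises 75% of the receive buffer for out-of-order data, but the test
// only admits segments while pending out-of-order bytes stay below
// rcvBufSize>>2, i.e. 25%.
func canQueueOutOfOrder(pendingBufUsed, rcvBufSize int) bool {
	return rcvBufSize > 0 && pendingBufUsed < rcvBufSize>>2 // 25% cap
	// A 75% budget, as the comment suggests, would instead look like:
	//   pendingBufUsed < rcvBufSize-(rcvBufSize>>2)
}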

Even the RTO doesn't un-stall the stack when multiple packets are lost, because the sender still considers the dropped out-of-order segments to be in the network. Depending on how much data the receiver is able to ACK after the RTO, the sender may still be stalled by cwnd until multiple RTOs send enough data for the receiver's ACKs to bring the outstanding data back below the cwnd.

Also attached is a pcap that shows the behavior: send.tgz

You'll need ts-dissector.tgz to view it in Wireshark, e.g.

wireshark -X lua_script:ts-dissector.lua send.pcap

(We captured via Tailscale.)

In the PCAP, you'll see an initial burst of Dup ACKs from the packets that were in-flight in the network, and eventually the SACK Right Edge (SRE) stops increasing (around packet 25002). This is when the receiver stops storing the segments, though it does continue to send Dup ACKs.

Then, the sender does a fast retransmit (packet 25101), but can't proceed even after it is ACK'd (packet 25255), and has to wait for the RTO. It then gets stuck sending one packet per RTO (200 ms) for 20 seconds until it patches enough of the hole.

Steps to reproduce

Reproduction steps are here, from another issue that we mistakenly believed was related.

runsc version

n/a - reproducible in pure Go

docker version (if using docker)

n/a

uname

Linux dogfood2 5.19.17-051917-generic #202210240939 SMP PREEMPT_DYNAMIC Mon Oct 24 09:43:01 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

kubectl (if using Kubernetes)

n/a

repo state (if built from source)

No response

runsc debug logs (if available)

No response

EtiennePerot commented 1 year ago

/cc @kevinGC

kevinGC commented 1 year ago

Thanks for the super detailed report. Increasing the usable buffer size for out-of-order packets seems like the right choice.

It's not obvious to me why the 25% limit is imposed, or for that matter why there's any special limitation. It seems like we should be respecting the window size just as we would in any other case. I'll make sure there wasn't some reason I'm missing before we change the limit.
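For illustration only, a sketch of what that direction could look like (not the actual change): bound pending out-of-order data by the receive buffer the advertised window is derived from, rather than by a fixed 25% fraction.

package sketch

// Illustrative only: admit an out-of-order segment as long as it fits in
// the receive buffer, instead of capping out-of-order data at 25%.
func canQueueOutOfOrder(pendingBufUsed, segLen, rcvBufSize int) bool {
	return rcvBufSize > 0 && pendingBufUsed+segLen <= rcvBufSize
}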

kevinGC commented 1 year ago

This should be fixed now -- please let us know if you still see it happening.