Open jwhited opened 11 months ago
Adding @nybidari, who knows more about RACK.
Interesting that this is Windows-only, as I wouldn't expect that to matter. Maybe something to do with timers (since RACK is time-based) is OS-dependent?
FWIW I have tested with higher resolution timing, but found no difference in the results:
err := windows.TimeBeginPeriod(1)
if err != nil {
	panic(err)
}
I just now realized that tcpip.TCPRACKStaticReoWnd and tcpip.TCPRACKNoDupTh are meant to mask on top of tcpip.TCPRACK, and they are unused anyway. So when I was using those values it was the same as no RACK. Removed that bit from the description.
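The point about masking can be illustrated with a small, self-contained sketch. The `Recovery` type and constant names below are hypothetical stand-ins, not gVisor's actual definitions; the sketch only assumes the general pattern of modifier bits layered on a base "RACK enabled" bit, and that the stack checks the base bit to decide whether RACK is on.

```go
package main

import "fmt"

// Recovery is a hypothetical stand-in for a bit-flag recovery option.
// The names and values are illustrative only, not gVisor's definitions.
type Recovery uint32

const (
	RACKLossDetection Recovery = 1 << iota // base bit: RACK is enabled
	RACKStaticReoWnd                       // modifier, meaningful only with the base bit
	RACKNoDupTh                            // modifier, meaningful only with the base bit
)

// rackEnabled reports whether the base RACK bit is set.
func rackEnabled(r Recovery) bool {
	return r&RACKLossDetection != 0
}

func main() {
	modifiersOnly := RACKStaticReoWnd | RACKNoDupTh
	withBase := RACKLossDetection | RACKStaticReoWnd

	// Modifier bits without the base bit leave RACK disabled.
	fmt.Println(rackEnabled(0), rackEnabled(modifiersOnly), rackEnabled(withBase))
	// prints: false false true
}
```

Under that assumption, passing only the modifier values behaves like passing 0, which matches the "same as no RACK" observation.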
I don't think RACK does anything different on Windows compared to other operating systems. From my understanding, RACK performance can be lower than other congestion control algorithms in these cases:
These are just my speculations; the root cause could be something else. To debug further, would it be possible to get these TCP stats with and without RACK on Windows: https://github.com/google/gvisor/blob/master/pkg/tcpip/tcpip.go#L2123-L2146 ?
30 second throughput test
Windows Server 2022 No TCP-RACK ~80Mb/s:
2023/11/30 00:57:00 Retransmits: 3299 FastRecovery: 0 SACKRecovery: 52 TLPRecovery: 0 SlowStartRetransmits: 1653 FastRetransmit: 52 Timeouts: 10
Windows Server 2022 TCP-RACK ~8Mb/s:
2023/11/30 00:59:40 Retransmits: 1430 FastRecovery: 0 SACKRecovery: 690 TLPRecovery: 0 SlowStartRetransmits: 4 FastRetransmit: 687 Timeouts: 4
Ubuntu 22.04 No TCP-RACK ~90Mb/s:
2023/11/30 01:05:31 Retransmits: 4251 FastRecovery: 0 SACKRecovery: 66 TLPRecovery: 0 SlowStartRetransmits: 2690 FastRetransmit: 66 Timeouts: 15
Ubuntu 22.04 TCP-RACK ~80Mb/s:
2023/11/30 01:03:07 Retransmits: 2220 FastRecovery: 0 SACKRecovery: 64 TLPRecovery: 0 SlowStartRetransmits: 3 FastRetransmit: 64 Timeouts: 1
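In the numbers above, enabling RACK on Windows raises SACKRecovery from 52 to 690 and FastRetransmit from 52 to 687, while on Ubuntu those counters barely move (66 vs 64). For anyone comparing further runs, a minimal sketch of a parser for these log lines might look like the following. The `parseStats` helper is hypothetical and assumes the exact "Name: value" layout shown in the results; it is not part of gVisor or tailscale.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseStats extracts "Name: value" integer counters from one log line.
// Tokens that are not followed by an integer (such as the timestamp) are skipped.
func parseStats(line string) map[string]int {
	stats := make(map[string]int)
	fields := strings.Fields(line)
	for i := 0; i+1 < len(fields); i++ {
		if !strings.HasSuffix(fields[i], ":") {
			continue
		}
		n, err := strconv.Atoi(fields[i+1])
		if err != nil {
			continue
		}
		stats[strings.TrimSuffix(fields[i], ":")] = n
	}
	return stats
}

func main() {
	// Windows Server 2022 lines copied from the results above.
	noRACK := parseStats("2023/11/30 00:57:00 Retransmits: 3299 FastRecovery: 0 SACKRecovery: 52 TLPRecovery: 0 SlowStartRetransmits: 1653 FastRetransmit: 52 Timeouts: 10")
	rack := parseStats("2023/11/30 00:59:40 Retransmits: 1430 FastRecovery: 0 SACKRecovery: 690 TLPRecovery: 0 SlowStartRetransmits: 4 FastRetransmit: 687 Timeouts: 4")

	for _, name := range []string{"Retransmits", "SACKRecovery", "SlowStartRetransmits", "FastRetransmit", "Timeouts"} {
		fmt.Printf("%-22s no RACK: %5d  RACK: %5d\n", name, noRACK[name], rack[name])
	}
}
```

Printing the counters side by side makes it easier to see which ones diverge between platforms when more runs are collected.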
A friendly reminder that this issue had no activity for 120 days.
@nybidari any findings? Any reason to believe a more recent release would improve RACK on Windows?
No findings, and no features have targeted this specifically. Wish we had more bandwidth to investigate.
A friendly reminder that this issue had no activity for 120 days.
Description
Our usage of netstack within tailscale performs poorly on Windows with the following stack settings:
TCP congestion control: reno
TCP SACK enabled: tcpip.TCPSACKEnabled(true)
TCP loss recovery: tcpip.TCPRACKLossDetection
Using Stack.AddTCPProbe() to print the congestion window (in packets) shows the window being held below 10 packets during a throughput test.
Throughput is poor (8Mb/s). Changing TCP loss recovery to 0 (no TCP-RACK) improves throughput by a factor of ~10 (8Mb/s => 80Mb/s), and the congestion window moves in a more expected fashion. The path under test is not particularly lossy.
Linux does not exhibit the same behavior/issue. This appears to be Windows-specific. Reproduced by multiple users in multiple environments across Windows 11 and Windows Server 2022.
Originally reported via https://github.com/tailscale/tailscale/issues/9707
Steps to reproduce
https://github.com/tailscale/tailscale/issues/9707#issuecomment-1752175564 describes steps to reproduce using tailscale. We have since changed loss recovery on Windows as a workaround via https://github.com/tailscale/tailscale/commit/5e861c38718ffcde3ded6d2922ca464886e41321.
Reproduced at both gVisor HEAD (4b4191b8cad1f5f1a99be76d8dae59b713e58ff5) and the version tailscale is currently using (4fe30062272c).