private-octopus / picoquic

Minimal implementation of the QUIC protocol
MIT License

Delay spikes caused degraded conditions on Wi-Fi networks #1481

Closed huitema closed 1 month ago

huitema commented 1 year ago

I knew that Wi-Fi was sometimes kind of bad, but I did not look closely until someone complained about the performance of QUIC over Wi-Fi. I just took a 2-minute sample of ping delays, and I get something like:

We can see that on this graph:

image

That seems to be happening in many places; colleagues are reporting similar data. I did a series of pings at 0.1-second intervals, and I see the "spike event" spanning 8 to 12 pings, so it probably lasts about a second.

image
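A rough way to pull such events out of a ping log is to group consecutive samples whose delay exceeds a threshold. The sketch below assumes the RTTs are already parsed into a list of milliseconds; the function name, the 50 ms threshold and the synthetic samples are illustrative assumptions, not data from this issue.

```python
from typing import List, Tuple

def find_spike_events(rtts_ms: List[float], threshold_ms: float) -> List[Tuple[int, int]]:
    """Return (start_index, length) for each run of consecutive samples above threshold_ms."""
    events = []
    start = None
    for i, rtt in enumerate(rtts_ms):
        if rtt > threshold_ms:
            if start is None:
                start = i  # first sample of a new spike event
        elif start is not None:
            events.append((start, i - start))
            start = None
    if start is not None:
        # The log ended while a spike was still in progress.
        events.append((start, len(rtts_ms) - start))
    return events

# Synthetic log: quiet, then 10 slow pings, then quiet again (not measured data).
samples = [5.0] * 20 + [120.0] * 10 + [5.0] * 20
interval_s = 0.1  # one ping every 0.1 s, as in the measurement above
for start, length in find_spike_events(samples, threshold_ms=50.0):
    print(f"spike starts at {start * interval_s:.1f}s, spans {length} pings "
          f"(~{length * interval_s:.1f}s)")
```

With pings every 0.1 s, a run of 8 to 12 slow samples corresponds to an event of roughly one second.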

We are getting reports that the loss recovery algorithm gets very confused, congestion control kicks in, etc.

The test programs that made the problem obvious were sending QUIC datagrams at a high rate (600 datagrams per second). The first step would be to try to reproduce the problem in a simulation.

TimEvens commented 1 year ago

https://www.privateoctopus.com/2023/05/18/the-weird-case-of-wifi-latency-spikes.html

This problem is still there. At this point we do not believe this is an issue with standard congestion control algorithms such as bbr, newreno, ...

We have tried bbr, newreno, and cubic in picoquic. All of them show the same behavior with the latency spikes. It seems a bit odd that they all behave the same way, with the callbacks delayed per stream (including datagram) during the latency spike. Please note that the latency spike results in ZERO packet loss; it's just a latency spike of ~100-150 ms for a few packets over a span of a few seconds. Not all packets are delayed during the spike.

The challenge we are running into is that picoquic callbacks are used to service bytes to transmit. Each callback (per stream or datagram) transmits up to max length (MTU or less) of data. When the callback is delayed by 100 ms a few times, it results in an accumulation of pending transmit data.
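To put a number on that accumulation, here is a small tick-based sketch. It assumes the application produces one datagram per tick at the 600 datagrams per second rate quoted at the top of this issue, and that each callback can drain a hypothetical `service_per_tick` datagrams when it fires. None of these names are picoquic APIs; they only model the queue.

```python
from typing import List, Tuple

def simulate_backlog(total_ticks: int,
                     stalls: List[Tuple[int, int]],
                     service_per_tick: int = 2) -> List[int]:
    """Return the pending-queue depth after each tick.

    stalls holds half-open (start_tick, end_tick) windows in which no callback fires.
    """
    queue = 0
    depth = []
    for tick in range(total_ticks):
        queue += 1  # the application produces one datagram per tick
        stalled = any(start <= tick < end for start, end in stalls)
        if not stalled:
            # A callback fires and drains up to service_per_tick datagrams.
            queue -= min(queue, service_per_tick)
        depth.append(queue)
    return depth

TICKS_PER_SECOND = 600                 # one tick per datagram at 600 datagrams/s
stall_ticks = TICKS_PER_SECOND // 10   # a 100 ms stall lasts 60 ticks
depth = simulate_backlog(300, [(100, 100 + stall_ticks)])
print("peak backlog:", max(depth), "datagrams")
print("ticks needed to drain after the stall:", depth.index(0, 160) - 160)
```

A single 100 ms stall at this rate leaves 60 datagrams pending (600 × 0.1), and the backlog only drains as fast as the spare callback capacity allows, so several spikes close together stack up.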

huitema commented 1 year ago

When a spike occurs, Wi-Fi operation is suspended for 100-250ms. The Wi-Fi station sends a message to the Wi-Fi base station to "please queue data for 250ms, and ping me after that", and then it goes to sleep. Nothing is sent on the link, nothing is received. Technically, it is possible to send on the UDP socket -- the packets will be queued inside the kernel, and will only be sent when Wi-Fi operation resumes, after the end of the event.

Picoquic will keep "sending" as long as congestion control and flow control allow. But congestion control only allows a few buffers to be sent, corresponding to the usual RTT. After that, sending new data will be blocked. The "prepare to send" callback will not happen, because the stack cannot send new data in the absence of congestion control credits. These callbacks will only resume after the end of the event, when the stack receives new acknowledgements.
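The gating described above can be illustrated with a toy model. This is only a sketch under strong assumptions: a fixed congestion window counted in packets, a constant RTT, a sender that tries to send one packet every `interval_ms`, and a suspension window during which packets and acknowledgements are held until the link wakes up. It is not picoquic code and ignores how a real congestion controller would react.

```python
import heapq
from typing import List

def simulate_cwnd_gating(duration_ms: int, rtt_ms: int, cwnd_packets: int,
                         interval_ms: int, suspend_start_ms: int,
                         suspend_end_ms: int) -> List[int]:
    """Return the times (ms) at which the sender was allowed to hand a packet to the socket."""

    def release(t: int) -> int:
        # Anything reaching the Wi-Fi station while it sleeps waits until it wakes up.
        return suspend_end_ms if suspend_start_ms <= t < suspend_end_ms else t

    ack_times: List[int] = []  # min-heap of ack arrival times for packets in flight
    send_times: List[int] = []
    for now in range(duration_ms):
        while ack_times and ack_times[0] <= now:
            heapq.heappop(ack_times)  # an ack arrived: one congestion credit is freed
        if now % interval_ms == 0 and len(ack_times) < cwnd_packets:
            send_times.append(now)
            # The packet leaves when the link is awake; the ack returns one RTT later
            # and is itself held back if the station is asleep at that moment.
            heapq.heappush(ack_times, release(release(now) + rtt_ms))
    return send_times

sends = simulate_cwnd_gating(duration_ms=500, rtt_ms=20, cwnd_packets=20,
                             interval_ms=2, suspend_start_ms=100, suspend_end_ms=350)
during = [t for t in sends if 100 <= t < 350]
print("packets accepted during the suspension:", len(during))
print("last send before blocking:", during[-1], "ms")
print("first send after the suspension:", min(t for t in sends if t >= 350), "ms")
```

In this model the sender uses up the remaining window shortly after the suspension starts and then stays blocked until acknowledgements arrive at the end of the event, which is the point where the "prepare to send" callbacks resume.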

That behavior is in line with the "just in time" philosophy. That pretty much means that the queues will be kept inside the application, instead of building in front of the Wi-Fi driver. In theory, if the application was smart, it could avoid building queues and just send the latest data -- but in practice it will just queue data and send them later.
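As a sketch of the "just send the latest data" idea, an application could keep a single pending slot instead of a FIFO, so that a delayed send callback picks up only the freshest sample. The class below is hypothetical and not part of picoquic; it only suits data where stale samples have no value, such as periodic state updates.

```python
from typing import Optional

class LatestOnlySlot:
    """Holds at most one pending sample; a newer sample replaces an unsent older one."""

    def __init__(self) -> None:
        self._pending: Optional[bytes] = None
        self.dropped = 0  # samples overwritten before they could be sent

    def offer(self, sample: bytes) -> None:
        if self._pending is not None:
            self.dropped += 1
        self._pending = sample

    def take(self) -> Optional[bytes]:
        """Called from the send callback: return the freshest sample, or None if empty."""
        sample, self._pending = self._pending, None
        return sample

slot = LatestOnlySlot()
# The send callback is blocked while the application produces 60 samples.
for seq in range(60):
    slot.offer(f"sample-{seq}".encode())
# The callback finally fires: only the newest sample goes out.
print(slot.take(), "sent;", slot.dropped, "stale samples skipped")
```

The trade-off is explicit: the queue never grows, but everything produced during the suspension except the last sample is discarded.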

huitema commented 1 month ago

This has been resolved by the work on "Wi-Fi suspension" in BBR.