quinn-rs / quinn

Async-friendly QUIC implementation in Rust
Apache License 2.0

Congestion window grows without bound when CPU-bound #1126

Open Ralith opened 3 years ago

Ralith commented 3 years ago

If throughput is limited by CPU resources, then every non-app-limited transmit will, once ACKed, cause the congestion window to grow. If CPU resources abruptly increase (e.g. due to reduced contention from other tasks and/or processes) then we risk flooding the link. The congestion controller interface/implementation should be modified such that success with N in-flight bytes is interpreted as license to grow the congestion window by at most a constant multiple of N.
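A minimal sketch of that rule, using hypothetical names (GROWTH_FACTOR, CwndState, on_ack) rather than quinn's actual congestion-controller interface:

```rust
/// Illustrative only: cap window growth at a constant multiple of the bytes
/// that were actually in flight, so a sudden increase in CPU headroom cannot
/// release an arbitrarily inflated window onto the link.
const GROWTH_FACTOR: u64 = 2; // hypothetical constant

struct CwndState {
    cwnd: u64, // congestion window in bytes
}

impl CwndState {
    fn on_ack(&mut self, newly_acked: u64, bytes_in_flight: u64) {
        // Ordinary (slow-start-style) growth proposal.
        let proposed = self.cwnd + newly_acked;
        // Success with N in-flight bytes only licenses a window of roughly
        // GROWTH_FACTOR * N; growth beyond that is clamped. Loss/ECN handling
        // would shrink the window elsewhere, so never shrink it here.
        let cap = bytes_in_flight.saturating_mul(GROWTH_FACTOR);
        self.cwnd = proposed.min(cap).max(self.cwnd);
    }
}

fn main() {
    let mut state = CwndState { cwnd: 1_000_000 };
    // Only ~100 kB were in flight (CPU-bound), so the window stays put
    // instead of inflating further.
    state.on_ack(50_000, 100_000);
    assert_eq!(state.cwnd, 1_000_000);
    // With 600 kB in flight the window may grow, up to the clamp.
    state.on_ack(300_000, 600_000);
    assert_eq!(state.cwnd, 1_200_000);
    println!("cwnd = {}", state.cwnd);
}
```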

Matthias247 commented 3 years ago

When implementing https://github.com/quinn-rs/quinn/pull/1127 I noticed that cwnd growth depends heavily on the setting of the new MAX_TRANSMIT_DATAGRAMS constant. When it is configured lower, the CWND is more likely to grow towards infinity, since a single poll will never yield all available datagrams, which leaves app_limited set to false and allows the CWND to grow. If we increase the value and all datagrams are polled, then app_limited might get set to true, which stops the CWND from growing. However, that also means higher latencies.

Some examples:

MAX_TRANSMIT_DATAGRAMS = 20:

path: PathStats {
        rtt: 389.489µs,
        cwnd: 536692490,

MAX_TRANSMIT_DATAGRAMS = 30:

path: PathStats {
        rtt: 478.052µs,
        cwnd: 2821703,

MAX_TRANSMIT_DATAGRAMS = 50:

path: PathStats {
        rtt: 646.804µs,
        cwnd: 1481010,

MAX_TRANSMIT_DATAGRAMS = 100:

path: PathStats {
        rtt: 533.493µs,
        cwnd: 1476074,

While this benchmark makes it look like 30 would be a better value than 20, I would note that I'm running this on a desktop with extremely high single-threaded performance; pretty much everything else would have a harder time producing 30 packets and would see higher latencies.

Overall I think the way app_limited is set at the moment is not optimal, and the strategy needs to be improved.
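To make that interaction concrete, here is a simplified model of the decision, not quinn's actual poll_transmit implementation: when the per-call cap is hit, the poll cannot tell whether the application had more data, so app_limited stays false and the window keeps growing even though the real bottleneck is CPU.

```rust
const MAX_TRANSMIT_DATAGRAMS: usize = 20; // the per-poll batch cap

/// Returns (datagrams produced, app_limited) for one transmit poll.
fn poll_transmit(pending_datagrams: usize) -> (usize, bool) {
    let produced = pending_datagrams.min(MAX_TRANSMIT_DATAGRAMS);
    // app_limited is only set when the queue drained before the cap was hit.
    let app_limited = produced < MAX_TRANSMIT_DATAGRAMS;
    (produced, app_limited)
}

fn main() {
    // A large backlog hits the cap every time: app_limited never becomes
    // true, so nothing stops cwnd growth. Raising the cap makes draining
    // (and thus app_limited = true) more likely, at the cost of latency.
    assert_eq!(poll_transmit(1_000), (20, false));
    assert_eq!(poll_transmit(5), (5, true));
}
```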

Ralith commented 2 years ago

What if we maintain a separate cpu_limited flag, set it whenever we stop calling poll_transmit early, and unset it whenever poll_transmit is fully drained?
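Roughly like this, assuming the driver keeps calling poll_transmit promptly; the names below are illustrative, not an existing quinn API:

```rust
struct TransmitState {
    /// True while the last poll stopped because of the per-call cap rather
    /// than because the queue was empty, i.e. we were CPU/batch limited.
    cpu_limited: bool,
}

impl TransmitState {
    /// Record the outcome of one poll_transmit call.
    fn after_poll(&mut self, produced: usize, cap: usize) {
        // Hitting the cap means we stopped early with data likely still
        // pending; producing fewer means the queues were fully drained.
        self.cpu_limited = produced == cap;
    }

    /// The congestion window would only be allowed to grow when the
    /// connection is neither app-limited nor cpu-limited.
    fn may_grow_cwnd(&self, app_limited: bool) -> bool {
        !app_limited && !self.cpu_limited
    }
}

fn main() {
    let mut state = TransmitState { cpu_limited: false };
    state.after_poll(20, 20); // stopped at the cap: more data was pending
    assert!(!state.may_grow_cwnd(false));
    state.after_poll(7, 20); // fully drained this call
    assert!(state.may_grow_cwnd(false));
}
```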

Matthias247 commented 2 years ago

Interesting thought. But it kind of relies on the application calling poll_transmit again as early as possible, and not delaying it for other reasons - right? The quinn wrapper does, but it would be some kind of implicit contract (similar to the read/EAGAIN/epoll ones).

The other question is whether this would already be improved if #1131 gets addressed by merging tasks. I'm not sure; I haven't thought too deeply about this.

Ralith commented 2 years ago

Interesting thought. But it kind of relies on the application calling poll_transmit again as early as possible, and not delaying it for other reasons - right?

So long as the polling pattern is consistent, I think the behavior would be correct wrt. the amount of CPU time being given. If polling is arbitrarily erratic I don't think we can hope to guarantee good behavior regardless.

The other question is whether this would already be improved if #1131 gets addressed by merging tasks

I don't think so; the fundamental problem is that the current construction of app_limited assumes we'll always be able to drain the connection's logical queues in a timely fashion, and when we're CPU bound we by definition can't, regardless of how work is distributed. Performance might get better, but the design flaw stands.