quinn-rs / quinn

Async-friendly QUIC implementation in Rust

Assertion failure when sending a lot of datagrams #1413

Closed · jfro closed 2 years ago

jfro commented 2 years ago

This is the same assert as #1196, reproduced in a debug build with a setup that sends 1000 datagrams 1 ms apart.

https://github.com/jfro/quictest

Just `cargo run` should demonstrate it. It currently happens every time on my MBP M1; I haven't checked whether the machine matters, though.
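For reference, the send side described above boils down to roughly the following loop (a minimal sketch against quinn's datagram API, assuming an established `quinn::Connection` plus tokio, bytes, and anyhow; the endpoint/certificate setup from the linked repo is elided):

```rust
use std::time::Duration;

use bytes::Bytes;

// Minimal sketch of the described send loop, assuming an established
// `quinn::Connection` (endpoint/certificate setup elided).
async fn send_datagrams(connection: &quinn::Connection) -> anyhow::Result<()> {
    // Fill each datagram up to the connection's advertised limit; the
    // assert fires when datagrams near this size race with pending ACKs.
    let size = connection
        .max_datagram_size()
        .expect("peer must support datagrams");
    let payload = Bytes::from(vec![0u8; size]);
    for _ in 0..1000 {
        connection.send_datagram(payload.clone())?;
        tokio::time::sleep(Duration::from_millis(1)).await;
    }
    Ok(())
}
```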

Ralith commented 2 years ago

This bug is not triggered by sending large numbers of datagrams, but rather by sending datagrams at or near the maximum size while there are ACKs pending. We include all unacknowledged ACK frames at the start of every outgoing packet, which consumes a portion of the MTU not accounted for by the datagram size limit. The queued application datagram then unexpectedly fails to fit: in debug builds this trips an assert because the packet ends up ACK-only, a composition we don't expect; in release builds it leads to unreasonably high rates of datagram loss.
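To make the budget problem concrete, here is an illustrative sketch (the constants and overheads are made-up example values, not quinn's internals):

```rust
// Illustrative numbers only; not quinn's actual constants or logic.
const MTU: usize = 1200; // assumed path MTU
const PACKET_OVERHEAD: usize = 42; // e.g. short header + AEAD tag

fn main() {
    // The application-visible datagram limit is derived from the full
    // packet payload budget...
    let max_datagram = MTU - PACKET_OVERHEAD;

    // ...but an ACK frame written at the start of the same packet
    // consumes part of that budget.
    let ack_frame_len = 25; // e.g. an ACK frame carrying a few ranges

    let remaining = max_datagram - ack_frame_len;
    let queued_datagram = max_datagram; // application filled it to the limit

    // The datagram no longer fits and can't be dequeued: in debug builds
    // this surfaces as the ACK-only-packet assert, in release builds as
    // a dropped datagram.
    assert!(queued_datagram > remaining);
    println!("budget {max_datagram}, after ACK {remaining}: datagram no longer fits");
}
```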

We should solve this by including ACK frames less frequently, ensuring packets with capacity for a maximum-size datagram can regularly be sent. However, this must be balanced against RFC 9000's requirements for the timely transmission of ACKs. Tentatively, I think a reasonable strategy would be to include an ACK iff we have received an ACK-eliciting packet since we last transmitted, and perhaps also if our most recently sent ACK-bearing packet is deemed lost.
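Sketched as state, that tentative strategy might look something like this (hypothetical names and structure, not quinn's actual fields):

```rust
// Hedged sketch of the tentative strategy; field and method names are
// made up, not quinn's.
struct AckState {
    /// Set when an ACK-eliciting packet arrives; cleared on transmit.
    ack_eliciting_received_since_send: bool,
    /// Set if loss detection deems our latest ACK-bearing packet lost.
    last_ack_bearing_packet_lost: bool,
}

impl AckState {
    /// Include an ACK frame iff we received an ACK-eliciting packet since
    /// we last transmitted, or (perhaps) our most recently sent
    /// ACK-bearing packet is deemed lost.
    fn should_include_ack(&self) -> bool {
        self.ack_eliciting_received_since_send || self.last_ack_bearing_packet_lost
    }
}

fn main() {
    let state = AckState {
        ack_eliciting_received_since_send: true,
        last_ack_bearing_packet_lost: false,
    };
    assert!(state.should_include_ack());
}
```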

Ralith commented 2 years ago

The logic outlined above was already implemented in https://github.com/quinn-rs/quinn/commit/a2fe103f4b5f08e8c7a623ab7da1c4b6d17e4749, which is in HEAD but not the 0.8 branch. This prevents the unreasonable loss rates observed in 0.8. The assert can still be triggered whenever both a fresh ACK and a full-size datagram are pending, because a single packet cannot contain both an ACK frame and a full-sized datagram, and the current behavior gives precedence to the ACK frame.

If an ACK-only packet is not currently permitted, we must instead give precedence to the datagrams, else we are in violation of RFC 9000 §13.2.1. Conversely, when an ACK-only packet is permitted, to guarantee that a timely ACK is transmitted at all we must send one even if we have full-sized datagrams queued that could be sent instead.

A simple solution would be to take the above commit a step further, and only send ACKs on the receipt of ACK-eliciting packets (i.e. when an ACK-only packet is permissible), rather than after receipt of any packet. This should also simplify our logic slightly.
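Putting the two rules together, the precedence decision could be sketched like so (a simplified, hypothetical model; real packet assembly has many more inputs):

```rust
// Simplified, hypothetical model of the precedence rules above.
enum NextFrame {
    Ack,
    Datagram,
}

fn choose_frame(
    ack_pending: bool,
    // True iff we received an ACK-eliciting packet, i.e. an ACK-only
    // packet would be permitted by RFC 9000 §13.2.1.
    ack_only_permitted: bool,
    full_size_datagram_queued: bool,
) -> Option<NextFrame> {
    match (ack_pending, ack_only_permitted, full_size_datagram_queued) {
        // A full-size datagram and an ACK can't share a packet; when an
        // ACK-only packet is allowed, send the ACK first in its own packet
        // so it goes out in a timely fashion.
        (true, true, _) => Some(NextFrame::Ack),
        // When an ACK-only packet is not allowed, the datagram takes
        // precedence rather than sending an impermissible ACK packet.
        (_, _, true) => Some(NextFrame::Datagram),
        _ => None,
    }
}

fn main() {
    // ACK pending but ACK-only packets not permitted: the datagram wins.
    assert!(matches!(
        choose_frame(true, false, true),
        Some(NextFrame::Datagram)
    ));
}
```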

Matthias247 commented 2 years ago

There seem to be multiple strategies to avoid this:

  1. Make sure large datagram packets don't include ACKs, and send ACKs separately when necessary
  2. Reserve a certain size for ACK data, and maybe limit the max amount of ACK data that is used in the library (which might already be done?)

It seems like the fix prefers 1). Why that over 2)? Is it because it allows for larger frames, and the typical stream use case doesn't care how large ACK blocks are?

Ralith commented 2 years ago

We do have a hard upper bound on the number of ACK ranges (which bounds the ACK frame size), but it's not tiny, so reserving space for it would significantly reduce the maximum application datagram size. We could make the bound smaller, but I'm not sure where to draw the balance, and just sending the ACK in a separate packet when necessary means we don't have to, while maximizing the efficiency of application datagram use.
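For a rough sense of scale, here is illustrative worst-case arithmetic (the range cap and per-field varint sizing are assumptions, not quinn's actual bound or reservation logic):

```rust
// Illustrative worst-case ACK frame size for option 2), with a
// hypothetical range cap. QUIC varints are 1/2/4/8 bytes; 8 bytes is the
// per-field worst case.
fn worst_case_ack_frame_len(max_ranges: usize) -> usize {
    let frame_type = 1; // ACK frame type byte
    let largest_acked = 8;
    let ack_delay = 8;
    let range_count = 8;
    let first_range = 8;
    // Each additional range is a (gap, length) pair of varints.
    let per_extra_range = 16;
    frame_type + largest_acked + ack_delay + range_count + first_range
        + max_ranges.saturating_sub(1) * per_extra_range
}

fn main() {
    // With, say, a 64-range cap, reserving the worst case eats most of a
    // ~1200-byte packet, leaving little room for an application datagram.
    println!("{} bytes", worst_case_ack_frame_len(64)); // prints "1041 bytes"
}
```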