Black hole detection false-positives

quinn-rs / quinn

Async-friendly QUIC implementation in Rust

Apache License 2.0

3.54k stars 360 forks source link

Black hole detection false-positives #1855

Closed Ralith closed 1 month ago

Ralith commented 1 month ago

Black hole detection triggers when the number of "suspicious loss bursts" since the last MTU-sized packet exceeds a threshold:

https://github.com/quinn-rs/quinn/blob/a35ad73f97d7c4087c7c62b1650c589d2c0637ea/quinn-proto/src/connection/mtud.rs#L398-L409

Due to approximations in packet assembly, we (almost?) never actually produce an exactly MTU-sized packet. As a result, any loss rate of packets larger than the minimum threshold to be suspicious will eventually trigger black hole detection, resulting in a spurious MTU reduction and loss of throughput.

We should relax the conditions for resetting the counter. For example, if we've confirmed delivery of a packet of a certain size, loss bursts containing packets of equal or smaller size shouldn't be considered suspicious any more.

Ralith commented 1 month ago

Design proposal:

A loss burst is said to be suspicious (possibly caused by an MTU reduction) if its smallest packet is larger than any more recently acknowledged packet. Packets are deemed lost when at most one packet threshold (typically set to 3) worth of later packets have been acknowledged. We therefore need to store only at most one packet threshold worth of packet sizes, those of the most recently acknowledged packets not preceding a loss burst, to judge that a future loss burst might be caused by an MTU reduction. Past loss bursts are forgotten immediately upon acknowledgement of a larger packet than their smallest. If the number of past loss bursts exceeds a threshold, then we report a black hole.

djc commented 1 month ago

Sounds okay. Should we run this by quicdev or something?

Edit: ah, I saw you already did. Should have checked first. 😄

Ralith commented 1 month ago

Some thoughts from that discussion:

Some paths don't have a stable MTU, e.g. ones that internally load-balance over heterogeneous paths at a lower level. These are very difficult to detect, perhaps best dealt with manually (complain to your network administrator and/or configure a maximum MTU).
Clamping MTU to the largest packet successfully delivered in some fixed window is safe, but prone to ratcheting downwards with application datagrams.
Probing regularly and/or when suspicious loss occurs helps provide a source of truth when application traffic isn't so obliging, but isn't free.

Still seems to me like the approach I sketched above is an attractive sweet spot.

djc commented 1 month ago

A loss burst is said to be suspicious (possibly caused by an MTU reduction) if its smallest packet is larger than any more recently acknowledged packet.

In reviewing #1858, I'm wondering if this holds? If the MTU reduces, why would we only consider losses to be suspicious only if the smallest packet is larger than any more recently acknowledged packet? The MTU could have decreased past whatever was recently acknowledged, right?

Ralith commented 1 month ago

When a packet is acknowledged, we know that the MTU is at least that packet's size. If a loss burst includes packets that are smaller than a packet that was transmitted later and which has been acknowledged, then we know at least some of the packets in that burst were not dropped for MTU reasons, so most likely the entire burst was lost due to congestion or link errors.