quinn-rs / quinn

Async-friendly QUIC implementation in Rust

Degraded performance with elevated RTT - especially on Windows #1409

Open Matthias247 opened 2 years ago

Matthias247 commented 2 years ago

I've been playing around a bit with latency injection and measuring throughput. The setup is probably slightly broken and needs some more tuning, but it already showed some surprising results.

Here's the measured throughput in the bulk benchmark for downloading 100MB of data when a given delay is injected in both directions (the total RTT is twice that delay).

| Delay | Windows | Linux |
|-------|---------|-------|
| 0ms | 117MB/s | 454MB/s |
| 1ms | 4MB/s | 55MB/s |
| 2ms | 3.7MB/s | 35MB/s |
| 10ms | 5.8MB/s | 30MB/s |
| 50ms | 3.3MB/s | 10MB/s |
| 200ms | 2.13MB/s | 2.62MB/s |
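
(For reference, the numbers are just bytes received over wall-clock time on the client side. A rough sketch of that measurement - not the actual bulk benchmark code, just the shape of it - assuming an already established `quinn::Connection` and a server that streams the payload on an incoming bi-directional stream:)

```rust
use std::time::Instant;

// Sketch only: measure how fast the peer delivers `expected` bytes over a
// single bi-directional stream. `measure_bulk_download` is a made-up helper,
// not part of the benchmark.
async fn measure_bulk_download(
    connection: &quinn::Connection,
    expected: usize,
) -> anyhow::Result<f64> {
    let (mut send, recv) = connection.open_bi().await?;
    send.write_all(b"bulk").await?; // tiny request so the peer sees the stream
    send.finish().await?;           // finish() is async in the quinn versions of this era

    let start = Instant::now();
    let data = recv.read_to_end(expected).await?;
    let elapsed = start.elapsed();

    Ok(data.len() as f64 / 1e6 / elapsed.as_secs_f64()) // MB/s
}
```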

The variants with extra latency are not CPU bound - the library simply doesn't want to send data faster. If I run them for longer, the average throughput actually increases, which indicates the congestion controller is still raising the congestion window. This is also confirmed by stats.

E.g. for a 10ms delay

| Delay | Windows 100MB | Windows 200MB | Linux 100MB | Linux 200MB |
|-------|---------------|---------------|-------------|-------------|
| 10ms | 3.9MB/s | 5.18MB/s | 31MB/s | 30MB/s |

Changing the congestion controller to BBR makes it ramp up faster and gets better numbers, but it still isn't great.
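
(For anyone reproducing this, the congestion controller is swapped via the transport config. A sketch only - the exact `congestion_controller_factory` signature and re-export paths differ a bit between quinn releases:)

```rust
use std::sync::Arc;
use quinn::{congestion, ClientConfig, TransportConfig};

// Sketch: use BBR instead of the default (Cubic) congestion controller.
// Recent quinn versions take an Arc'd config as shown here; older ones take
// the config value directly, so adjust to the version in use.
fn with_bbr(mut config: ClientConfig) -> ClientConfig {
    let mut transport = TransportConfig::default();
    transport.congestion_controller_factory(Arc::new(congestion::BbrConfig::default()));
    config.transport_config(Arc::new(transport));
    config
}
```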

I'm not fully sure what causes the degradation on Linux, given that it isn't CPU bound, but on Windows I noticed the following:

When 1ms latency (2ms RTT) is injected, the stats show a much higher RTT:

path: PathStats {
        rtt: 31.6284ms,
        cwnd: 945959,
        congestion_events: 11,
        lost_packets: 77,
        lost_bytes: 92400,
        sent_packets: 89928,
    },

So besides the 2ms of latency we wanted, we actually get about 30ms of extra latency.

For comparison, on Linux:

path: PathStats {
        rtt: 4.368618ms,
        cwnd: 281805,
        congestion_events: 12,
        lost_packets: 275,
        lost_bytes: 330000,
        sent_packets: 90114,
    },

There's only around 2ms of extra latency.
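
(The PathStats dumps above are part of what the connection stats expose; roughly how they can be polled while the transfer runs:)

```rust
use std::time::Duration;

// Sketch: periodically dump the path-level stats (RTT, congestion window,
// losses) of a live connection, which is where the PathStats output above
// comes from.
async fn log_path_stats(connection: quinn::Connection) {
    loop {
        let path = connection.stats().path;
        println!(
            "rtt={:?} cwnd={} congestion_events={} lost_packets={}",
            path.rtt, path.cwnd, path.congestion_events, path.lost_packets
        );
        tokio::time::sleep(Duration::from_secs(1)).await;
    }
}
```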

A bit more digging showed that the extra latency is introduced by tokio's timer precision (https://github.com/tokio-rs/tokio/issues/5021). That causes the network simulation to forward packets later than intended - which by itself would be a simulation-only issue. However, the library should still compensate for the higher RTT by increasing the congestion window even further. It seems it won't do that due to pacing: with pacing, the full congestion window isn't used at once - instead packets are sent out in 2ms intervals, spaced out by timers. When the associated timer turns that 2ms into 16ms, most of the congestion window isn't used. And it might not even be increased, due to being deemed app-limited (not sure).
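
(A crude back-of-envelope for that explanation, using the Windows numbers above: the congestion controller allows roughly cwnd bytes per RTT, and if the pacer only gets to send during 2ms out of every 16ms, only about 1/8 of that actually goes out. It's a rough model, but it lands close to the measured ~4MB/s:)

```rust
// Rough model: throughput ≈ (cwnd / rtt) * (intended pacing interval / actual timer interval).
fn paced_throughput_estimate(cwnd_bytes: f64, rtt_s: f64, intended_s: f64, actual_s: f64) -> f64 {
    (cwnd_bytes / rtt_s) * (intended_s / actual_s).min(1.0)
}

fn main() {
    // cwnd ~946KB and rtt ~31.6ms from the Windows PathStats above;
    // 2ms pacing interval stretched to ~16ms by the Windows timer resolution.
    let estimate = paced_throughput_estimate(945_959.0, 0.0316, 0.002, 0.016);
    println!("~{:.1} MB/s", estimate / 1e6); // prints "~3.7 MB/s"
}
```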

I tried disabling pacing, and it indeed increases throughput:

| Delay | Windows | Linux (default socket buffers) | Linux (2MB socket buffers) |
|-------|---------|--------------------------------|----------------------------|
| 10ms | 30MB/s | 5.8MB/s | 48.5MB/s |
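
(The 2MB socket buffer variant boils down to enlarging the UDP send/receive buffers before the socket is handed to the endpoint. One way to do that with the socket2 crate - a sketch, not necessarily how the numbers above were produced:)

```rust
use std::net::{SocketAddr, UdpSocket};
use socket2::{Domain, Protocol, Socket, Type};

// Sketch: build a UDP socket with 2MB send/receive buffers; the resulting std
// socket can then be passed to quinn's endpoint constructor (whose exact
// signature depends on the quinn version in use).
// Note: Linux caps these values at net.core.rmem_max / wmem_max unless those
// sysctls are raised as well.
fn udp_socket_with_large_buffers(addr: SocketAddr) -> std::io::Result<UdpSocket> {
    let socket = Socket::new(Domain::for_address(addr), Type::DGRAM, Some(Protocol::UDP))?;
    socket.set_send_buffer_size(2 * 1024 * 1024)?;
    socket.set_recv_buffer_size(2 * 1024 * 1024)?;
    socket.bind(&addr.into())?;
    Ok(socket.into())
}
```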

So the lack of timer precision in combination with pacing indeed limits throughput. However, since the simulation itself is also affected by timer precision, it would be nice to verify this in a real deployment.

I assume that in a real-world deployment, where the peer paces well and acknowledges packets more often, the difference would be less pronounced, since the endpoint is also woken up by packets from the peer rather than just by timers.

Matthias247 commented 2 years ago

I hacked up a higher precision timer for the network simulation (using a background thread and https://crates.io/crates/spin_sleep/1.1.1). This gets the Windows version from 5MB/s to 50MB/s at 10ms delay - on par with the Linux version. Both with pacing enabled.
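
(Roughly the shape of that hack - a simplified sketch, not the actual simulation code; the channel-based forwarding is just for illustration:)

```rust
use std::sync::mpsc;
use std::time::Instant;

// Sketch: a dedicated thread that spin-sleeps until each packet's forwarding
// deadline instead of relying on the coarse tokio timer, then hands the packet
// back to the async side.
fn spawn_precise_forwarder(
    out: tokio::sync::mpsc::UnboundedSender<Vec<u8>>,
) -> mpsc::Sender<(Instant, Vec<u8>)> {
    let (tx, rx) = mpsc::channel::<(Instant, Vec<u8>)>();
    std::thread::spawn(move || {
        // Sleep natively until ~100µs before the deadline, spin for the rest.
        let sleeper = spin_sleep::SpinSleeper::new(100_000);
        for (deadline, packet) in rx {
            let now = Instant::now();
            if deadline > now {
                sleeper.sleep(deadline - now);
            }
            let _ = out.send(packet);
        }
    });
    tx
}
```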

Unfortunately the better simulation wakes up the endpoint often enough that pacing accidentally also runs with higher precision. So this setup doesn't yet show what the impact of missed pacing timers on end users would be. But I assume it would be around the same degradation - down towards 5MB/s, and less if less data is transmitted.