And... here are some benchmarks!
I'm running the server using cargo run --release --bin perf_server -- --no-protection
and the client using cargo run --release --bin perf_client -- --no-protection --duration 15
(note that this makes the benchmarks misleading for real-world use cases, because encryption is disabled). As you can deduce from the cargo commands used, I'm running both benchmarks on the same machine.
For the scenarios with ACK frequency enabled I'm doing so through the following code:
let mut ack_freq = AckFrequencyConfig::default();
ack_freq.ack_eliciting_threshold(10);
transport.ack_frequency_config(Some(ack_freq));
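For reference, here is a self-contained sketch of the same configuration as a hypothetical helper, assuming the quinn-proto API introduced in this PR (in particular, that the threshold parameter is a VarInt); the surrounding endpoint setup is omitted:
use std::sync::Arc;
use quinn_proto::{AckFrequencyConfig, TransportConfig, VarInt};

fn transport_with_ack_frequency() -> Arc<TransportConfig> {
    // Allow the peer to receive several ack-eliciting packets before it must send an ACK.
    let mut ack_freq = AckFrequencyConfig::default();
    ack_freq.ack_eliciting_threshold(VarInt::from_u32(10));

    let mut transport = TransportConfig::default();
    transport.ack_frequency_config(Some(ack_freq));
    Arc::new(transport)
}
The resulting Arc<TransportConfig> can then be installed on the client or server config as usual.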
There seems to be no significant performance difference in the benchmarks, though we probably need a more statistically rigorous approach to benchmarking if we want to be sure. @stormshield-damiend suggested we might see clearer differences in performance if there is higher latency (instead of it being on the order of microseconds), so I'll benchmark again next week with a more realistic latency.
│ Upload Duration │ Download Duration │ FBL │ Upload Throughput │ Download Throughput
──────┼─────────────────┼───────────────────┼────────────┼───────────────────┼────────────────────
AVG │ 862.00µs │ 811.00µs │ 50.00µs │ 1199.97 MiB/s │ 1293.80 MiB/s
P0 │ 661.00µs │ 534.00µs │ 1.00µs │ 32.34 MiB/s │ 34.34 MiB/s
P10 │ 741.00µs │ 678.00µs │ 2.00µs │ 986.50 MiB/s │ 1089.00 MiB/s
P50 │ 810.00µs │ 762.00µs │ 47.00µs │ 1235.00 MiB/s │ 1313.00 MiB/s
P90 │ 1.01ms │ 919.00µs │ 79.00µs │ 1350.00 MiB/s │ 1475.00 MiB/s
P100 │ 30.91ms │ 29.12ms │ 470.00µs │ 1513.00 MiB/s │ 1873.00 MiB/s
│ Upload Duration │ Download Duration │ FBL │ Upload Throughput │ Download Throughput
──────┼─────────────────┼───────────────────┼────────────┼───────────────────┼────────────────────
AVG │ 920.00µs │ 805.00µs │ 53.00µs │ 1190.95 MiB/s │ 1280.93 MiB/s
P0 │ 670.00µs │ 134.00µs │ 1.00µs │ 33.44 MiB/s │ 34.12 MiB/s
P10 │ 746.00µs │ 699.00µs │ 4.00µs │ 996.50 MiB/s │ 1113.00 MiB/s
P50 │ 819.00µs │ 774.00µs │ 49.00µs │ 1222.00 MiB/s │ 1292.00 MiB/s
P90 │ 1.00ms │ 899.00µs │ 82.00µs │ 1341.00 MiB/s │ 1431.00 MiB/s
P100 │ 29.92ms │ 29.30ms │ 847.00µs │ 1493.00 MiB/s │ 7464.00 MiB/s
│ Upload Duration │ Download Duration │ FBL │ Upload Throughput │ Download Throughput
──────┼─────────────────┼───────────────────┼────────────┼───────────────────┼────────────────────
AVG │ 866.00µs │ 767.00µs │ 48.00µs │ 1187.26 MiB/s │ 1479.18 MiB/s
P0 │ 686.00µs │ 70.00µs │ 1.00µs │ 33.25 MiB/s │ 32.25 MiB/s
P10 │ 753.00µs │ 680.00µs │ 3.00µs │ 1020.50 MiB/s │ 1198.00 MiB/s
P50 │ 821.00µs │ 749.00µs │ 46.00µs │ 1219.00 MiB/s │ 1336.00 MiB/s
P90 │ 980.00µs │ 835.00µs │ 77.00µs │ 1329.00 MiB/s │ 1471.00 MiB/s
P100 │ 30.08ms │ 30.99ms │ 421.00µs │ 1458.00 MiB/s │ 14288.00 MiB/s
│ Upload Duration │ Download Duration │ FBL │ Upload Throughput │ Download Throughput
──────┼─────────────────┼───────────────────┼────────────┼───────────────────┼────────────────────
AVG │ 890.00µs │ 844.00µs │ 51.00µs │ 1182.28 MiB/s │ 1202.34 MiB/s
P0 │ 684.00µs │ 122.00µs │ 1.00µs │ 34.59 MiB/s │ 100.25 MiB/s
P10 │ 766.00µs │ 753.00µs │ 2.00µs │ 1016.50 MiB/s │ 1064.00 MiB/s
P50 │ 826.00µs │ 828.00µs │ 49.00µs │ 1211.00 MiB/s │ 1208.00 MiB/s
P90 │ 984.00µs │ 940.00µs │ 86.00µs │ 1306.00 MiB/s │ 1329.00 MiB/s
P100 │ 28.90ms │ 9.98ms │ 427.00µs │ 1462.00 MiB/s │ 8200.00 MiB/s
How do you feel about the benchmark results? It'd be nice to be able to measure things in a more principled way, to ensure any performance difference (or lack thereof) is statistically significant. Maybe it'd make sense to simulate the networking part in memory (I have a semi-working local prototype), to make the output more predictable. I probably won't have the time to work on that, though...
Today I did another round of benchmarks using the changes from #1559, so I could simulate the link's delay (using bench instead of perf).
Command used: cargo run --release --bin bulk -- --no-protection --simulate-network --simulated-link-delay [delay] --download-size 200M --stats. Each run has its own value of delay, as you can see in the results below.
I ran the benchmarks on top of the main branch (equivalent to an ack-eliciting threshold of 0), and on top of this branch (using ack-eliciting thresholds of 1 and 10). In all cases throughput remained similar, so I'm only reporting it once per benchmarked link delay.
I assume throughput is being constrained by pacing. It would be interesting to tweak the config to start right away at a high transfer rate, because that would put real stress on Quinn and thereby reveal the true performance difference. I tried setting the initial congestion window to an absurdly high value, but somehow it didn't make any difference. Any clues on how to achieve higher throughput from the start? @Ralith, @djc
Another question: do these results make any sense at all? I have yet to develop a gut feeling for the relationship between delay and throughput.
Delay = 5ms
│ Throughput │ Duration
──────┼───────────────┼──────────
AVG │ 79.59 MiB/s │ 2.51s
P0 │ 79.56 MiB/s │ 2.51s
P10 │ 79.62 MiB/s │ 2.51s
P50 │ 79.62 MiB/s │ 2.51s
P90 │ 79.62 MiB/s │ 2.51s
P100 │ 79.62 MiB/s │ 2.51s
Delay = 10ms
│ Throughput │ Duration
──────┼───────────────┼──────────
AVG │ 46.83 MiB/s │ 4.27s
P0 │ 46.81 MiB/s │ 4.27s
P10 │ 46.84 MiB/s │ 4.27s
P50 │ 46.84 MiB/s │ 4.27s
P90 │ 46.84 MiB/s │ 4.27s
P100 │ 46.84 MiB/s │ 4.27s
Delay = 50ms
│ Throughput │ Duration
──────┼───────────────┼──────────
AVG │ 10.77 MiB/s │ 18.57s
P0 │ 10.77 MiB/s │ 18.56s
P10 │ 10.77 MiB/s │ 18.58s
P50 │ 10.77 MiB/s │ 18.58s
P90 │ 10.77 MiB/s │ 18.58s
P100 │ 10.77 MiB/s │ 18.58s
Delay = 200ms
│ Throughput │ Duration
──────┼───────────────┼──────────
AVG │ 2.78 MiB/s │ 72.03s
P0 │ 2.78 MiB/s │ 72.00s
P10 │ 2.78 MiB/s │ 72.06s
P50 │ 2.78 MiB/s │ 72.06s
P90 │ 2.78 MiB/s │ 72.06s
P100 │ 2.78 MiB/s │ 72.06s
I concur that testing on a clean loopback is probably going to make it difficult to judge the impact of this change.
> I assume throughput is being constrained by pacing
This should be easy to test by hacking Pacer::delay to always return None.
> do these results make any sense at all?
It's been a while since I've done manual WAN testing, but that seems much too slow.
I think we should expect throughput equal to one congestion window per RTT, and the congestion window should grow reasonably quickly unless there's packet loss, which should be impossible in the simulated context. The bandwidth you report seems almost exactly inversely proportional to round trip time, which suggests the congestion window is not growing. I wonder if Connection::app_limited is getting set spuriously? This is intended to prevent the congestion window from growing endlessly when sending less than the link's capacity. The logic is a bit fiddly (see e.g. https://github.com/quinn-rs/quinn/issues/1126).
It should be easy to verify whether the congestion window is growing by reporting PathStats::cwnd at intervals (or even just at start/end). A large initial congestion window really should have an obvious impact, though; even pacing is cwnd-based. When the initial window is set very high, and data is nonetheless not being sent, where does the Connection::poll_transmit call bail out?
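For what it's worth, a minimal sketch of the kind of periodic reporting meant here, assuming access to a high-level quinn::Connection handle in the benchmark binary (field names follow quinn's ConnectionStats/PathStats):
use std::time::Duration;

// Spawn next to the transfer, e.g. tokio::spawn(log_cwnd(connection.clone())).
async fn log_cwnd(conn: quinn::Connection) {
    let mut ticker = tokio::time::interval(Duration::from_millis(100));
    loop {
        ticker.tick().await;
        // Snapshot the path stats and print the congestion window and RTT.
        let path = conn.stats().path;
        println!("cwnd = {} bytes, rtt = {:?}", path.cwnd, path.rtt);
    }
}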
It might be interesting to contrast the simulated network behavior in bench with the behavior of perf using netem or similar to artificially introduce delay/jitter. This should tell us whether there's something specifically weird about the simulated network logic.
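For reference, artificially delaying loopback traffic with netem would look something like this (run as root; the 50ms delay and 5ms jitter values are arbitrary, and the second command removes the qdisc again after the benchmark):
sudo tc qdisc add dev lo root netem delay 50ms 5ms
sudo tc qdisc del dev lo root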
As it turns out, throughput was limited by the default send and receive windows. We were increasing the RTT, yet keeping the window sizes constant, so the same amount of traffic had to be spread over a longer period, leading to low throughput. The solution:
// Lift the flow-control limits so the windows no longer cap throughput at higher RTTs
config.stream_receive_window(VarInt::MAX);
config.send_window(u64::MAX);
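As a rough sanity check (assuming the simulated delay applies in each direction, so RTT ≈ 2 × delay), the earlier numbers are consistent with a roughly constant ~1 MiB window being drained once per RTT:
throughput × RTT ≈ window
10ms delay: 46.84 MiB/s × 0.02s ≈ 0.94 MiB
50ms delay: 10.77 MiB/s × 0.10s ≈ 1.08 MiB
200ms delay: 2.78 MiB/s × 0.40s ≈ 1.11 MiB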
I re-benchmarked this with the config above, and now we are seeing proper performance. For instance, with delay = 5 and ack-eliciting threshold = 10:
│ Throughput │ Duration
──────┼───────────────┼──────────
AVG │ 1160.82 MiB/s │ 885.00ms
P0 │ 980.50 MiB/s │ 808.00ms
P10 │ 981.00 MiB/s │ 808.00ms
P50 │ 1161.00 MiB/s │ 876.00ms
P90 │ 1250.00 MiB/s │ 904.00ms
P100 │ 1268.00 MiB/s │ 1.04s
Unfortunately, testing with different ack frequencies and delays seemed to make no difference with regard to performance (on my system).
@Ralith @djc all this time I've been working under the assumption that this PR only makes sense if performance improves (otherwise it introduces complexity to the codebase for no benefit). Or do you think it is worthwhile to have this anyway, as long as we can show that performance doesn't worsen?
By the way, if you are interested I can prepare a branch you can use to easily benchmark locally, using the loopback interface and introducing delays with tc, to ensure nothing fishy is going on with the simulated network code (unfortunately tc doesn't work on WSL, and trying it out in a cloud VM yielded weird results).
IIRC this draft was specifically introduced to improve performance, so if we can't reproduce that result there's likely some more digging to do. As such, as long as this doesn't introduce a (non-trivial) performance regression, I think it would still be okay to merge?
Just rebased on top of latest main, to resolve a few conflicts. Any chance this PR can be reviewed soon? I don't think there is anything left to do on my side (please let me know otherwise).
Will get this in the next day or two; thanks for your patience!
Thanks for the thorough review! Just rebased on top of latest main and pushed a new version that incorporates all suggestions (unless anything slipped through the cracks).
Just rebased and pushed changes to address some of your suggestions. For those that need more clarification, I left some comments and questions.
Thanks for the review! Just rebased and pushed fixes for all points we discussed 😉
Almost there!
Is there anything else I can do to help land this? Just FYI, after July 7th it will be pretty difficult for me to find time to work on this.
Thanks again for the review! Just rebased and pushed a new version with all your comments addressed.
Sorry for taking so long to review this, thanks for all the work!
No worries, thank you and @Ralith for creating and maintaining Quinn!
Overview of the feature:
- An AckFrequencyConfig to control the parameters, which are sent to the peer in an ACK_FREQUENCY frame at the beginning of the connection (technically, the extension allows modifying the ack frequency parameters multiple times during a connection, but then you would need to come up with an algorithm to automatically tune the parameters, which IMO is out of scope for this first implementation).
- Handling of the ACK_FREQUENCY frame, updating the local ack frequency parameters based on the new parameters sent by the peer.
- Sending an IMMEDIATE_ACK frame every RTT if there are any unacked ack-eliciting packets (this is a new kind of frame to immediately elicit an ACK). This is recommended by the draft as a way to ensure congestion control and loss detection work properly.
I have a bunch of questions / comments:
- The implementation adds two timers: one to be notified when max_ack_delay is elapsed, and one to be notified when an RTT is elapsed. I tried to come up with an approach that wouldn't require additional timers, but I'm not sure it is possible. It would be nice if you could double-check that, given your knowledge of the codebase.
- There is a new reset_rtt_timer function that arms Timer::Rtt for the first time. I wanted to arm it right after we have reached the Data space and have an RTT estimate. I ended up calling it inside process_payload. Please let me know if there is a more suitable place.
- I rewrote the documentation of PendingAcks::acks_sent, which I found difficult to understand. Hopefully the new version says the same in clearer terms (otherwise let me know and I'll update it).