And... here are some benchmarks!
I'm running the server using cargo run --release --bin perf_server -- --no-protection
and the client using cargo run --release --bin perf_client -- --no-protection --duration 15
(note that this makes the benchmarks misleading for real-world use cases, because encryption is disabled). As you can deduce from the cargo commands used, I'm running both benchmarks on the same machine.
For the scenarios with ACK frequency enabled I'm doing so through the following code:
let mut ack_freq = AckFrequencyConfig::default();
ack_freq.ack_eliciting_threshold(10);
transport.ack_frequency_config(Some(ack_freq));
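For reference, here is a self-contained sketch of the same configuration as a hypothetical helper, assuming the quinn-proto API introduced in this PR (in particular, that the threshold parameter is a VarInt); the surrounding endpoint setup is omitted:
use std::sync::Arc;
use quinn_proto::{AckFrequencyConfig, TransportConfig, VarInt};

fn transport_with_ack_frequency() -> Arc<TransportConfig> {
    // Allow the peer to receive several ack-eliciting packets before it must send an ACK.
    let mut ack_freq = AckFrequencyConfig::default();
    ack_freq.ack_eliciting_threshold(VarInt::from_u32(10));

    let mut transport = TransportConfig::default();
    transport.ack_frequency_config(Some(ack_freq));
    Arc::new(transport)
}
The resulting Arc<TransportConfig> can then be installed on the client or server config as usual.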
There seems to be no significant performance difference in the benchmarks, though we probably need a more statistically rigorous approach to benchmarking if we want to be sure. @stormshield-damiend suggested we might see clearer differences in performance if there is higher latency (instead of it being on the order of microseconds), so I'll benchmark again next week with a more realistic latency.
│ Upload Duration │ Download Duration │ FBL │ Upload Throughput │ Download Throughput
──────┼─────────────────┼───────────────────┼────────────┼───────────────────┼────────────────────
AVG │ 862.00µs │ 811.00µs │ 50.00µs │ 1199.97 MiB/s │ 1293.80 MiB/s
P0 │ 661.00µs │ 534.00µs │ 1.00µs │ 32.34 MiB/s │ 34.34 MiB/s
P10 │ 741.00µs │ 678.00µs │ 2.00µs │ 986.50 MiB/s │ 1089.00 MiB/s
P50 │ 810.00µs │ 762.00µs │ 47.00µs │ 1235.00 MiB/s │ 1313.00 MiB/s
P90 │ 1.01ms │ 919.00µs │ 79.00µs │ 1350.00 MiB/s │ 1475.00 MiB/s
P100 │ 30.91ms │ 29.12ms │ 470.00µs │ 1513.00 MiB/s │ 1873.00 MiB/s
│ Upload Duration │ Download Duration │ FBL │ Upload Throughput │ Download Throughput
──────┼─────────────────┼───────────────────┼────────────┼───────────────────┼────────────────────
AVG │ 920.00µs │ 805.00µs │ 53.00µs │ 1190.95 MiB/s │ 1280.93 MiB/s
P0 │ 670.00µs │ 134.00µs │ 1.00µs │ 33.44 MiB/s │ 34.12 MiB/s
P10 │ 746.00µs │ 699.00µs │ 4.00µs │ 996.50 MiB/s │ 1113.00 MiB/s
P50 │ 819.00µs │ 774.00µs │ 49.00µs │ 1222.00 MiB/s │ 1292.00 MiB/s
P90 │ 1.00ms │ 899.00µs │ 82.00µs │ 1341.00 MiB/s │ 1431.00 MiB/s
P100 │ 29.92ms │ 29.30ms │ 847.00µs │ 1493.00 MiB/s │ 7464.00 MiB/s
│ Upload Duration │ Download Duration │ FBL │ Upload Throughput │ Download Throughput
──────┼─────────────────┼───────────────────┼────────────┼───────────────────┼────────────────────
AVG │ 866.00µs │ 767.00µs │ 48.00µs │ 1187.26 MiB/s │ 1479.18 MiB/s
P0 │ 686.00µs │ 70.00µs │ 1.00µs │ 33.25 MiB/s │ 32.25 MiB/s
P10 │ 753.00µs │ 680.00µs │ 3.00µs │ 1020.50 MiB/s │ 1198.00 MiB/s
P50 │ 821.00µs │ 749.00µs │ 46.00µs │ 1219.00 MiB/s │ 1336.00 MiB/s
P90 │ 980.00µs │ 835.00µs │ 77.00µs │ 1329.00 MiB/s │ 1471.00 MiB/s
P100 │ 30.08ms │ 30.99ms │ 421.00µs │ 1458.00 MiB/s │ 14288.00 MiB/s
│ Upload Duration │ Download Duration │ FBL │ Upload Throughput │ Download Throughput
──────┼─────────────────┼───────────────────┼────────────┼───────────────────┼────────────────────
AVG │ 890.00µs │ 844.00µs │ 51.00µs │ 1182.28 MiB/s │ 1202.34 MiB/s
P0 │ 684.00µs │ 122.00µs │ 1.00µs │ 34.59 MiB/s │ 100.25 MiB/s
P10 │ 766.00µs │ 753.00µs │ 2.00µs │ 1016.50 MiB/s │ 1064.00 MiB/s
P50 │ 826.00µs │ 828.00µs │ 49.00µs │ 1211.00 MiB/s │ 1208.00 MiB/s
P90 │ 984.00µs │ 940.00µs │ 86.00µs │ 1306.00 MiB/s │ 1329.00 MiB/s
P100 │ 28.90ms │ 9.98ms │ 427.00µs │ 1462.00 MiB/s │ 8200.00 MiB/s
How do you feel about the benchmark results? It'd be nice to be able to measure things in a more principled way, to ensure any performance difference (or lack thereof) is statistically significant. Maybe it'd make sense to simulate the networking part in memory (I have a semi-working local prototype), to make the output more predictable. I probably won't have the time to work on that, though...
Today I did another round of benchmarks using the changes from #1559, so I could simulate the link's delay (using bench instead of perf).
Command used: cargo run --release --bin bulk -- --no-protection --simulate-network --simulated-link-delay [delay] --download-size 200M --stats. Each run has its own value of delay, as you can see in the results below.
I ran the benchmarks on top of the main branch (equivalent to an ack-eliciting threshold of 0), and on top of this branch (using ack-eliciting thresholds of 1 and 10). In all cases throughput remained similar, so I'm only reporting it once per benchmarked link delay.
I assume throughput is being constrained by pacing. It would be interesting to tweak the config to start right away at a high transfer rate, because that would put real stress on Quinn and thereby reveal the true performance difference. I tried setting the initial congestion window to an absurdly high value, but somehow it didn't make any difference. Any clues on how to achieve higher throughput from the start? @Ralith, @djc
Another question: do these results make any sense at all? I have yet to develop a gut feeling for the relationship between delay and throughput.
Delay = 5ms
│ Throughput │ Duration
──────┼───────────────┼──────────
AVG │ 79.59 MiB/s │ 2.51s
P0 │ 79.56 MiB/s │ 2.51s
P10 │ 79.62 MiB/s │ 2.51s
P50 │ 79.62 MiB/s │ 2.51s
P90 │ 79.62 MiB/s │ 2.51s
P100 │ 79.62 MiB/s │ 2.51s
Delay = 10ms
│ Throughput │ Duration
──────┼───────────────┼──────────
AVG │ 46.83 MiB/s │ 4.27s
P0 │ 46.81 MiB/s │ 4.27s
P10 │ 46.84 MiB/s │ 4.27s
P50 │ 46.84 MiB/s │ 4.27s
P90 │ 46.84 MiB/s │ 4.27s
P100 │ 46.84 MiB/s │ 4.27s
Delay = 50ms
│ Throughput │ Duration
──────┼───────────────┼──────────
AVG │ 10.77 MiB/s │ 18.57s
P0 │ 10.77 MiB/s │ 18.56s
P10 │ 10.77 MiB/s │ 18.58s
P50 │ 10.77 MiB/s │ 18.58s
P90 │ 10.77 MiB/s │ 18.58s
P100 │ 10.77 MiB/s │ 18.58s
Delay = 200ms
│ Throughput │ Duration
──────┼───────────────┼──────────
AVG │ 2.78 MiB/s │ 72.03s
P0 │ 2.78 MiB/s │ 72.00s
P10 │ 2.78 MiB/s │ 72.06s
P50 │ 2.78 MiB/s │ 72.06s
P90 │ 2.78 MiB/s │ 72.06s
P100 │ 2.78 MiB/s │ 72.06s
I concur that testing on a clean loopback is probably going to make it difficult to judge the impact of this change.
> I assume throughput is being constrained by pacing
This should be easy to test by hacking Pacer::delay to always return None.
> do these results make any sense at all?
It's been a while since I've done manual WAN testing, but that seems much too slow.
I think we should expect throughput equal to one congestion window per RTT, and the congestion window should grow reasonably quickly unless there's packet loss, which should be impossible in the simulated context. The bandwidth you report seems almost exactly inversely proportional to round trip time, which suggests the congestion window is not growing. I wonder if Connection::app_limited is getting set spuriously? This is intended to prevent the congestion window from growing endlessly when sending less than the link's capacity. The logic is a bit fiddly (see e.g. https://github.com/quinn-rs/quinn/issues/1126).
It should be easy to verify whether the congestion window is growing by reporting PathStats::cwnd at intervals (or even just at start/end). A large initial congestion window really should have an obvious impact, though; even pacing is cwnd-based. When the initial window is set very high, and data is nonetheless not being sent, where does the Connection::poll_transmit call bail out?
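For what it's worth, a minimal sketch of the kind of periodic reporting meant here, assuming access to a high-level quinn::Connection handle in the benchmark binary (field names follow quinn's ConnectionStats/PathStats):
use std::time::Duration;

// Spawn next to the transfer, e.g. tokio::spawn(log_cwnd(connection.clone())).
async fn log_cwnd(conn: quinn::Connection) {
    let mut ticker = tokio::time::interval(Duration::from_millis(100));
    loop {
        ticker.tick().await;
        // Snapshot the path stats and print the congestion window and RTT.
        let path = conn.stats().path;
        println!("cwnd = {} bytes, rtt = {:?}", path.cwnd, path.rtt);
    }
}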
It might be interesting to contrast the simulated network behavior in bench with the behavior of perf using netem or similar to artificially introduce delay/jitter. This should tell us whether there's something specifically weird about the simulated network logic.
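For reference, artificially delaying loopback traffic with netem would look something like this (run as root; the 50ms delay and 5ms jitter values are arbitrary, and the second command removes the qdisc again after the benchmark):
sudo tc qdisc add dev lo root netem delay 50ms 5ms
sudo tc qdisc del dev lo root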
As it turns out, throughput was limited by the default send and receive windows. We were increasing the RTT, yet keeping the window sizes constant, so the same amount of traffic had to be spread over a longer period, leading to low throughput. The solution:
// Lift the flow-control limits so the windows no longer cap throughput at higher RTTs
config.stream_receive_window(VarInt::MAX);
config.send_window(u64::MAX);
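As a rough sanity check (assuming the simulated delay applies in each direction, so RTT ≈ 2 × delay), the earlier numbers are consistent with a roughly constant ~1 MiB window being drained once per RTT:
throughput × RTT ≈ window
10ms delay: 46.84 MiB/s × 0.02s ≈ 0.94 MiB
50ms delay: 10.77 MiB/s × 0.10s ≈ 1.08 MiB
200ms delay: 2.78 MiB/s × 0.40s ≈ 1.11 MiB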
I re-benchmarked this with the config above, and now we are seeing proper performance. For instance, with delay = 5 and ack-eliciting threshold = 10:
│ Throughput │ Duration
──────┼───────────────┼──────────
AVG │ 1160.82 MiB/s │ 885.00ms
P0 │ 980.50 MiB/s │ 808.00ms
P10 │ 981.00 MiB/s │ 808.00ms
P50 │ 1161.00 MiB/s │ 876.00ms
P90 │ 1250.00 MiB/s │ 904.00ms
P100 │ 1268.00 MiB/s │ 1.04s
Unfortunately, testing with different ack frequencies and delays seemed to make no difference with regard to performance (on my system).
@Ralith @djc all this time I've been working under the assumption that this PR only makes sense if performance improves (otherwise it introduces complexity to the codebase for no benefit). Or do you think it is worthwhile to have this anyway, as long as we can show that performance doesn't worsen?
By the way, if you are interested I can prepare a branch you can use to easily benchmark locally, using the loopback interface and introducing delays with tc, to ensure nothing fishy is going on with the simulated network code (unfortunately tc doesn't work on WSL, and trying it out in a cloud VM yielded weird results).
IIRC this draft was specifically introduced to improve performance, so if we can't reproduce that result there's likely some more digging to do. As such, as long as this doesn't introduce a (non-trivial) performance regression, I think it would still be okay to merge?
Just rebased on top of latest main, to resolve a few conflicts. Any chance this PR can be reviewed soon? I don't think there is anything left to do on my side (please let me know otherwise).
Will get this in the next day or two; thanks for your patience!
Thanks for the thorough review! Just rebased on top of latest main and pushed a new version that incorporates all suggestions (unless anything slipped through the cracks).
Just rebased and pushed changes to address some of your suggestions. For those that need more clarification, I left some comments and questions.
Thanks for the review! Just rebased and pushed fixes for all points we discussed 😉
Almost there!
Is there anything else I can do to help land this? Just FYI, after July 7th it will be pretty difficult for me to find time to work on this.
Thanks again for the review! Just rebased and pushed a new version with all your comments addressed.
Sorry for taking so long to review this, thanks for all the work!
No worries, thank you and @Ralith for creating and maintaining Quinn!
Overview of the feature:
- An AckFrequencyConfig to control the parameters, which are sent to the peer in an ACK_FREQUENCY frame at the beginning of the connection (technically, the extension allows modifying the ack frequency parameters multiple times during a connection, but then you would need to come up with an algorithm to automatically tune the parameters, which IMO is out of scope for this first implementation).
- Handling of the ACK_FREQUENCY frame, updating the local ack frequency parameters based on the new parameters sent by the peer.
- Sending an IMMEDIATE_ACK frame every RTT if there are any unacked ack-eliciting packets (this is a new kind of frame to immediately elicit an ACK). This is recommended by the draft as a way to ensure congestion control and loss detection work properly.
I have a bunch of questions / comments:
- The implementation adds two timers: one to be notified when max_ack_delay is elapsed, and one to be notified when an RTT is elapsed. I tried to come up with an approach that wouldn't require additional timers, but I'm not sure it is possible. It would be nice if you could double-check that, given your knowledge of the codebase.
- There is a new reset_rtt_timer function that arms Timer::Rtt for the first time. I wanted to arm it right after we have reached the Data space and have an RTT estimate. I ended up calling it inside process_payload. Please let me know if there is a more suitable place.
- I rewrote the documentation of PendingAcks::acks_sent, which I found difficult to understand. Hopefully the new version says the same in clearer terms (otherwise let me know and I'll update it).