quinn-rs / quinn

Async-friendly QUIC implementation in Rust
Apache License 2.0

Preallocate output vector in poll_transmit(). #1438

Closed: stormshield-damiend closed this PR 1 year ago

stormshield-damiend commented 2 years ago

This is a partial fix for #729. Doing the whole optimization suggested in that issue will require more work, since the output vector is stored inside the Transmit struct; switching to a &[u8] would require adding lifetimes here and there. I have started working on this, but it will take more time.

This simple optimization led to a performance increase of around 3% with the default crypto and around 15% without packet encryption and decryption (using the custom patch from PR #1436).
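
A minimal sketch of the idea, for context (this is not the actual quinn code; `poll_transmit_sketch` and `max_datagram_size` are hypothetical stand-ins): reserve the buffer's full size once instead of letting repeated appends grow it.

```rust
// Sketch only: illustrates the preallocation idea, not quinn's real poll_transmit().
fn poll_transmit_sketch(max_datagram_size: usize) -> Vec<u8> {
    // Before: the buffer starts empty and reallocates as frames are appended.
    // let mut buf = Vec::new();

    // After: reserve the full datagram size up front, so appending frames
    // into the buffer never triggers a reallocation on the hot path.
    let mut buf = Vec::with_capacity(max_datagram_size);

    // ... frame/packet encoding would append into `buf` here ...
    buf
}
```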

Matthias247 commented 2 years ago

I remember looking at this before and converting it to a more gradual allocation approach, which converges quickly on a single allocation rather than making it immediately.

The change here is more aggressive, but it will also overallocate a lot for ACK-only packets (and there are lots of those), and even when there is no data to send.
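
A rough illustration of that overallocation concern, with made-up numbers (`max_datagram_size` and the ACK-only packet size are invented for the example):

```rust
// Made-up numbers, just to illustrate the wasted space for ACK-only packets.
fn main() {
    let max_datagram_size = 1200; // stand-in for the datagram size limit
    let ack_only_len = 30;        // hypothetical size of an ACK-only packet

    let mut buf = Vec::with_capacity(max_datagram_size);
    buf.extend(std::iter::repeat(0u8).take(ack_only_len));

    println!("used {} of {} reserved bytes", buf.len(), buf.capacity());
}
```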

I would probably recommend not spending much more time on this at the moment. Once #1219 lands, the path for Quinn to own the complete IO event loop is pretty clear. Then one could go directly towards completion-based IO (#915) with preallocated buffers, which avoids all the dynamic Transmit allocation entirely.

Matthias247 commented 2 years ago

This was the previous change: https://github.com/quinn-rs/quinn/pull/1108

Reading that again, I think it already does the same thing you are trying to achieve here: doing a single allocation and never resizing the Vec. I'm wondering why you observed any performance difference. The only case where #1108 wouldn't do the same thing is when no packet is produced at all (the code between Vec::new() and the resize detects that we can't write a packet), and in that case not allocating should be more efficient 🤔
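
As a sketch of the behaviour described above, under my own assumptions rather than the actual code from #1108:

```rust
// Sketch of the described behaviour, not the code from #1108.
fn single_allocation_sketch(can_write_packet: bool, max_datagram_size: usize) -> Vec<u8> {
    let mut buf = Vec::new();

    if !can_write_packet {
        // Nothing to send: Vec::new() does not allocate, so this path is free.
        return buf;
    }

    // A single reservation covering the whole datagram; later appends never resize.
    buf.reserve(max_datagram_size);
    // ... packet encoding appends into `buf` here ...
    buf
}
```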

stormshield-damiend commented 2 years ago

> This was the previous change: #1108
>
> Reading that again, I think it already does the same thing you are trying to achieve here: doing a single allocation and never resizing the Vec. I'm wondering why you observed any performance difference. The only case where #1108 wouldn't do the same thing is when no packet is produced at all (the code between Vec::new() and the resize detects that we can't write a packet), and in that case not allocating should be more efficient 🤔

Hi, I did a performance test using perf_client and perf_server on two VMs, with the two interfaces connected through a dedicated virtual switch, upload only, and captured some flamegraphs on the client side. If you do that, you will see a lot of Vec allocations when pushing/appending data in poll_transmit() and its children. If you disable the packet cipher you will see them even more clearly in the flamegraph.
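
One quick way to confirm that pattern outside of a flamegraph (illustrative only; the frame sizes below are invented) is to count how often the buffer's capacity changes while appending:

```rust
// Counts how often the buffer's capacity changes while data is appended.
// With an initially empty Vec this reports several reallocations; starting
// from Vec::with_capacity(1200) it reports none.
fn count_reallocations(frame_sizes: &[usize]) -> usize {
    let mut buf: Vec<u8> = Vec::new();
    let mut reallocations = 0;
    let mut last_capacity = buf.capacity();
    for &len in frame_sizes {
        buf.extend(std::iter::repeat(0u8).take(len));
        if buf.capacity() != last_capacity {
            reallocations += 1;
            last_capacity = buf.capacity();
        }
    }
    reallocations
}

fn main() {
    // Invented frame sizes, just to show the growth pattern.
    println!("{}", count_reallocations(&[50, 200, 400, 300, 250]));
}
```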

Ralith commented 2 years ago

Thanks for working on this!

I second @Matthias247's confusion as to why this makes a difference. In the existing code, buffer allocation is performed here: https://github.com/quinn-rs/quinn/blob/b80baa008d2eb0a585472772be449432a3c48fd8/quinn-proto/src/connection/mod.rs#L622-L624

Note that when buf.capacity() is zero, as it is initially, this should be requesting exactly the same amount of space. I wonder if what we're really seeing here is a difference in heuristics between with_capacity and reserve? Maybe reserve_exact would close the gap. Another possibility is that we have a bug that's leading to more incremental allocation elsewhere.
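
A small experiment around that question (a sketch only; the resulting capacities depend on the standard library version and allocator):

```rust
// Compares the capacity obtained via with_capacity, reserve, and reserve_exact
// for an initially empty Vec. All three are allowed to return more than requested.
fn main() {
    let n = 1200; // stand-in for the datagram size

    let a: Vec<u8> = Vec::with_capacity(n);

    let mut b: Vec<u8> = Vec::new();
    b.reserve(n); // may round up according to the growth heuristic

    let mut c: Vec<u8> = Vec::new();
    c.reserve_exact(n); // requests exactly n; the allocator may still give more

    println!("with_capacity: {}", a.capacity());
    println!("reserve:       {}", b.capacity());
    println!("reserve_exact: {}", c.capacity());
}
```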

I'd like to be sure we understand the underlying mechanism before merging a change like this, as purely empirical optimizations are nearly impossible to maintain in the face of future refactoring.

stormshield-damiend commented 2 years ago

> Thanks for working on this!
>
> I second @Matthias247's confusion as to why this makes a difference. In the existing code, buffer allocation is performed here:
>
> https://github.com/quinn-rs/quinn/blob/b80baa008d2eb0a585472772be449432a3c48fd8/quinn-proto/src/connection/mod.rs#L622-L624
>
> Note that when buf.capacity() is zero, as it is initially, this should be requesting exactly the same amount of space. I wonder if what we're really seeing here is a difference in heuristics between with_capacity and reserve? Maybe reserve_exact would close the gap. Another possibility is that we have a bug that's leading to more incremental allocation elsewhere.
>
> I'd like to be sure we understand the underlying mechanism before merging a change like this, as purely empirical optimizations are nearly impossible to maintain in the face of future refactoring.

I will take a look later this week and see if I can find an explanation.

djc commented 1 year ago

@stormshield-damiend @aochavia do you still want to explore this?

stormshield-damiend commented 1 year ago

> @stormshield-damiend @aochavia do you still want to explore this?

Not for now, I am busy with other things. By the way, I think it may be related to the no-encryption patch, since packets may be bigger than when encrypted. What do you think?

You can close the PR for now; I can submit another one later on.