Closed: stormshield-damiend closed this 1 year ago
I remember looking at this before and already converting it to a more gradual allocation approach, one that converges quickly on a single allocation but not immediately.
The change here is more aggressive, but it will also overallocate heavily for ack-only packets (and there are lots of those), even when there is no data to send.
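To make that trade-off concrete, here is a rough illustration (not quinn-proto's actual code; `segment_size`, `max_datagrams`, and the byte figures are made up for the example) of why an up-front full-batch allocation is wasteful when the packet turns out to be a bare ack:

```rust
// Illustrative only: an eager allocation sized for a full batch of datagrams
// pays the full cost even when the packet that gets written is a tiny ack.
fn eager_buffer(segment_size: usize, max_datagrams: usize) -> Vec<u8> {
    Vec::with_capacity(segment_size * max_datagrams)
}

fn main() {
    let buf = eager_buffer(1452, 10); // hypothetical segment size and batch
    let ack_only_len = 40; // a bare ACK packet is only a few dozen bytes
    println!(
        "capacity reserved: {} bytes, actually needed: {} bytes",
        buf.capacity(),
        ack_only_len
    );
}
```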
I would probably recommend not spending much more time on this at the moment. Once #1219 lands, the path for Quinn to own the complete IO event loop is pretty clear, and one could then go directly to completion-based IO (#915) with preallocated buffers, which avoids all the dynamic Transmit allocation entirely.
This was the previous change: https://github.com/quinn-rs/quinn/pull/1108
Reading that again, I think it already does what you are trying to achieve here: a single allocation, never resizing the `Vec`. I'm wondering why you observed any performance difference. The only case where #1108 wouldn't do the same thing is when no packet is produced at all (the code between `Vec::new()` and the resize detects that we can't write a packet), and in that case not allocating should be more efficient 🤔
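For reference, the pattern described above amounts to roughly the following (a minimal sketch, not the actual #1108 code; `can_write_packet` and `mtu` are stand-ins):

```rust
// Sketch of the lazy single-allocation pattern: the Vec starts empty, the
// early-exit checks run before any allocation, and the one real allocation
// happens only once we know a packet will actually be written.
fn poll_transmit_sketch(can_write_packet: bool, mtu: usize) -> Option<Vec<u8>> {
    let mut buf = Vec::new(); // no allocation yet
    if !can_write_packet {
        // Nothing to send (or congestion-blocked): the allocator was never
        // touched, which should be at least as cheap as pre-allocating.
        return None;
    }
    buf.reserve(mtu); // the single allocation; the Vec is not resized again
    // ... packet assembly would append into `buf` here ...
    Some(buf)
}
```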
Hi, I did a performance test using perf_client and perf_server on two VMs, with two interfaces connected through a dedicated virtual switch, upload only, and captured some flamegraphs on the client side. If you do that, you will see a lot of `Vec` allocation when pushing/appending data in `poll_transmit()` and its children. If you disable the packet cipher, it shows up even more prominently in the flamegraph.
Thanks for working on this!
I second @Matthias247's confusion as to why this makes a difference. In the existing code, buffer allocation is performed here: https://github.com/quinn-rs/quinn/blob/b80baa008d2eb0a585472772be449432a3c48fd8/quinn-proto/src/connection/mod.rs#L622-L624
Note that when `buf.capacity()` is zero, as it is initially, this should be requesting exactly the same amount of space. I wonder if what we're really seeing here is a difference in heuristics between `with_capacity` and `reserve`? Maybe `reserve_exact` would close the gap. Another possibility is that we have a bug that's leading to more incremental allocation elsewhere.
I'd like to be sure we understand the underlying mechanism before merging a change like this, as purely empirical optimizations are nearly impossible to maintain in the face of future refactoring.
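One quick way to test the heuristics hypothesis in isolation (illustrative only; the 1200-byte figure is just a stand-in for the real datagram size) is to compare the capacities the three strategies actually produce:

```rust
// If `reserve` on an empty Vec ends up requesting more or less than
// `with_capacity`, that would point at the growth heuristic rather than at
// the call-site change in this PR.
fn main() {
    let n = 1200;

    let a: Vec<u8> = Vec::with_capacity(n);

    let mut b: Vec<u8> = Vec::new();
    b.reserve(n); // amortized growth heuristic decides the final capacity

    let mut c: Vec<u8> = Vec::new();
    c.reserve_exact(n); // asks for exactly `n` more bytes (allocator may round up)

    println!(
        "with_capacity: {}, reserve: {}, reserve_exact: {}",
        a.capacity(),
        b.capacity(),
        c.capacity()
    );
}
```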
I will take a look later this week and see if I can find an explanation.
@stormshield-damiend @aochavia do you still want to explore this?
Not for now, I am busy with other things. By the way, I think it may be related to the no-encryption patch, since packets may be bigger than when encrypted; what do you think?
You can close the PR for now; I can submit another one later on.
This is a partial fix for #729. Doing the whole optimization as suggested in that issue will require more work, since the output vector is stored inside the Transmit { } struct; switching to a &[u8] will require adding lifetimes here and there. I started working on this, but it will take more time.
This simple optimization leads to a performance increase of around 3% with default crypto and around 15% without packet encryption and decryption (using the custom patch from PR #1436).
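For context on the `Transmit` remark above, a rough sketch of the two shapes (these are not the real quinn-proto definitions, just an illustration of where the lifetime would appear):

```rust
// If Transmit stops owning its contents and borrows a caller-provided buffer
// instead, the lifetime has to thread through every type and function that
// holds one, which is the extra work mentioned above.
use std::net::SocketAddr;

// Owning shape: the connection allocates and the struct owns the bytes.
pub struct OwnedTransmit {
    pub destination: SocketAddr,
    pub contents: Vec<u8>,
}

// Borrowed shape: no per-transmit allocation, but `'a` now appears on the
// struct, on poll_transmit's signature, and on anything storing a Transmit.
pub struct BorrowedTransmit<'a> {
    pub destination: SocketAddr,
    pub contents: &'a [u8],
}
```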