Closed — lmpn closed this pull request 11 months ago.
I've not forgotten about the perf. I will share the comparison later 👍
I'd be surprised if there was a significant performance improvement from this alone, but it brings us much closer to being able to reuse transmit buffers (perhaps by doing UDP transmits directly from connection tasks), which is more likely to have a direct impact.
Still, always good to check for the impact of changes on the I/O path.
I've run `cargo bench` for main and this branch. Results below:
**poll_transmit_pre_alloc**

```
test large_data_10_streams  ... bench: 30,968,912 ns/iter (+/- 11,592,202) = 338 MB/s
test large_data_1_stream    ... bench: 26,580,878 ns/iter (+/- 20,358,295) = 39 MB/s
test small_data_100_streams ... bench: 14,363,537 ns/iter (+/- 17,569,828)
test small_data_1_stream    ... bench: 19,154,729 ns/iter (+/- 5,427,772)
```
**latest main**

```
test large_data_10_streams  ... bench: 30,603,466 ns/iter (+/- 11,396,640) = 342 MB/s
test large_data_1_stream    ... bench: 21,525,054 ns/iter (+/- 19,954,110) = 48 MB/s
test small_data_100_streams ... bench: 17,087,725 ns/iter (+/- 22,559,371)
test small_data_1_stream    ... bench: 19,227,774 ns/iter (+/- 5,343,771)
```
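One thing worth noting about these numbers: for the small-data tests the reported `+/-` spread is on the order of the mean itself, so single runs say very little. A minimal sanity check in plain Rust, using the `small_data_100_streams` figures quoted above:

```rust
fn main() {
    // Figures copied from the small_data_100_streams run above:
    // mean ns/iter and the +/- spread reported by `cargo bench`.
    let mean_ns: f64 = 14_363_537.0;
    let spread_ns: f64 = 17_569_828.0;

    // Relative spread: how large the noise band is compared to the mean.
    let relative_spread = spread_ns / mean_ns;
    println!("relative spread: {relative_spread:.2}");

    // A spread larger than the mean itself means a single run of this
    // benchmark cannot distinguish the two branches reliably.
    assert!(relative_spread > 1.0);
}
```

This is why averaging over multiple runs (as done later in the thread) is the right call here.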
If this is not enough, please give me some pointers and I will try to do an analysis that fits the project.
@djc I've decided to expose the MTU from the connection. If there is already an approach to retrieve or calculate such a value, tell me 🙏
So this shows as a ~18% regression for `small_data_100_streams`? That seems bad -- and surprising...
So I've executed `cargo bench` 10 times for main and this branch and compared the times (image below). There is indeed a regression (on average) for `large_data_1_stream` and `small_data_100_streams`, 3% and 6% respectively.
@lmpn I'm having some difficulty reading your chart screenshot. Are the 4 aggregated figures for each separate test, in order? So there's a 20% speedup in `small_data_1_stream`? That's confusing, since the original single run you posted indicated no change there, and a similar speedup in `small_data_100_streams`. That original run also indicates a very high variance, so maybe this data is too noisy to be meaningful anyway. I'll try to get some of my own numbers to compare.

We should be able to get identical performance to `main` by having the `quinn` layer create a fresh `BytesMut` for each `poll_transmit` call, right?
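To make the two strategies being compared concrete, here is a minimal sketch in plain Rust. It uses `Vec<u8>` as a stdlib stand-in for `BytesMut` (the real code uses the `bytes` crate), and `encode_packet` is a made-up placeholder for the packet-building work that `poll_transmit` does:

```rust
// Hypothetical stand-in for the packet-building work in poll_transmit.
fn encode_packet(buf: &mut Vec<u8>, payload: &[u8]) {
    buf.extend_from_slice(payload);
}

// Strategy 1: allocate a fresh buffer on every call (what creating a
// fresh BytesMut per poll_transmit call would look like).
fn transmit_fresh(payload: &[u8]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(1500); // one MTU-sized allocation per call
    encode_packet(&mut buf, payload);
    buf
}

// Strategy 2: reuse a caller-owned buffer across calls; clear() keeps
// the capacity, so the steady state performs no allocation at all.
fn transmit_reuse(buf: &mut Vec<u8>, payload: &[u8]) {
    buf.clear();
    encode_packet(buf, payload);
}

fn main() {
    let mut reused = Vec::with_capacity(1500);
    for _ in 0..3 {
        let fresh = transmit_fresh(b"hello");
        transmit_reuse(&mut reused, b"hello");
        assert_eq!(fresh, reused); // both strategies produce identical output
    }
    // Capacity survived the reuse loop: no reallocation happened.
    assert!(reused.capacity() >= 1500);
}
```

The output is identical either way; the only difference is where the allocation happens, which is exactly what the benchmarks above are probing.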
Sorry for the bad image. Check the one below and tell me if it is clearer. The speedups are calculated as `(1 - main_time / poll_transmit_pre_alloc_time) * 100`.

The single run showed an 18% regression for `small_data_100_streams`. I assumed this was because there was too much noise / a bad run, and because I'm running on my own computer, so I took 10 runs of each branch.
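For reference, that formula written out in plain Rust, applied to hypothetical timings (10 ms vs 12 ms, not the measured ones). Note the sign convention: a branch that is faster than main yields a negative value, since `main_time / branch_time` is then greater than 1:

```rust
// Speedup as defined above:
// (1 - main_time / poll_transmit_pre_alloc_time) * 100
fn speedup_percent(main_ns: f64, branch_ns: f64) -> f64 {
    (1.0 - main_ns / branch_ns) * 100.0
}

fn main() {
    // Hypothetical timings: the branch takes 10 ms per iteration
    // where main takes 12 ms, i.e. the branch is faster.
    let s = speedup_percent(12_000_000.0, 10_000_000.0);
    // Faster branch -> main/branch > 1 -> negative result (-20% here).
    assert!((s + 20.0).abs() < 1e-9);
    println!("speedup: {s:.1}%");
}
```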
As you can see above, I calculated the speedup for each benchmark. Even with noise, these values show:

- `small_data_1_stream` has improved in every scenario
- for `large_data_1_stream` / `small_data_100_streams` / `large_data_10_streams` the difference seems negligible

> We should be able to get identical performance to `main` by having the `quinn` layer create a fresh `BytesMut` for each `poll_transmit` call, right?

Yes, I would assume so.
Thanks for the detailed analysis! To summarize: apparent massive speedup for `small_data_1_stream`, maybe a slight slowdown (or just noise) elsewhere. Perhaps reusing a single `BytesMut` is letting a single allocation serve multiple small packets.
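That "one allocation serving many small packets" effect can be sketched in plain Rust. Again `Vec<u8>` stands in for `BytesMut`, and the packet count and sizes are made up:

```rust
fn main() {
    // One up-front allocation, reused for a whole batch of small packets.
    let mut buf: Vec<u8> = Vec::with_capacity(4096);
    let initial_capacity = buf.capacity();

    for i in 0..100u8 {
        buf.clear(); // drops the contents but keeps the allocation
        // Hypothetical small packet: well under the buffer's capacity.
        buf.extend_from_slice(&[i; 64]);
        assert_eq!(buf.len(), 64);
    }

    // The capacity never grew, so the loop performed zero further
    // allocations: one allocation served all 100 small packets.
    assert_eq!(buf.capacity(), initial_capacity);
}
```

With a fresh buffer per call, each of those 100 packets would instead pay for its own allocation, which is where the small-data benchmarks would feel it most.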
I think this nets out as a win as written, personally. I've experimented on a couple of my machines (both Windows and Linux) and wasn't able to show a significant difference either way.
@lmpn was this motivated by Hacktoberfest, and if so, is there a merge deadline to make this count for your stats?
No... I just have interest in networking and performance programming.
Above are the results of the benchmark with the latest commit on this branch. If someone can double-check on their side, that would be great.
My environment still registers too much noise to get meaningful data, but it's exciting that it had such an impact in yours! Maybe the impact is larger on Windows, where GSO isn't doing as much heavy lifting.
I've used `BytesMut` for convenience, but I can change it to `&mut [u8]`.
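For what it's worth, the `&mut [u8]` variant would mostly be a signature change: instead of returning an owned buffer, the function fills a caller-provided slice and reports the length. A sketch of the two shapes, with `write_datagram_*` as made-up names and `Vec<u8>` again standing in for `BytesMut`:

```rust
// Owned-buffer shape (BytesMut-like, modeled here with Vec<u8>):
// the callee allocates and hands ownership back.
fn write_datagram_owned(payload: &[u8]) -> Vec<u8> {
    payload.to_vec()
}

// Slice shape: the caller owns the storage; the callee returns how
// many bytes it wrote, or None if the slice is too small.
fn write_datagram_into(out: &mut [u8], payload: &[u8]) -> Option<usize> {
    if out.len() < payload.len() {
        return None;
    }
    out[..payload.len()].copy_from_slice(payload);
    Some(payload.len())
}

fn main() {
    let owned = write_datagram_owned(b"ping");
    assert_eq!(owned, b"ping".to_vec());

    let mut storage = [0u8; 1500]; // e.g. one MTU-sized caller buffer
    let n = write_datagram_into(&mut storage, b"ping").unwrap();
    assert_eq!(&storage[..n], b"ping");
}
```

The slice shape forces the caller to decide the allocation strategy, which is what makes buffer reuse possible at the `quinn` layer; the owned shape is more convenient but bakes the allocation into the callee.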