Closed — lmpn closed this pull request 11 months ago.
I've not forgotten about the perf. I will share the comparison later 👍
I'd be surprised if there was a significant performance improvement from this alone, but it brings us much closer to being able to reuse transmit buffers (perhaps by doing UDP transmits directly from connection tasks), which is more likely to have a direct impact.
Still, always good to check for the impact of changes on the I/O path.
I've run `cargo bench` for main and this branch. Results below:
**poll_transmit_pre_alloc**

```
test large_data_10_streams  ... bench: 30,968,912 ns/iter (+/- 11,592,202) = 338 MB/s
test large_data_1_stream    ... bench: 26,580,878 ns/iter (+/- 20,358,295) = 39 MB/s
test small_data_100_streams ... bench: 14,363,537 ns/iter (+/- 17,569,828)
test small_data_1_stream    ... bench: 19,154,729 ns/iter (+/- 5,427,772)
```
**latest main**

```
test large_data_10_streams  ... bench: 30,603,466 ns/iter (+/- 11,396,640) = 342 MB/s
test large_data_1_stream    ... bench: 21,525,054 ns/iter (+/- 19,954,110) = 48 MB/s
test small_data_100_streams ... bench: 17,087,725 ns/iter (+/- 22,559,371)
test small_data_1_stream    ... bench: 19,227,774 ns/iter (+/- 5,343,771)
```
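One thing worth noting about these numbers: for the small-data tests the reported `+/-` spread is on the order of the mean itself, so single runs say very little. A minimal sanity check in plain Rust, using the `small_data_100_streams` figures quoted above:

```rust
fn main() {
    // Figures copied from the small_data_100_streams run above:
    // mean ns/iter and the +/- spread reported by `cargo bench`.
    let mean_ns: f64 = 14_363_537.0;
    let spread_ns: f64 = 17_569_828.0;

    // Relative spread: how large the noise band is compared to the mean.
    let relative_spread = spread_ns / mean_ns;
    println!("relative spread: {relative_spread:.2}");

    // A spread larger than the mean itself means a single run of this
    // benchmark cannot distinguish the two branches reliably.
    assert!(relative_spread > 1.0);
}
```

This is why averaging over multiple runs (as done later in the thread) is the right call here.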
If this is not enough, please give me some pointers and I will try to do an analysis that fits the project.
@djc I've decided to expose the MTU from the connection. If there is already an approach to retrieve or calculate such a value, tell me 🙏
So this shows as a ~18% regression for `small_data_100_streams`? That seems bad -- and surprising...
So I've executed `cargo bench` 10 times for main and this branch and compared the times (image below). There is indeed a regression (on average) for `large_data_1_stream` and `small_data_100_streams`, 3% and 6% respectively.
@lmpn I'm having some difficulty reading your chart screenshot. Are the 4 aggregated figures for each separate test, in order? So there's a 20% speedup in `small_data_1_stream`? That's confusing, since the original single run you posted indicated no change there, and a similar speedup in `small_data_100_streams`. That original run also indicates a very high variance, so maybe this data is too noisy to be meaningful anyway. I'll try to get some of my own numbers to compare.

We should be able to get identical performance to `main` by having the `quinn` layer create a fresh `BytesMut` for each `poll_transmit` call, right?
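To make the two strategies being compared concrete, here is a minimal sketch in plain Rust. It uses `Vec<u8>` as a stdlib stand-in for `BytesMut` (the real code uses the `bytes` crate), and `encode_packet` is a made-up placeholder for the packet-building work that `poll_transmit` does:

```rust
// Hypothetical stand-in for the packet-building work in poll_transmit.
fn encode_packet(buf: &mut Vec<u8>, payload: &[u8]) {
    buf.extend_from_slice(payload);
}

// Strategy 1: allocate a fresh buffer on every call (what creating a
// fresh BytesMut per poll_transmit call would look like).
fn transmit_fresh(payload: &[u8]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(1500); // one MTU-sized allocation per call
    encode_packet(&mut buf, payload);
    buf
}

// Strategy 2: reuse a caller-owned buffer across calls; clear() keeps
// the capacity, so the steady state performs no allocation at all.
fn transmit_reuse(buf: &mut Vec<u8>, payload: &[u8]) {
    buf.clear();
    encode_packet(buf, payload);
}

fn main() {
    let mut reused = Vec::with_capacity(1500);
    for _ in 0..3 {
        let fresh = transmit_fresh(b"hello");
        transmit_reuse(&mut reused, b"hello");
        assert_eq!(fresh, reused); // both strategies produce identical output
    }
    // Capacity survived the reuse loop: no reallocation happened.
    assert!(reused.capacity() >= 1500);
}
```

The output is identical either way; the only difference is where the allocation happens, which is exactly what the benchmarks above are probing.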
Sorry for the bad image. Check the one below and tell me if it is clearer. The speedups are calculated as `(1 - main_time / poll_transmit_pre_alloc_time) * 100`.

The single run showed an 18% regression for `small_data_100_streams`. I assumed this was because there was too much noise / a bad run, and because I'm running on my own computer, so I took 10 runs of each branch.
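For reference, that formula written out in plain Rust, applied to hypothetical timings (10 ms vs 12 ms, not the measured ones). Note the sign convention: a branch that is faster than main yields a negative value, since `main_time / branch_time` is then greater than 1:

```rust
// Speedup as defined above:
// (1 - main_time / poll_transmit_pre_alloc_time) * 100
fn speedup_percent(main_ns: f64, branch_ns: f64) -> f64 {
    (1.0 - main_ns / branch_ns) * 100.0
}

fn main() {
    // Hypothetical timings: the branch takes 10 ms per iteration
    // where main takes 12 ms, i.e. the branch is faster.
    let s = speedup_percent(12_000_000.0, 10_000_000.0);
    // Faster branch -> main/branch > 1 -> negative result (-20% here).
    assert!((s + 20.0).abs() < 1e-9);
    println!("speedup: {s:.1}%");
}
```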
As you can see above, I calculated the speedup for each benchmark. Even with noise, these values show:

- `small_data_1_stream` has improved in every scenario
- for `large_data_1_stream` / `small_data_100_streams` / `large_data_10_streams` the difference seems negligible

> We should be able to get identical performance to `main` by having the `quinn` layer create a fresh `BytesMut` for each `poll_transmit` call, right?

Yes, I would assume so.
Thanks for the detailed analysis! To summarize: apparent massive speedup for `small_data_1_stream`, maybe a slight slowdown (or just noise) elsewhere. Perhaps reusing a single `BytesMut` is letting a single allocation serve multiple small packets.
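That "one allocation serving many small packets" effect can be sketched in plain Rust. Again `Vec<u8>` stands in for `BytesMut`, and the packet count and sizes are made up:

```rust
fn main() {
    // One up-front allocation, reused for a whole batch of small packets.
    let mut buf: Vec<u8> = Vec::with_capacity(4096);
    let initial_capacity = buf.capacity();

    for i in 0..100u8 {
        buf.clear(); // drops the contents but keeps the allocation
        // Hypothetical small packet: well under the buffer's capacity.
        buf.extend_from_slice(&[i; 64]);
        assert_eq!(buf.len(), 64);
    }

    // The capacity never grew, so the loop performed zero further
    // allocations: one allocation served all 100 small packets.
    assert_eq!(buf.capacity(), initial_capacity);
}
```

With a fresh buffer per call, each of those 100 packets would instead pay for its own allocation, which is where the small-data benchmarks would feel it most.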
I think this nets out as a win as written, personally. I've experimented on a couple of my machines (both Windows and Linux) and wasn't able to show a significant difference either way.
@lmpn was this motivated by Hacktoberfest, and if so, is there a merge deadline to make this count for your stats?
No... I just have interest in networking and performance programming.
Above are the results of the benchmark with the latest commit on this branch. If someone can double-check on their side, that would be great.
My environment still registers too much noise to get meaningful data, but it's exciting that it had such an impact in yours! Maybe the impact is larger on Windows, where GSO isn't doing as much heavy lifting.
I've used `BytesMut` for convenience, but I can change it to `&mut [u8]`.
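For what it's worth, the `&mut [u8]` variant would mostly be a signature change: instead of returning an owned buffer, the function fills a caller-provided slice and reports the length. A sketch of the two shapes, with `write_datagram_*` as made-up names and `Vec<u8>` again standing in for `BytesMut`:

```rust
// Owned-buffer shape (BytesMut-like, modeled here with Vec<u8>):
// the callee allocates and hands ownership back.
fn write_datagram_owned(payload: &[u8]) -> Vec<u8> {
    payload.to_vec()
}

// Slice shape: the caller owns the storage; the callee returns how
// many bytes it wrote, or None if the slice is too small.
fn write_datagram_into(out: &mut [u8], payload: &[u8]) -> Option<usize> {
    if out.len() < payload.len() {
        return None;
    }
    out[..payload.len()].copy_from_slice(payload);
    Some(payload.len())
}

fn main() {
    let owned = write_datagram_owned(b"ping");
    assert_eq!(owned, b"ping".to_vec());

    let mut storage = [0u8; 1500]; // e.g. one MTU-sized caller buffer
    let n = write_datagram_into(&mut storage, b"ping").unwrap();
    assert_eq!(&storage[..n], b"ping");
}
```

The slice shape forces the caller to decide the allocation strategy, which is what makes buffer reuse possible at the `quinn` layer; the owned shape is more convenient but bakes the allocation into the callee.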