perf: don't allocate in UDP send & recv path

mxinden commented 2 months ago

Description

This change is best summarized by the process function signature.

On main branch the process function looks as such:

pub fn process(&mut self, dgram: Option<&Datagram>, now: Instant) -> Output {

It takes as input an optional reference to a Datagram. That Datagram owns an allocation of the UDP payload, i.e. a Vec<u8>. Thus for each incoming UDP datagram, its payload is allocated in a new Vec.
It returns as output an owned Output. Most relevantly the Output variant Output::Datagram(Datagram) contains a Datagram that again owns an allocation of the UDP payload, i.e. a Vec<u8>. Thus for each outgoing UDP datagram too, its payload is allocated in a new Vec.

This commit changes the process function to:

pub fn process_into_buffer<'a>(
    &mut self,
    input: Option<Datagram<&[u8]>>,
    now: Instant,
    write_buffer: &'a mut Vec<u8>,
) -> Output<&'a [u8]> {

It takes as input an optional Datagram<&[u8]>. But contrary to before, Datagram<&[u8]> does not own an allocation of the UDP payload, but represents a view into a long-lived receive buffer containing the UDP payload.
It returns as output an Output<&'a [u8]> where the Output::Datagram(Datagram<&'a [u8]>) variant does not own an allocation of the UDP payload, but here as well represents a view into a long-lived write buffer the payload is written into. That write buffer lives outside of neqo_transport::Connection and is provided to process_into_buffer as write_buffer: &'a mut Vec<u8>. Note that both write_buffer and Output use the lifetime 'a, i.e. the latter is a view into the former.

This change to the process function enables the following:

A user of neqo_transport (e.g. neqo_bin) has the OS write incoming UDP datagrams into a long-lived receive buffer (via e.g. recvmmsg).
They pass that receive buffer to neqo_transport::Connection::process_into_buffer along with a long-lived write buffer.
process_into_buffer reads the UDP datagram from the long-lived receive buffer through the Datagram<&[u8]> view and writes outgoing datagrams into the provided long-lived write_buffer, returning a view into said buffer via a Datagram<&'a [u8]>.
The user, after having called process_into_buffer can then pass the write buffer to the OS (e.g. via sendmsg).

To summarize a user can receive and send UDP datagrams, without allocation in the UDP IO path.

As an aside, the above is compatible with GSO and GRO, where a send and receive buffer contains a consecutive number of UDP datagram segments.

Performance impact

Early benchmarks are promising, showing e.g. a 10% improvement in the Download benchmark, and up to 40% improvement in the neqo-neqo-reno-pacing benchmark.

This pull request

1-conn/1-100mb-resp (aka. Download)/client: :green_heart: Performance has improved.
       time:   [102.43 ms 102.62 ms 102.81 ms]
       thrpt:  [972.68 MiB/s 974.46 MiB/s 976.24 MiB/s]
change:
       time:   [-9.6439% -9.2860% -8.9313%] (p = 0.00 < 0.05)
       thrpt:  [+9.8072% +10.237% +10.673%]
Client/server transfer results

Transfer of 33554432 bytes over loopback. Client Server CC Pacing Mean [ms] Min [ms] Max [ms] Relative

neqo neqo reno on 152.5 ± 87.8 95.6 365.6 1.00

neqo neqo reno 141.6 ± 67.7 94.9 326.1 1.00

neqo neqo cubic on 170.4 ± 121.9 94.6 622.5 1.00

neqo neqo cubic 131.4 ± 48.4 95.6 298.7 1.00

Transfer of 33554432 bytes over loopback.	Client	Server	CC	Pacing	Mean [ms]	Min [ms]	Max [ms]	Relative
neqo	neqo	reno	on	152.5 ± 87.8	95.6	365.6	1.00
neqo	neqo	reno		141.6 ± 67.7	94.9	326.1	1.00
neqo	neqo	cubic	on	170.4 ± 121.9	94.6	622.5	1.00
neqo	neqo	cubic		131.4 ± 48.4	95.6	298.7	1.00

Current `main` for comparison

1-conn/1-100mb-resp (aka. Download)/client: :green_heart: Performance has improved.
       time:   [112.72 ms 113.13 ms 113.51 ms]
       thrpt:  [880.95 MiB/s 883.97 MiB/s 887.14 MiB/s]
change:
       time:   [-2.1601% -1.6758% -1.1570%] (p = 0.00 < 0.05)
       thrpt:  [+1.1705% +1.7044% +2.2078%]

Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) low mild
Client/server transfer results

Transfer of 33554432 bytes over loopback. Client Server CC Pacing Mean [ms] Min [ms] Max [ms] Relative

neqo neqo reno on 260.8 ± 159.4 127.4 691.6 1.00

neqo neqo reno 221.2 ± 90.7 139.6 432.0 1.00

neqo neqo cubic on 214.5 ± 87.0 125.0 375.0 1.00

neqo neqo cubic 236.2 ± 118.1 136.8 540.0 1.00

Transfer of 33554432 bytes over loopback.	Client	Server	CC	Pacing	Mean [ms]	Min [ms]	Max [ms]	Relative
neqo	neqo	reno	on	260.8 ± 159.4	127.4	691.6	1.00
neqo	neqo	reno		221.2 ± 90.7	139.6	432.0	1.00
neqo	neqo	cubic	on	214.5 ± 87.0	125.0	375.0	1.00
neqo	neqo	cubic		236.2 ± 118.1	136.8	540.0	1.00

https://github.com/mozilla/neqo/actions/runs/10850817785

Pull request status

This pull request is ready for review.

Replaces https://github.com/mozilla/neqo/pull/2076. Part of https://github.com/mozilla/neqo/issues/1693. Closes https://github.com/mozilla/neqo/issues/1922.

github-actions[bot] commented 2 months ago

Firefox builds for this PR

The following builds are available for testing. Crossed-out builds did not succeed.

Linux: [Debug]() [Release]()
macOS: [Debug]() [Release]()
Windows: [Debug]() [Release]()

github-actions[bot] commented 2 months ago

Failed Interop Tests

QUIC Interop Runner, client vs. server

neqo-latest as client

neqo-latest vs. lsquic: L1
neqo-latest vs. msquic: A
neqo-latest vs. mvfst: A L1 C1
neqo-latest vs. picoquic: L1
neqo-latest vs. quic-go: C1
neqo-latest vs. quinn: C1
neqo-latest vs. xquic: A
neqo-latest as server
lsquic vs. neqo-latest: L1
msquic vs. neqo-latest: U
mvfst vs. neqo-latest: Z A L1 C1
quinn vs. neqo-latest: V2
xquic vs. neqo-latest: M
All results

Succeeded Interop Tests

QUIC Interop Runner, client vs. server

neqo-latest as client

neqo-latest vs. aioquic: H DC LR C20 M S R Z 3 B U A L1 L2 C1 C2 6 V2
neqo-latest vs. go-x-net: H DC LR M B U A L2 C2 6
neqo-latest vs. haproxy: H DC LR C20 M S R Z 3 B U A L1 L2 C1 C2 6 V2
neqo-latest vs. kwik: H DC LR C20 M S R Z 3 B U A L1 L2 C1 C2 6 V2
neqo-latest vs. lsquic: H DC LR C20 M S R Z 3 B U E A L2 C1 C2 6 V2
neqo-latest vs. msquic: H DC LR C20 M S R Z B U L1 L2 C1 C2 6 V2
neqo-latest vs. mvfst: H DC LR M R Z 3 B U L2 C2 6
neqo-latest vs. neqo: H DC LR C20 M S R Z 3 B U E A L1 L2 C1 C2 6 V2
neqo-latest vs. neqo-latest: H DC LR C20 M S R Z 3 B U E A L1 L2 C1 C2 6 V2
neqo-latest vs. nginx: H DC LR C20 M S R Z 3 B U A L1 L2 C1 C2 6
neqo-latest vs. ngtcp2: H DC LR C20 M S R Z 3 B U E A L1 L2 C1 C2 6 V2
neqo-latest vs. picoquic: H DC LR C20 M S R Z 3 B U E A L2 C1 C2 6 V2
neqo-latest vs. quic-go: H DC LR C20 M S R Z 3 B U A L1 L2 C2 6
neqo-latest vs. quiche: H DC LR C20 M S R Z 3 B U A L1 L2 C1 C2 6
neqo-latest vs. quinn: H DC LR C20 M S R Z 3 B U E A L1 L2 C2 6
neqo-latest vs. s2n-quic: H DC LR C20 M S R 3 B U E A L1 L2 C1 C2 6
neqo-latest vs. xquic: H DC LR C20 M R Z 3 B U L1 L2 C1 C2 6
neqo-latest as server
aioquic vs. neqo-latest: H DC LR C20 M S R Z 3 B A L1 L2 C1 C2 6 V2
chrome vs. neqo-latest: 3
go-x-net vs. neqo-latest: H DC LR M B U A L2 C2 6
kwik vs. neqo-latest: H DC LR C20 M S R Z 3 B U A L1 L2 C1 C2 6 V2
lsquic vs. neqo-latest: H DC LR M S R 3 B E A L2 C1 C2 6 V2
msquic vs. neqo-latest: H DC LR C20 M S R Z B A L1 L2 C1 C2 6 V2
mvfst vs. neqo-latest: H DC LR M 3 B L2 C2 6
neqo vs. neqo-latest: H DC LR C20 M S R Z 3 B U E A L1 L2 C1 C2 6 V2
ngtcp2 vs. neqo-latest: H DC LR C20 M S R Z 3 B U E A L1 L2 C1 C2 6 V2
picoquic vs. neqo-latest: H DC LR C20 M S R Z 3 B U E A L1 L2 C1 C2 6 V2
quic-go vs. neqo-latest: H DC LR C20 M S R Z 3 B U A L1 L2 C1 C2 6
quiche vs. neqo-latest: H DC LR M S R Z 3 B A L1 L2 C1 C2 6
quinn vs. neqo-latest: H DC LR C20 M S R Z 3 B U E A L1 L2 C1 C2 6
s2n-quic vs. neqo-latest: H DC LR M S R 3 B E A L1 L2 C1 C2 6
xquic vs. neqo-latest: H DC LR C20 S R Z 3 B U A L1 L2 C1 C2 6
Unsupported Interop Tests

QUIC Interop Runner, client vs. server

neqo-latest as client
neqo-latest vs. aioquic: E
neqo-latest vs. go-x-net: C20 S R Z 3 E L1 C1 V2
neqo-latest vs. haproxy: E
neqo-latest vs. kwik: E
neqo-latest vs. msquic: 3 E
neqo-latest vs. mvfst: C20 S E V2
neqo-latest vs. nginx: E V2
neqo-latest vs. quic-go: E V2
neqo-latest vs. quiche: E V2
neqo-latest vs. quinn: V2
neqo-latest vs. s2n-quic: Z V2
neqo-latest vs. xquic: S E V2
neqo-latest as server
aioquic vs. neqo-latest: U E
chrome vs. neqo-latest: H DC LR C20 M S R Z B U E A L1 L2 C1 C2 6 V2
go-x-net vs. neqo-latest: C20 S R Z 3 E L1 C1 V2
kwik vs. neqo-latest: E
lsquic vs. neqo-latest: C20 Z U
msquic vs. neqo-latest: 3 E
mvfst vs. neqo-latest: C20 S R U E V2
quic-go vs. neqo-latest: E V2
quiche vs. neqo-latest: C20 U E V2
s2n-quic vs. neqo-latest: C20 Z U V2
xquic vs. neqo-latest: E V2

codecov[bot] commented 2 months ago

Codecov Report

Attention: Patch coverage is 98.48837% with 13 lines in your changes missing coverage. Please review.

Project coverage is 95.37%. Comparing base (eb92e43) to head (f866a2a).

Files with missing lines	Patch %	Lines
neqo-transport/src/connection/mod.rs	96.59%	6 Missing :warning:
neqo-transport/src/server.rs	94.91%	3 Missing :warning:
neqo-http3/src/server.rs	93.10%	2 Missing :warning:
neqo-transport/src/packet/mod.rs	98.86%	1 Missing :warning:
neqo-udp/src/lib.rs	98.38%	1 Missing :warning:

Additional details and impacted files

```diff @@ Coverage Diff @@ ## main #2093 +/- ## ========================================== + Coverage 95.35% 95.37% +0.01% ========================================== Files 112 112 Lines 36357 36715 +358 ========================================== + Hits 34669 35016 +347 - Misses 1688 1699 +11 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

github-actions[bot] commented 2 months ago

Benchmark results

Performance differences relative to 55e3a9363c28632dfb29ce91c7712cab1f6a58da.

coalesce_acked_from_zero 1+1 entries: :green_heart: Performance has improved.

       time:   [98.877 ns 99.204 ns 99.534 ns]
       change: [-12.567% -11.944% -11.215%] (p = 0.00 < 0.05)

Found 14 outliers among 100 measurements (14.00%)
  10 (10.00%) high mild
  4 (4.00%) high severe

coalesce_acked_from_zero 3+1 entries: :green_heart: Performance has improved.

       time:   [116.94 ns 117.71 ns 118.88 ns]
       change: [-33.464% -33.143% -32.768%] (p = 0.00 < 0.05)

Found 17 outliers among 100 measurements (17.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  12 (12.00%) high severe

coalesce_acked_from_zero 10+1 entries: :green_heart: Performance has improved.

       time:   [116.55 ns 117.02 ns 117.57 ns]
       change: [-39.686% -35.379% -32.759%] (p = 0.00 < 0.05)

Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  2 (2.00%) high mild
  9 (9.00%) high severe

coalesce_acked_from_zero 1000+1 entries: :green_heart: Performance has improved.

       time:   [97.876 ns 98.011 ns 98.159 ns]
       change: [-31.803% -31.179% -30.566%] (p = 0.00 < 0.05)

Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe

RxStreamOrderer::inbound_frame(): Change within noise threshold.

       time:   [111.73 ms 111.78 ms 111.83 ms]
       change: [+0.2898% +0.3546% +0.4219%] (p = 0.00 < 0.05)

Found 11 outliers among 100 measurements (11.00%)
  6 (6.00%) low mild
  5 (5.00%) high mild

transfer/pacing-false/varying-seeds: No change in performance detected.

       time:   [26.340 ms 27.597 ms 28.883 ms]
       change: [-8.9450% -3.4441% +2.3427%] (p = 0.25 > 0.05)

Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

transfer/pacing-true/varying-seeds: No change in performance detected.

       time:   [34.417 ms 36.080 ms 37.798 ms]
       change: [-11.452% -5.8066% +0.6803%] (p = 0.07 > 0.05)

Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

transfer/pacing-false/same-seed: No change in performance detected.

       time:   [26.051 ms 26.928 ms 27.833 ms]
       change: [-5.5948% -1.5392% +3.1644%] (p = 0.50 > 0.05)

Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

transfer/pacing-true/same-seed: No change in performance detected.

       time:   [43.023 ms 45.160 ms 47.311 ms]
       change: [-3.2363% +3.2694% +10.031%] (p = 0.33 > 0.05)

Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

1-conn/1-100mb-resp (aka. Download)/client: :green_heart: Performance has improved.

       time:   [104.12 ms 104.36 ms 104.60 ms]
       thrpt:  [956.00 MiB/s 958.22 MiB/s 960.45 MiB/s]
change:
       time:   [-10.280% -9.9749% -9.6778%] (p = 0.00 < 0.05)
       thrpt:  [+10.715% +11.080% +11.458%]

1-conn/10_000-parallel-1b-resp (aka. RPS)/client: :broken_heart: Performance has regressed.

       time:   [326.60 ms 329.71 ms 332.82 ms]
       thrpt:  [30.047 Kelem/s 30.330 Kelem/s 30.619 Kelem/s]
change:
       time:   [+1.3183% +2.9210% +4.4778%] (p = 0.00 < 0.05)
       thrpt:  [-4.2859% -2.8381% -1.3012%]

Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild

1-conn/1-1b-resp (aka. HPS)/client: :broken_heart: Performance has regressed.

       time:   [34.715 ms 34.913 ms 35.130 ms]
       thrpt:  [28.465  elem/s 28.643  elem/s 28.806  elem/s]
change:
       time:   [+2.6418% +3.4761% +4.2420%] (p = 0.00 < 0.05)
       thrpt:  [-4.0693% -3.3593% -2.5738%]

Found 13 outliers among 100 measurements (13.00%)
  6 (6.00%) low mild
  7 (7.00%) high severe

Client/server transfer results

Transfer of 33554432 bytes over loopback.	Client	Server	CC	Pacing	Mean [ms]	Min [ms]	Max [ms]
msquic	msquic			220.4 ± 139.4	101.6	645.6	1.00
neqo	msquic	reno	on	279.0 ± 120.2	207.5	593.1	1.00
neqo	msquic	reno		281.0 ± 122.0	204.7	613.9	1.00
neqo	msquic	cubic	on	260.9 ± 78.4	206.9	456.7	1.00
neqo	msquic	cubic		215.8 ± 17.0	193.7	244.4	1.00
msquic	neqo	reno	on	120.7 ± 84.2	80.8	363.3	1.00
msquic	neqo	reno		90.4 ± 20.2	79.8	176.8	1.00
msquic	neqo	cubic	on	96.3 ± 26.5	82.7	218.0	1.00
msquic	neqo	cubic		94.6 ± 21.2	82.3	190.5	1.00
neqo	neqo	reno	on	157.4 ± 75.9	99.5	354.0	1.00
neqo	neqo	reno		120.6 ± 27.6	95.0	212.1	1.00
neqo	neqo	cubic	on	172.4 ± 89.8	100.2	405.5	1.00
neqo	neqo	cubic		163.5 ± 88.5	98.4	365.1	1.00

:arrow_down: Download logs

mxinden commented 2 months ago

1-conn/1-100mb-resp (aka. Download)/client: 💚 Performance has improved.

time: [98.680 ms 98.918 ms 99.154 ms] thrpt: [1008.5 MiB/s 1010.9 MiB/s 1013.4 MiB/s] change: time: [-12.936% -12.560% -12.190%] (p = 0.00 < 0.05) thrpt: [+13.882% +14.364% +14.858%]

This looks promising.

mxinden commented 2 months ago

This pull request is ready for a first review.

Due to its substantial size, I recommend beginning with the (updated) description of the pull request.

larseggert commented 2 months ago

Long-lived send and receive buffers come with a bit of complexity. Do you think that complexity is worth the benefit? In other words, are you in favor of this change in general?

I think it is. In order to increase performance, we need to eliminate memory allocations and copies.

If yes, given that this is a large pull request, do you want me to split this pull request up into smaller pull requests, or do you prefer one atomic pull request?

I don't mind the large PR. Is there a way so the tests can continue to use a fn name process and not process_alloc? Many lines in the PR are only due to that renaming.

(Nit: I also find write_buffer lengthy. Maybe simply out?)

mxinden commented 2 months ago

Long-lived send and receive buffers come with a bit of complexity. Do you think that complexity is worth the benefit? In other words, are you in favor of this change in general?

I think it is. In order to increase performance, we need to eliminate memory allocations and copies.

If yes, given that this is a large pull request, do you want me to split this pull request up into smaller pull requests, or do you prefer one atomic pull request?

I don't mind the large PR.

Great. Then, unless someone else objects, this is ready for more detailed reviews.

If yes, given that this is a large pull request, do you want me to split this pull request up into smaller pull requests, or do you prefer one atomic pull request?

I don't mind the large PR. Is there a way so the tests can continue to use a fn name process and not process_alloc? Many lines in the PR are only due to that renaming.

Unfortunately I don't see a way around having two functions (e.g. process and process_alloc).

Reasons:

process needs a 4th argument, namely write_buffer.
process returns a borrowed Output<&'a [u8]>, process_alloc returns an owned Output.

(Obviously we can name the two functions however we like.)

(Nit: I also find write_buffer lengthy. Maybe simply out?)

Fine by me.

larseggert commented 2 months ago

Can we leave process as is and name the new four-arg version something else? process_into_buffer or something better?

mxinden commented 2 months ago

@larseggert I did the renames suggested above.

mxinden commented 2 months ago

There are a bunch of TODOs (beyond those I marked) that probably need addressing.

Sorry for the noise. Before solving some of the trickier TODOs, I wanted to make sure the direction taken here, requiring the TODOs in the first place, is the direction we want to take.

Best example is the false-positive double-borrow in https://github.com/mozilla/neqo/pull/2093#discussion_r1768003763.

I plan to solve all TODOs before merging.

mxinden commented 1 month ago

Naming

We call the individual payloads "segments", which as a TCP person I find confusing.

While it can be easily confused with the TCP usage of the same term, I do believe "segments" is the right word to refer to the parts of a GSO/GRO UDP datagram. See for example UDP GSO and GRO Linux Kernel patches and the documentation of the Windows equivalent.

@larseggert are you aware of any other word used to refer to the parts of a GSO/GRO UDP datagram?

We also talk about "segment length" and "stride length".

"stride" is only used when interacting with the quinn-udp API. quinn-udp uses "stride" and "segment" interchangeably. I don't know the origin of using "stride". I am not aware of any other system using the term for this meaning. Happy to start a discussion upstream if you think it is helpful.

As said above, I believe "segment" is the conventional term.

Datastructure for a collection of UDP datagram segments

I think we should come up with and then use consistent terminology for what we call these new datagrams that have segments in them. [...] DatagramVec? PacketVec?

I prefer the current approach, namely to extend the notion of a Datagram to contain one or more segments. Each segment has the same 4-tuple (i.e. (src_ip, src_port, dst_ip, dst_port)), TOS, ... . Thus logic that only handles the metadata of a Datagram doesn't care whether it handles the metadata of one Datagram, or the metadata of a collection of Datagrams each with the same metadata. (Only exception being accounting logic like neqo_transport::Stats.)

I don't feel strongly about it. In case it doesn't add too much noise, I am fine introducing a new datastructure, representing a collection of UDP datagram segments. SegmentedDatagram maybe?

Future with `recvmmsg`

The matter gets a bit more complicated in the future. Currently we only pass a single iovec to recvmmsg, thus only leveraging GRO, not actually recvmmsg. Once we pass multiple iovecs to recvmmsg, the UDP datagrams across iovecs might differ in their metadata (e.g. 4-tuple) (recvmmsg style). The the UDP datagrams within an iovec will have equal metadata (GRO style).

With the current approach, each iovec would result in a single Datagram with one or more segments, and the whole recv system call with multiple iovecs would result in a collection of Datagrams each with one or more segments. Thus the Socket::recv signature would change as following:

     pub fn recv<'a>(
         &self,
         local_address: &SocketAddr,
         recv_buf: &'a mut Vec<u8>,
-    ) -> Result<Option<Datagram<&'a [u8]>>, io::Error> {
+    ) -> Result<impl Iterator<Item = Datagram<&'a [u8]>>, io::Error> {

larseggert commented 1 month ago

Thanks for the detailed explanation. I'm OK with this.

larseggert commented 1 month ago

@mxinden I guess this needs more work before it can be merged? Should we make this a draft until then?

mxinden commented 1 month ago

This pull request is ready for a full review. All TODOs are addressed.

mxinden commented 1 month ago

Thank you for the reviews @larseggert and @martinthomson.

It seems to me like you could do this in pieces, starting with the encoder/decoder changes.

Sounds good. I will address the comments here and then break this large pull request into smaller ones

Back to draft for now.

mxinden commented 2 days ago

Receive path optimization has been merged through https://github.com/mozilla/neqo/pull/2184.

While this pull request can serve as guidance for a future send path patch, I assume it is better to start from scratch.

Thus closing here.

mozilla / neqo