mxinden closed this pull request 2 days ago.
The following builds are available for testing. Crossed-out builds did not succeed.
QUIC Interop Runner, client vs. server
Attention: Patch coverage is 98.48837%, with 13 lines in your changes missing coverage. Please review.
Project coverage is 95.37%. Comparing base (eb92e43) to head (f866a2a).
Performance differences relative to 55e3a9363c28632dfb29ce91c7712cab1f6a58da.
time: [98.877 ns 99.204 ns 99.534 ns] change: [-12.567% -11.944% -11.215%] (p = 0.00 < 0.05) Found 14 outliers among 100 measurements (14.00%) 10 (10.00%) high mild 4 (4.00%) high severe
time: [116.94 ns 117.71 ns 118.88 ns] change: [-33.464% -33.143% -32.768%] (p = 0.00 < 0.05) Found 17 outliers among 100 measurements (17.00%) 1 (1.00%) low mild 4 (4.00%) high mild 12 (12.00%) high severe
time: [116.55 ns 117.02 ns 117.57 ns] change: [-39.686% -35.379% -32.759%] (p = 0.00 < 0.05) Found 13 outliers among 100 measurements (13.00%) 1 (1.00%) low severe 1 (1.00%) low mild 2 (2.00%) high mild 9 (9.00%) high severe
time: [97.876 ns 98.011 ns 98.159 ns] change: [-31.803% -31.179% -30.566%] (p = 0.00 < 0.05) Found 7 outliers among 100 measurements (7.00%) 2 (2.00%) high mild 5 (5.00%) high severe
time: [111.73 ms 111.78 ms 111.83 ms] change: [+0.2898% +0.3546% +0.4219%] (p = 0.00 < 0.05) Found 11 outliers among 100 measurements (11.00%) 6 (6.00%) low mild 5 (5.00%) high mild
time: [26.340 ms 27.597 ms 28.883 ms] change: [-8.9450% -3.4441% +2.3427%] (p = 0.25 > 0.05) Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) high mild
time: [34.417 ms 36.080 ms 37.798 ms] change: [-11.452% -5.8066% +0.6803%] (p = 0.07 > 0.05) Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) high mild
time: [26.051 ms 26.928 ms 27.833 ms] change: [-5.5948% -1.5392% +3.1644%] (p = 0.50 > 0.05) Found 2 outliers among 100 measurements (2.00%) 2 (2.00%) high mild
time: [43.023 ms 45.160 ms 47.311 ms] change: [-3.2363% +3.2694% +10.031%] (p = 0.33 > 0.05) Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) high mild
time: [104.12 ms 104.36 ms 104.60 ms] thrpt: [956.00 MiB/s 958.22 MiB/s 960.45 MiB/s] change: time: [-10.280% -9.9749% -9.6778%] (p = 0.00 < 0.05) thrpt: [+10.715% +11.080% +11.458%]
time: [326.60 ms 329.71 ms 332.82 ms] thrpt: [30.047 Kelem/s 30.330 Kelem/s 30.619 Kelem/s] change: time: [+1.3183% +2.9210% +4.4778%] (p = 0.00 < 0.05) thrpt: [-4.2859% -2.8381% -1.3012%] Found 2 outliers among 100 measurements (2.00%) 1 (1.00%) low mild 1 (1.00%) high mild
time: [34.715 ms 34.913 ms 35.130 ms] thrpt: [28.465 elem/s 28.643 elem/s 28.806 elem/s] change: time: [+2.6418% +3.4761% +4.2420%] (p = 0.00 < 0.05) thrpt: [-4.0693% -3.3593% -2.5738%] Found 13 outliers among 100 measurements (13.00%) 6 (6.00%) low mild 7 (7.00%) high severe
Transfer of 33554432 bytes over loopback.

| Client | Server | CC | Pacing | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|---|---|---|
| msquic | msquic | | | 220.4 ± 139.4 | 101.6 | 645.6 | 1.00 |
| neqo | msquic | reno | on | 279.0 ± 120.2 | 207.5 | 593.1 | 1.00 |
| neqo | msquic | reno | | 281.0 ± 122.0 | 204.7 | 613.9 | 1.00 |
| neqo | msquic | cubic | on | 260.9 ± 78.4 | 206.9 | 456.7 | 1.00 |
| neqo | msquic | cubic | | 215.8 ± 17.0 | 193.7 | 244.4 | 1.00 |
| msquic | neqo | reno | on | 120.7 ± 84.2 | 80.8 | 363.3 | 1.00 |
| msquic | neqo | reno | | 90.4 ± 20.2 | 79.8 | 176.8 | 1.00 |
| msquic | neqo | cubic | on | 96.3 ± 26.5 | 82.7 | 218.0 | 1.00 |
| msquic | neqo | cubic | | 94.6 ± 21.2 | 82.3 | 190.5 | 1.00 |
| neqo | neqo | reno | on | 157.4 ± 75.9 | 99.5 | 354.0 | 1.00 |
| neqo | neqo | reno | | 120.6 ± 27.6 | 95.0 | 212.1 | 1.00 |
| neqo | neqo | cubic | on | 172.4 ± 89.8 | 100.2 | 405.5 | 1.00 |
| neqo | neqo | cubic | | 163.5 ± 88.5 | 98.4 | 365.1 | 1.00 |
1-conn/1-100mb-resp (aka. Download)/client: 💚 Performance has improved.
time: [98.680 ms 98.918 ms 99.154 ms] thrpt: [1008.5 MiB/s 1010.9 MiB/s 1013.4 MiB/s] change: time: [-12.936% -12.560% -12.190%] (p = 0.00 < 0.05) thrpt: [+13.882% +14.364% +14.858%]
This looks promising.
This pull request is ready for a first review.
Due to its substantial size, I recommend beginning with the (updated) description of the pull request.
> Long-lived send and receive buffers come with a bit of complexity. Do you think that complexity is worth the benefit? In other words, are you in favor of this change in general?
I think it is. In order to increase performance, we need to eliminate memory allocations and copies.
> If yes, given that this is a large pull request, do you want me to split this pull request up into smaller pull requests, or do you prefer one atomic pull request?
I don't mind the large PR. Is there a way for the tests to continue using the fn name `process` and not `process_alloc`? Many lines in the PR are only due to that renaming.
(Nit: I also find `write_buffer` lengthy. Maybe simply `out`?)
> Long-lived send and receive buffers come with a bit of complexity. Do you think that complexity is worth the benefit? In other words, are you in favor of this change in general?
>
> I think it is. In order to increase performance, we need to eliminate memory allocations and copies.

> If yes, given that this is a large pull request, do you want me to split this pull request up into smaller pull requests, or do you prefer one atomic pull request?
>
> I don't mind the large PR.
Great. Then, unless someone else objects, this is ready for more detailed reviews.
> If yes, given that this is a large pull request, do you want me to split this pull request up into smaller pull requests, or do you prefer one atomic pull request?
>
> I don't mind the large PR. Is there a way for the tests to continue using the fn name `process` and not `process_alloc`? Many lines in the PR are only due to that renaming.
Unfortunately I don't see a way around having two functions (e.g. `process` and `process_alloc`). Reasons:

- `process` needs a 4th argument, namely `write_buffer`.
- `process` returns a borrowed `Output<&'a [u8]>`, while `process_alloc` returns an owned `Output`.

(Obviously we can name the two functions however we like.)
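For illustration, the two variants could be sketched roughly as follows. This is a sketch based on the points above; the actual names, parameter order, and types in the PR may differ:

```rust
// Sketch of the two variants (names and exact parameters are illustrative):

// Borrowing variant: writes outgoing datagrams into a caller-provided buffer
// and returns an Output that borrows from it (hence the 4th argument).
pub fn process<'a>(
    &mut self,
    input: Option<Datagram<&[u8]>>,
    now: Instant,
    write_buffer: &'a mut Vec<u8>,
) -> Output<&'a [u8]>;

// Allocating variant: same inputs minus the buffer, but returns an owned
// Output, allocating a fresh Vec<u8> for any outgoing datagram.
pub fn process_alloc(&mut self, input: Option<Datagram<&[u8]>>, now: Instant) -> Output;
```

Whatever the names end up being, the essential difference is the extra buffer argument and the borrowed return type.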
> (Nit: I also find `write_buffer` lengthy. Maybe simply `out`?)
Fine by me.
Can we leave `process` as is and name the new four-arg version something else? `process_into_buffer` or something better?
@larseggert I did the renames suggested above.
There are a bunch of `TODO`s (beyond those I marked) that probably need addressing.
Sorry for the noise. Before solving some of the trickier TODOs, I wanted to make sure that the direction taken here, which is what requires those TODOs in the first place, is the direction we want to take.
The best example is the false-positive double borrow in https://github.com/mozilla/neqo/pull/2093#discussion_r1768003763.
I plan to solve all TODOs before merging.
> We call the individual payloads "segments", which as a TCP person I find confusing.
While it can be easily confused with the TCP usage of the same term, I do believe "segments" is the right word to refer to the parts of a GSO/GRO UDP datagram. See for example UDP GSO and GRO Linux Kernel patches and the documentation of the Windows equivalent.
@larseggert are you aware of any other word used to refer to the parts of a GSO/GRO UDP datagram?
> We also talk about "segment length" and "stride length".
"stride" is only used when interacting with the quinn-udp
API. quinn-udp
uses "stride" and "segment" interchangeably. I don't know the origin of using "stride". I am not aware of any other system using the term for this meaning. Happy to start a discussion upstream if you think it is helpful.
As said above, I believe "segment" is the conventional term.
> I think we should come up with and then use consistent terminology for what we call these new datagrams that have segments in them. [...] `DatagramVec`? `PacketVec`?
I prefer the current approach, namely to extend the notion of a `Datagram` to contain one or more segments. Each segment has the same 4-tuple (i.e. `(src_ip, src_port, dst_ip, dst_port)`), TOS, ... . Thus logic that only handles the metadata of a `Datagram` doesn't care whether it handles the metadata of one `Datagram`, or the metadata of a collection of `Datagram`s that all share the same metadata. (The only exception is accounting logic like `neqo_transport::Stats`.)

I don't feel strongly about it. If it doesn't add too much noise, I am fine with introducing a new data structure representing a collection of UDP datagram segments. `SegmentedDatagram` maybe?
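To make the trade-off concrete, such a data structure might look roughly like this. This is a hypothetical sketch; `SegmentedDatagram`, its fields, and `segment_size` are illustrative names, not the PR's actual API:

```rust
use std::net::SocketAddr;

/// Hypothetical: one GSO/GRO "super-datagram" whose payload holds one or more
/// equally sized segments sharing the same 4-tuple and TOS.
pub struct SegmentedDatagram<'a> {
    src: SocketAddr,
    dst: SocketAddr,
    /// Simplified; the real type would likely be a dedicated TOS/ECN type.
    tos: u8,
    /// Length of each segment; only the last segment may be shorter.
    segment_size: usize,
    /// Borrowed view into a long-lived receive buffer.
    payload: &'a [u8],
}

impl<'a> SegmentedDatagram<'a> {
    /// Iterate over the individual UDP datagram payloads ("segments").
    pub fn segments(&self) -> impl Iterator<Item = &'a [u8]> + 'a {
        self.payload.chunks(self.segment_size)
    }
}
```

The alternative taken in this PR keeps a single `Datagram` type and treats a multi-segment payload as one `Datagram`, so most metadata-handling code stays unchanged.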
`recvmmsg`

The matter gets a bit more complicated in the future. Currently we only pass a single `iovec` to `recvmmsg`, thus only leveraging GRO and not actually making use of `recvmmsg`. Once we pass multiple `iovec`s to `recvmmsg`, the UDP datagrams across `iovec`s might differ in their metadata (e.g. 4-tuple) (`recvmmsg` style). The UDP datagrams within an `iovec` will have equal metadata (GRO style).
With the current approach, each `iovec` would result in a single `Datagram` with one or more segments, and the whole `recv` system call with multiple `iovec`s would result in a collection of `Datagram`s, each with one or more segments. Thus the `Socket::recv` signature would change as follows:
```diff
  pub fn recv<'a>(
      &self,
      local_address: &SocketAddr,
      recv_buf: &'a mut Vec<u8>,
- ) -> Result<Option<Datagram<&'a [u8]>>, io::Error> {
+ ) -> Result<impl Iterator<Item = Datagram<&'a [u8]>>, io::Error> {
```
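As a rough illustration of how a caller might consume that iterator (hypothetical usage; `socket`, `local_address`, and `handle_incoming` are assumed names, not the actual `neqo_bin` code):

```rust
// Hypothetical: drain all datagrams produced by one recv call. Each item is a
// Datagram<&[u8]> borrowing from recv_buf; segments inside one Datagram share
// the same 4-tuple and TOS.
let mut recv_buf: Vec<u8> = vec![0; 1 << 16];
for datagram in socket.recv(&local_address, &mut recv_buf)? {
    handle_incoming(datagram);
}
```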
Thanks for the detailed explanation. I'm OK with this.
@mxinden I guess this needs more work before it can be merged? Should we make this a draft until then?
This pull request is ready for a full review. All TODOs are addressed.
Thank you for the reviews @larseggert and @martinthomson.
> It seems to me like you could do this in pieces, starting with the encoder/decoder changes.

Sounds good. I will address the comments here and then break this large pull request into smaller ones.
Back to draft for now.
The receive-path optimization has been merged via https://github.com/mozilla/neqo/pull/2184.
While this pull request can serve as guidance for a future send-path patch, I assume it is better to start from scratch.
Thus closing here.
Description
This change is best summarized by the `process` function signature.

On the `main` branch the `process` function:

- Takes as input a `Datagram`. That `Datagram` owns an allocation of the UDP payload, i.e. a `Vec<u8>`. Thus for each incoming UDP datagram, its payload is allocated in a new `Vec`.
- Returns an `Output`. Most relevantly, the `Output` variant `Output::Datagram(Datagram)` contains a `Datagram` that again owns an allocation of the UDP payload, i.e. a `Vec<u8>`. Thus for each outgoing UDP datagram too, its payload is allocated in a new `Vec`.
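For reference, the `main`-branch signature looks roughly like this (a sketch for orientation; see the repository for the exact definition):

```rust
// Approximate main-branch signature: the input Datagram and the Datagram
// inside the returned Output both own their payload as a Vec<u8>.
pub fn process(&mut self, dgram: Option<&Datagram>, now: Instant) -> Output
```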
This commit changes the `process` function to:

- Take as input a `Datagram<&[u8]>`. Contrary to before, `Datagram<&[u8]>` does not own an allocation of the UDP payload, but represents a view into a long-lived receive buffer containing the UDP payload.
- Return an `Output<&'a [u8]>`, where the `Output::Datagram(Datagram<&'a [u8]>)` variant does not own an allocation of the UDP payload either, but likewise represents a view into a long-lived write buffer the payload is written into. That write buffer lives outside of `neqo_transport::Connection` and is provided to `process_into_buffer` as `write_buffer: &'a mut Vec<u8>`. Note that both `write_buffer` and `Output` use the lifetime `'a`, i.e. the latter is a view into the former.
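Correspondingly, the new entry point can be sketched as follows (reconstructed from the description above; the exact signature in the PR may differ):

```rust
// Sketch: the input borrows from a receive buffer, and the returned Output
// borrows (lifetime 'a) from the caller-provided write_buffer.
pub fn process_into_buffer<'a>(
    &mut self,
    input: Option<Datagram<&[u8]>>,
    now: Instant,
    write_buffer: &'a mut Vec<u8>,
) -> Output<&'a [u8]>
```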
This change to the `process` function enables the following:

1. A user of `neqo_transport` (e.g. `neqo_bin`) has the OS write incoming UDP datagrams into a long-lived receive buffer (via e.g. `recvmmsg`).
2. The user passes the incoming datagram, as a `Datagram<&[u8]>` view into that receive buffer, to `neqo_transport::Connection::process_into_buffer` along with a long-lived write buffer.
3. `process_into_buffer` reads the UDP datagram from the long-lived receive buffer through the `Datagram<&[u8]>` view and writes outgoing datagrams into the provided long-lived `write_buffer`, returning a view into said buffer via a `Datagram<&'a [u8]>`.
4. The user of `process_into_buffer` can then pass the write buffer to the OS (e.g. via `sendmsg`).

To summarize, a user can receive and send UDP datagrams without allocation in the UDP IO path.
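For illustration, the resulting IO loop could look roughly like this. This is a hypothetical sketch; `socket.recv`, `socket.send`, the buffer sizes, and the error handling are simplified stand-ins, not the actual `neqo_bin` code:

```rust
// Hypothetical event-loop fragment: both buffers are allocated once and
// reused, so the per-datagram path is allocation-free.
let mut recv_buf: Vec<u8> = vec![0; 1 << 16];
let mut write_buffer: Vec<u8> = Vec::with_capacity(1 << 16);

loop {
    // The OS writes incoming datagrams into the long-lived receive buffer
    // (GRO / recvmmsg); the result is a view into recv_buf.
    let incoming: Option<Datagram<&[u8]>> = socket.recv(&local_address, &mut recv_buf)?;

    // The connection reads through the view and writes any outgoing datagram
    // into write_buffer, returning a view into it.
    match connection.process_into_buffer(incoming, Instant::now(), &mut write_buffer) {
        Output::Datagram(dgram) => socket.send(dgram)?, // hand the view to sendmsg/GSO
        Output::Callback(_timeout) => { /* arm a timer */ }
        Output::None => { /* nothing to do */ }
    }
}
```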
As an aside, the above is compatible with GSO and GRO, where a send or receive buffer contains a number of consecutive UDP datagram segments.
Performance impact
Early benchmarks are promising, showing e.g. a 10% improvement in the Download benchmark, and up to 40% improvement in the neqo-neqo-reno-pacing benchmark.
Benchmark runs:

- This pull request
- Current `main` for comparison: https://github.com/mozilla/neqo/actions/runs/10850817785
Pull request status
This pull request is ready for review.
Replaces https://github.com/mozilla/neqo/pull/2076. Part of https://github.com/mozilla/neqo/issues/1693. Closes https://github.com/mozilla/neqo/issues/1922.