quinn-rs / quinn

Async-friendly QUIC implementation in Rust
Apache License 2.0

Quinn + IoUring benches and question #1537

Closed: serzhiio closed this issue 7 months ago

serzhiio commented 1 year ago

I've just implemented QuinnProto on top of io_uring, ran several benchmarks, and compared the results with io_uring + TCP + TLS (rustls) + WebSocket. The benchmark is a simple ping/pong (every 10 ms) for WS, and two uni channels carrying ping/pong-like data for QUIC.

In short: after connecting, the client sends a ping carrying a timestamp, the server responds to the client with the same timestamp, and the client compares the received timestamp with the current time.
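The measurement scheme described above can be sketched roughly like this (a minimal in-process sketch using `std::time`; `server_echo` is a hypothetical stand-in for the real QUIC/WS round trip, which actually carries the timestamp over the wire):

```rust
use std::time::{Duration, Instant};

// Hypothetical stand-in for the real transport: the server simply
// echoes the client's timestamp back unchanged.
fn server_echo(ping_timestamp: Instant) -> Instant {
    ping_timestamp
}

// One ping/pong sample: send the current timestamp, get it echoed
// back, and compare it with the time of arrival.
fn measure_once() -> Duration {
    let sent_at = Instant::now();
    let echoed = server_echo(sent_at);
    echoed.elapsed()
}

fn main() {
    let samples: Vec<Duration> = (0..5).map(|_| measure_once()).collect();
    let mean = samples.iter().sum::<Duration>() / samples.len() as u32;
    println!("mean round-trip: {:?}", mean);
}
```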

Poll:

```
    b:5000 ZeroCopy Send
Count QUIC latency (batch=5000) Mean: 76us ±31.328(std) | Skew: 2.063us | Kurt: 5.879us | MinMax: 33.000/345.000
Count WS+TLS latency (batch=5000) Mean: 72us ±30.648(std) | Skew: 3.084us | Kurt: 40.572us | MinMax: 25.000/698.000
    b:5000 ZeroCopy Send
Count QUIC latency (batch=5000) Mean: 79us ±34.908(std) | Skew: 2.570us | Kurt: 13.726us | MinMax: 33.000/503.000
Count WS+TLS latency (batch=5000) Mean: 77us ±35.788(std) | Skew: 1.961us | Kurt: 7.384us | MinMax: 26.000/437.000
    b:10000 ZeroCopy Send
Count QUIC latency (batch=10000) Mean: 112us ±52.715(std) | Skew: 2.103us | Kurt: 13.346us | MinMax: 33.000/847.000
Count QUIC latency (batch=10000) Mean: 76us ±31.389(std) | Skew: 2.102us | Kurt: 8.207us | MinMax: 30.000/406.000
Count QUIC latency (batch=10000) Mean: 74us ±31.293(std) | Skew: 2.547us | Kurt: 20.426us | MinMax: 29.000/672.000
Count WS+TLS latency (batch=10000) Mean: 95us ±43.731(std) | Skew: 1.751us | Kurt: 7.853us | MinMax: 26.000/632.000
Count WS+TLS latency (batch=10000) Mean: 59us ±24.811(std) | Skew: 2.677us | Kurt: 18.782us | MinMax: 24.000/523.000
Count WS+TLS latency (batch=10000) Mean: 59us ±25.257(std) | Skew: 4.522us | Kurt: 66.180us | MinMax: 25.000/728.000

    b:10000 nonZeroCopy Send
Count WS+TLS latency (batch=10000) Mean: 57us ±26.273(std) | Skew: 3.249us | Kurt: 22.248us | MinMax: 23.000/490.000
Count WS+TLS latency (batch=10000) Mean: 62us ±26.747(std) | Skew: 2.022us | Kurt: 9.757us | MinMax: 23.000/384.000
Count WS+TLS latency (batch=10000) Mean: 58us ±25.249(std) | Skew: 4.828us | Kurt: 86.664us | MinMax: 23.000/786.000
Count QUIC latency (batch=10000) Mean: 81us ±35.101(std) | Skew: 1.478us | Kurt: 4.784us | MinMax: 29.000/485.000
Count QUIC latency (batch=10000) Mean: 73us ±32.120(std) | Skew: 2.259us | Kurt: 11.126us | MinMax: 28.000/438.000
Count QUIC latency (batch=10000) Mean: 69us ±30.011(std) | Skew: 2.928us | Kurt: 19.232us | MinMax: 29.000/576.000
```

SqPoll:

```
    b:10000 ZeroCopy Send
Count QUIC latency (batch=10000) Mean: 44us ±16.253(std) | Skew: 3.223us | Kurt: 15.009us | MinMax: 28.000/210.000
Count WS+TLS latency (batch=10000) Mean: 37us ±12.580(std) | Skew: 3.153us | Kurt: 14.359us | MinMax: 22.000/146.000
    b:10000 nonZeroCopy Send
Count QUIC latency (batch=10000) Mean: 35us ±8.258(std) | Skew: 3.600us | Kurt: 24.076us | MinMax: 25.000/152.000
Count WS+TLS latency (batch=10000) Mean: 28us ±6.172(std) | Skew: 4.105us | Kurt: 29.297us | MinMax: 20.000/117.000
```

The WebSocket implementation shows slightly better results; is that expected, or not? P.S.: io_uring's IoPoll was not tested.

Ralith commented 1 year ago

Marginal differences in latency over loopback are not meaningful, so this data doesn't say much. What type of performance is your actual application concerned about?

Note io_uring is unlikely to automatically provide significant performance benefits, and might even reduce performance compared to standard quinn if it is not carefully structured, e.g. leveraging the offload mechanisms used by quinn-udp. io_uring is mostly interesting in that it enables new ways to manage scheduling and concurrency, which are complex to take advantage of.

serzhiio commented 1 year ago

> Marginal differences in latency over loopback are not meaningful, so this data doesn't say much. What type of performance is your actual application concerned about?

Marginal? These latency differences are not marginal for me, and io_uring really does give some performance over poll (especially when using provided receive buffers shared with the OS), compared to my previous Mio-based engine; neither is Futures-based. The application is concerned about latency :)

> Note io_uring is unlikely to automatically provide significant performance benefits, and might even reduce performance compared to standard quinn if it is not carefully structured, e.g. leveraging the offload mechanisms used by quinn-udp. io_uring is mostly interesting in that it enables new ways to manage scheduling and concurrency, which are complex to take advantage of.

The main idea was to test the engine's overhead latency, not the network's. So I'm trying to understand: is QUIC supposed to have better latency than TCP+TLS?

Ralith commented 1 year ago

> These latency differences are not marginal for me

What is your application such that the difference between e.g. 76us ±31.328 and 72us ±30.648 matters? Note the variance is much larger than the difference between means.

> is QUIC supposed to have better latency than TCP+TLS?

Choice of transport protocol will not meaningfully affect the latency of information traversing the loopback interface.

serzhiio commented 1 year ago

Setting bigger batches gives a bigger difference. The last two results are much more informative and representative: with SqPoll the operations involve almost no context switches, and there is a ~20% latency difference. The engine is for HFT applications and algorithms.

Ralith commented 1 year ago

The last two results show a 7μs difference with about the same variance; those are still very close. Latencies at that scale are going to be very sensitive to the details of your code. Is your io_uring backend employing GSO and GRO? Have you done any profiling?

serzhiio commented 1 year ago

> The last two results show a 7μs difference with about the same variance; those are still very close. Latencies at that scale are going to be very sensitive to the details of your code. Is your io_uring backend employing GSO and GRO? Have you done any profiling?

Not yet, I just finished the implementation; profiling is the next step. GSO and GRO are implemented as in quinn_udp, but I'm not sure whether they're working, especially given that io_uring does not support libc::SYS_sendmmsg. UDP in general is new stuff for me, so I may be wrong somewhere.

Ralith commented 1 year ago

> GSO and GRO are implemented as in quinn_udp

Good, that's probably the single most important factor for kernel-side performance of a QUIC stack.

> not sure whether they're working

Check that you're getting Transmit structures with segment_size set to Some from quinn-proto, and that you're getting multiple segments from the kernel in GRO when many packets are incoming.
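A sanity check for the sending side might look something like this. The `Transmit` struct below is a simplified stand-in for `quinn_proto::Transmit`, keeping only the `segment_size: Option<usize>` field relevant to GSO; the predicate is an assumption about what "exercising GSO" means, not quinn's own code:

```rust
// Simplified stand-in for quinn_proto::Transmit.
struct Transmit {
    contents: Vec<u8>,
    // Some(n) means the contents hold multiple packets of up to n bytes
    // each, to be sent as one GSO batch; None means a single datagram.
    segment_size: Option<usize>,
}

// True when the transmit would actually exercise GSO: the segment size
// is set and the payload spans more than one segment.
fn uses_gso(t: &Transmit) -> bool {
    match t.segment_size {
        Some(seg) => t.contents.len() > seg,
        None => false,
    }
}

fn main() {
    let batched = Transmit { contents: vec![0u8; 3600], segment_size: Some(1200) };
    let single = Transmit { contents: vec![0u8; 1200], segment_size: None };
    println!("batched uses GSO: {}", uses_gso(&batched));
    println!("single uses GSO: {}", uses_gso(&single));
}
```

The mirror check on the receiving side is that a single GRO read hands back a buffer longer than one segment when many packets arrive at once.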

> especially given that io_uring does not support libc::SYS_sendmmsg

This should be fine; sendmmsg is just submitting multiple operations in one syscall, which io_uring already enables on its own.
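To illustrate the point with a toy model (not the real io_uring API): operations are queued individually and handed to the kernel in one submit, so one submission covers many sends, much as one sendmmsg syscall covers many messages.

```rust
// Toy model of an io_uring-style submission queue: entries accumulate
// in user space and are flushed to the "kernel" by a single submit.
struct SubmissionQueue {
    pending: Vec<Vec<u8>>,
    submit_calls: usize, // stands in for the syscall count
}

impl SubmissionQueue {
    fn new() -> Self {
        Self { pending: Vec::new(), submit_calls: 0 }
    }

    // Queue one send operation; no syscall happens here.
    fn push_send(&mut self, datagram: Vec<u8>) {
        self.pending.push(datagram);
    }

    // One submit hands every queued operation over at once, which is
    // why a separate sendmmsg is unnecessary.
    fn submit(&mut self) -> usize {
        self.submit_calls += 1;
        let n = self.pending.len();
        self.pending.clear();
        n
    }
}

fn main() {
    let mut sq = SubmissionQueue::new();
    for i in 0..8u8 {
        sq.push_send(vec![i; 64]);
    }
    let submitted = sq.submit();
    println!("{submitted} sends in {} submit call(s)", sq.submit_calls);
}
```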

> Not yet, I just finished the implementation.

I'll be interested to hear what you find!

Ralith commented 7 months ago

Closing as this seems to be stale, but feel free to open a new issue if there's something further to discuss.