uNetworking / uWebSockets

Simple, secure & standards compliant web server for the most demanding of applications
Apache License 2.0

io_uring new uSockets backend #1603

Open uNetworkingAB opened 1 year ago

uNetworkingAB commented 1 year ago

As of Linux 5.19 I now see io_uring slightly outperforming epoll in an actual networked ping/pong benchmark involving client and server TCP processes. It's less than 10% on my computers, often less than 5%, so we do not see any gigantic gains since we come from a highly optimized epoll usage. But there are gains, esp. for mitigated systems.

The long-term plan is to add a new backend in uSockets. It's very, very low priority, but I'm interested long term.

uNetworkingAB commented 1 year ago

Scratch that - as of Linux 6.0 there are significant gains with io_uring. I can see 18% gains in testing if using the 6.0 features, and they make a lot of sense now even for low memory cases and many restrictions have been removed.

This is going to be a new backend in uSockets, probably the default one as soon as possible. I like it and it will be nice to have a path that is entirely optimized solely for Linux.

uNetworkingAB commented 1 year ago

Scratch that, the gains I'm seeing are more like 21%.

uNetworkingAB commented 1 year ago

Linux 6.0+ is crazy fast. I have a build of uWS that does HTTP messaging with URL router over io_uring, faster than raw TCP messaging over epoll.

This perf. is going to compound for pub/sub use cases and there's no doubt that io_uring will be default in v21. This is a game changer.

My shitty 10 year old semi-budget PC can do 406k messages per second, on 1 CPU core. This was 325k with epoll. And it does 336k URL routed HTTP messages per second. That's slap in the face kind of performance.

dalisoft commented 1 year ago

@uNetworkingAB Will there be an alpha release of uWebSockets.js for this implementation?

uNetworkingAB commented 1 year ago

I was still experimenting with a POC 5 days ago, and got this working in uSockets for the first time yesterday. So we are many months away from even beginning to think about what to do about Node.js integration.

But there has always been a difference between what is used in uWebSockets.js and what is used in uWebSockets - one uses libuv and the other uses raw epoll. So it will most likely just change to being libuv for one and io_uring for the other.

dalisoft commented 1 year ago

So this implementation is only for C++?

angelsanzn commented 1 year ago

For the sake of completeness: apparently libuv is working on its own implementation of io_uring as a "backend". First proposed in 2018 (https://github.com/libuv/libuv/issues/1947) but movement started very recently (https://github.com/libuv/libuv/pull/3952, https://github.com/libuv/libuv/pull/3979).

uNetworkingAB commented 1 year ago

Oh ow, I'm up almost 100k messages per second:

Messages per second: 424716.812500
Messages per second: 425812.031250
Messages per second: 423331.312500
Messages per second: 424703.812500

This used to be 325k or something

uNetworkingAB commented 1 year ago

Holy cock!

Req/sec: 469954.250000
Req/sec: 467404.500000
Req/sec: 468516.500000

The latest io_uring Linux "for-next" branch is speedy as all heck

kolinfluence commented 1 year ago

@uNetworkingAB just curious, do u think af_xdp will be faster?

your findings remind me of this benchmark... https://github.com/pawelgaczynski/gain

uNetworkingAB commented 1 year ago

TLDR - AF_XDP is kernel bypassing, we are not doing that here. The target platform for uWS is "vanilla Linux", not bypasses.

It's also hard to tell if that's actually better than Linux networking given that it depends on what drivers your particular NIC has, such as whether it does full bypass on its own or does TLS offloading, etc.

Bypasses drastically reduce the ease of use / applicability to real world deployments in real companies.

hiqsociety commented 11 months ago

@uNetworkingAB maybe someday you can do a multicore speed comparison with https://github.com/bytedance/monoio for HTTP. curious how much perf difference there is vs a Rust-based program

uNetworkingAB commented 6 months ago

I've done some thinking and investigation,

The reason epoll is faster than io_uring for 8 KB and 16 KB message sending is that io_uring uses much more memory (epoll uses one single receive buffer and one single send buffer, while io_uring uses per-socket receive buffers and send buffers).
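A minimal sketch of the shared-buffer pattern described above, assuming a plain epoll loop on Linux. This is not the uSockets implementation: the names `shared_recv_buffer`, `watch_readable`, and `poll_once`, the 16 KiB buffer size, and the copy into `std::string` are all hypothetical and exist only for illustration. The point is that every ready socket is read into the same memory, so the hot path touches one buffer no matter how many sockets are registered.

```cpp
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One receive buffer shared by every socket on this event loop.
// The 16 KiB size is an assumption chosen for illustration.
static char shared_recv_buffer[16 * 1024];

// Registers `fd` for readability on `epfd`. Returns true on success.
bool watch_readable(int epfd, int fd) {
    assert(epfd >= 0 && fd >= 0);
    epoll_event ev{};
    ev.events = EPOLLIN;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == 0;
}

// Waits once (up to `timeout_ms`) and performs one recv() per ready socket,
// always into the same shared buffer. The payloads are copied out only so
// the caller can inspect them; a real loop would hand the buffer straight
// to a handler and reuse it for the next socket.
std::vector<std::string> poll_once(int epfd, int timeout_ms) {
    std::vector<std::string> payloads;
    epoll_event events[64];
    int ready = epoll_wait(epfd, events, 64, timeout_ms);
    for (int i = 0; i < ready; ++i) {
        if (!(events[i].events & EPOLLIN)) continue;
        ssize_t len = recv(events[i].data.fd, shared_recv_buffer,
                           sizeof(shared_recv_buffer), 0);
        if (len > 0) {
            payloads.emplace_back(shared_recv_buffer,
                                  static_cast<std::size_t>(len));
        }
    }
    return payloads;
}
```

Because readiness is reported before the read happens, the loop can reuse one buffer for every socket in turn. The per-socket layout described for io_uring is the opposite trade-off: more memory in flight in exchange for fewer syscalls.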

With more memory usage, you won't fit in the CPU cache, whereas epoll does.

If I run 2 processes with epoll, sending TCP messages back and forth at 300k msg/sec on localhost, the number of cache misses is only a fraction of the message count. This means that the entire hot path of epoll benchmarking on localhost never even writes to RAM - it's entirely to/from CPU cache.

144 million TCP messages sent and received produce only 2 million cache misses. So the entire "message sending" between the two involved processes never even leaves the CPU cache. It's a total short circuit.
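To put the two figures above on the same scale, here is a small sketch of the ratio. The function name `cache_misses_per_message` is hypothetical; the inputs are the numbers quoted in the comment.

```cpp
#include <cassert>

// Average number of cache misses per message, given totals for one run.
double cache_misses_per_message(double cache_misses, double messages) {
    assert(messages > 0.0);
    return cache_misses / messages;
}
```

With the quoted totals, `cache_misses_per_message(2e6, 144e6)` is about 0.0139, or roughly one cache miss per 72 messages.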

So benchmarking on localhost with epoll / io_uring is very misleading unless you can find a good balance between excessive io_uring calls and memory usage.

Most likely, io_uring requires a CPU with a lot of cache - mine only has 6MB while modern CPUs have 96MB. With more cache, it will be easier to efficiently batch io_uring calls in order to get better performance.
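A rough back-of-the-envelope model of the cache argument, under stated assumptions: it counts only buffer bytes, ignores kernel socket buffers and everything else competing for cache, and treats the 6MB and 96MB figures as MiB. The function names, the socket count, and the buffer sizes in the usage note are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Buffer bytes touched when every socket owns its own receive and send
// buffer (the per-socket layout described for io_uring in this thread).
std::size_t per_socket_working_set(std::size_t sockets,
                                   std::size_t recv_bytes,
                                   std::size_t send_bytes) {
    return sockets * (recv_bytes + send_bytes);
}

// Buffer bytes touched when one receive buffer and one send buffer are
// shared by all sockets (the layout described for epoll in this thread).
std::size_t shared_working_set(std::size_t recv_bytes,
                               std::size_t send_bytes) {
    return recv_bytes + send_bytes;
}

// True if the working set is no larger than the given cache size.
bool fits_in_cache(std::size_t working_set, std::size_t cache_bytes) {
    assert(cache_bytes > 0);
    return working_set <= cache_bytes;
}
```

For example, assuming 1,000 sockets with 16 KiB receive and send buffers each, `per_socket_working_set(1000, 16384, 16384)` is 32,768,000 bytes (about 31 MiB). That does not fit in a 6 MiB cache but does fit in a 96 MiB one, while `shared_working_set(16384, 16384)` is 32 KiB and fits in both.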

I think that io_uring will never perform as well as epoll on old CPUs for this reason, as there is not enough CPU cache to queue up enough calls to be faster than the equivalent multiple calls to the send syscall.

I need a better setup (hardware) to get an apples-to-apples comparison between epoll and io_uring, probably a 10 Gbit Ethernet card with 2 connectors and a wire connecting the two. That way you guarantee both solutions must go to the NIC.

billywhizz commented 2 months ago

> I need a better setup (hardware) to get an apples-to-apples comparison between epoll and io_uring, probably a 10 Gbit Ethernet card with 2 connectors and a wire connecting the two. That way you guarantee both solutions must go to the NIC.

your analysis makes sense. i have never been able to make io_uring out-perform epoll in local benches and had assumed it was just a skill issue, but maybe not.

re. the 10Gb - techempower have recently upgraded their hardware to 10Gb between the DB/load generator/Web Server, so it might be worth putting it in there with io_uring enabled and disabled to see if it makes a difference? I dunno if it's enabled by default in the existing node.js/uWebSockets and Bun implementations on TE.