WebSockets Next: performance improvements

mkouba commented 8 months ago

Description

Follow-up of https://github.com/quarkusio/quarkus/pull/39142.

Implementation ideas

use an AtomicLongFieldUpdater in the ConcurrencyLimiter: https://github.com/quarkusio/quarkus/pull/39142#discussion_r1510855917
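
A minimal sketch of the field-updater idea, assuming the goal is to avoid allocating one AtomicLong object per instance. The class InFlightCounter and its methods are hypothetical and only illustrate the pattern; they are not the actual ConcurrencyLimiter code.

```java
import java.util.concurrent.atomic.AtomicLongFieldUpdater;

// Hypothetical class for illustration; this is not the actual ConcurrencyLimiter.
final class InFlightCounter {

    // One shared updater for all instances, instead of one AtomicLong object per instance.
    private static final AtomicLongFieldUpdater<InFlightCounter> IN_FLIGHT =
            AtomicLongFieldUpdater.newUpdater(InFlightCounter.class, "inFlight");

    // The updated field must be a non-static volatile long.
    private volatile long inFlight;

    /** Atomically increments the counter and returns the new value. */
    long start() {
        return IN_FLIGHT.incrementAndGet(this);
    }

    /** Atomically decrements the counter and returns the new value. */
    long finish() {
        return IN_FLIGHT.decrementAndGet(this);
    }

    /** Plain volatile read of the current value. */
    long current() {
        return inFlight;
    }
}
```

The trade-off is a slightly less convenient API in exchange for one fewer object per limiter instance.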

mkouba commented 7 months ago

PR https://github.com/quarkusio/quarkus/pull/40183 is related to this issue.

mkouba commented 5 months ago

For this one, we'll need some benchmarks (executed automatically), ideally comparing WS Next with quarkus-websockets and pure Vert.x.

CC @franz1981

franz1981 commented 5 months ago

I'm not aware of websocket benchmark suites sadly...

mkouba commented 5 months ago

I'm not aware of websocket benchmark suites sadly...

Me neither. We'll need to come up with something... ;-)

franz1981 commented 5 months ago

Let's talk this week in a call and we can think about something

mkouba commented 5 months ago

NOTE: This extension contains a bunch of lambdas. We should consider rewriting those lambdas to anonymous classes.
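
For illustration, a sketch of what such a rewrite looks like. The note does not state the motivation; one plausible reason (an assumption here) is that a lambda is linked through invokedynamic on first use, while an anonymous class is compiled to a regular class at build time. The handler below is hypothetical and not taken from the extension.

```java
import java.util.function.Function;

// Hypothetical example; the handlers are not taken from the extension.
final class LambdaVsAnonymous {

    // Lambda form: linked lazily through invokedynamic on first use.
    static final Function<String, String> LAMBDA = message -> message.toLowerCase();

    // Equivalent anonymous class: compiled to a regular class file at build time.
    static final Function<String, String> ANONYMOUS = new Function<String, String>() {
        @Override
        public String apply(String message) {
            return message.toLowerCase();
        }
    };
}
```

Both forms behave the same for callers; only the way the JVM creates the function object differs.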

cescoffier commented 1 month ago

We need to think about scenarios to test performance. Response time is not a meaningful metric. The number of messages and connections is more relevant in this case. (Of course, memory is important too.)

franz1981 commented 1 month ago

Yeah, although if you achieve 10K msg/sec with a single outlier of 10 seconds of latency, that is something you'd want to know.

I have contacted the Jetty team (Simone Bordet): they have rolled out a coordinated-omission-free, distributed (if required) load generator for websocket. Let's see what we can do in Hyperfoil, or by reusing that tool, which is built for this exact purpose.
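
To make the coordinated-omission point concrete, here is a small deterministic simulation. It assumes a closed-loop client that intends to send one request per interval but waits for the previous response before sending the next one. All names and numbers are made up for illustration.

```java
// Hypothetical simulation; all numbers are made up for illustration.
final class CoordinatedOmissionDemo {

    /**
     * Counts requests whose latency exceeds thresholdMs. Latency is measured either
     * from the actual send time (naive) or from the intended send time
     * (coordinated-omission free).
     */
    static int countSlowerThan(long[] serviceTimesMs, long intervalMs, long thresholdMs,
            boolean fromIntendedStart) {
        long now = 0;
        int slow = 0;
        for (int i = 0; i < serviceTimesMs.length; i++) {
            long intended = i * intervalMs;
            // The send is delayed if the previous call is still running.
            long actualSend = Math.max(now, intended);
            long done = actualSend + serviceTimesMs[i];
            long latency = done - (fromIntendedStart ? intended : actualSend);
            if (latency > thresholdMs) {
                slow++;
            }
            now = done;
        }
        return slow;
    }

    public static void main(String[] args) {
        // One 10-second stall followed by fast responses, with a 100 ms send interval.
        long[] serviceTimes = {1, 10_000, 1, 1, 1};
        System.out.println("naive slow requests: " + countSlowerThan(serviceTimes, 100, 1_000, false));
        System.out.println("corrected slow requests: " + countSlowerThan(serviceTimes, 100, 1_000, true));
    }
}
```

The naive measurement reports a single slow request, while measuring from the intended send time also counts the requests that were held back by the stall.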

cescoffier commented 1 month ago

Oh, that would be good!

franz1981 commented 1 month ago

And clearly it is not covering websocket, see https://github.com/jetty-project/jetty-load-generator :"(

Which means that we should prioritize supporting websockets for Hyperfoil or find a different benchmarking tool (which is coordinated-omission free - not easy)

mkouba commented 1 month ago

Which means that we should prioritize supporting websockets for Hyperfoil or find a different benchmarking tool (which is coordinated-omission free - not easy)

For load tests where we don't care about coordinated-omission and throughput, we could try to use the Gatling WebSocket protocol or even a simple Vert.x client.

cescoffier commented 1 month ago

I used Gatling in the past.

mkouba commented 1 month ago

So I started to play with a simple Vertx client, Gatling, etc. here: https://github.com/mkouba/ws-next-perf.

And it seems that at moderate load the performance of quarkus-websockets and quarkus-websockets-next is more or less the same. However, under heavy load (in my test it was 10,000 concurrent users sending 1000 and receiving 1000 messages) the performance of quarkus-websockets-next degrades significantly. I did some CPU profiling and I didn't find anything obvious in the WS Next code.

Apparently the biggest problem is switching to a worker thread because the tested @OnTextMessage callback has a blocking signature. If we switch to Uni<String> (i.e. callback executed on the event loop) then the performance is significantly better but still not better than the legacy extension. However, the blocking signature is probably what most users will use anyway...

@franz1981 Could you pls take a look at the attached flamegraph?

cpu-profile.html.zip

franz1981 commented 1 month ago

I see there's a fun problem with synchronizers, which can hurt scalability and performance pretty badly (and RSS too, because "inflated" monitors increase RSS on HotSpot), i.e. io/vertx/core/http/impl/ServerWebSocketImpl.tryHandshake (and io/vertx/core/net/impl/ConnectionBase.queueForWrite as well), which protects the handshake via a synchronized guard. You can confirm this by collecting profiling data using -e lock -t (add -t as well to see which threads are competing to enter the lock). The suggestion here is to have fewer worker threads competing with each other, but they will likely still compete with the I/O threads (or not: we need the profiling via -e lock -t). I believe the check performed in io/vertx/core/http/impl/ServerWebSocketImpl.tryHandshake can be improved via some volatile guard that avoids the lock, and we will fly. I can create a specific vertx microbenchmark for this in vertx itself
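
A rough sketch of the "volatile guard" idea: once the handshake has completed, callers read a volatile flag and skip the monitor entirely. The class below is hypothetical and only shows the pattern; it is not the Vert.x implementation, and the real handshake work is replaced by a counter.

```java
// Hypothetical class for illustration; this is not the Vert.x implementation.
final class GuardedHandshake {

    // Written under the lock, read without it on the fast path.
    private volatile boolean handshakeDone;
    private int handshakeCount;

    /** Enters the monitor only until the handshake has completed. */
    void tryHandshake() {
        if (handshakeDone) {
            return; // fast path: a volatile read, no monitor enter
        }
        synchronized (this) {
            if (!handshakeDone) { // re-check under the lock
                handshakeCount++; // stands in for the real handshake work
                handshakeDone = true;
            }
        }
    }

    /** Number of times the handshake work actually ran. */
    synchronized int handshakeCount() {
        return handshakeCount;
    }
}
```

With this shape, worker threads contend on the monitor only during the short window before the flag is set.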

Important note: I'm at Devoxx and haven't looked at the benchmark itself yet, but two things after looking at the data:

1. collect profiling data only after the application has warmed up: I can see C2 frames, meaning that compilation is still going on (after ~10K or fewer invocations things will smooth out)
2. it looks like it is intentionally a CPU-bound computation: is that what we expect? I would, instead, add a parametrized fake blocking call (Thread::sleep(configuredFakeBlockingWork)) to perform some really blocking behaviour when we run things on the worker thread pool - this will make it more realistic.

The last point is key: if users are making use of the worker thread pool, they are supposed to perform blocking operations (in the form of 10/100 ms of work each). This guarantees two effects:

less contention over the synchronized part - likely
less "oversubscription" of worker threads vs the available cores, i.e. the worker thread pool has Math::min(cores * 8, 200) threads, which means they need to interleave to make progress. If you make them CPU bound, performing little non-blocking work, this can stress a LOT of other OS mechanisms which won't be stressed in the real world
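
The parametrized fake blocking call could look roughly like this. The helper name, the pool size of 4 and the 10 ms delay are assumptions chosen only for the example; a real benchmark would read them from configuration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical benchmark helper; names and values are assumptions.
final class FakeBlockingWork {

    /** Simulates a blocking call of configurable duration, then does the cheap CPU work. */
    static String handle(String message, long fakeBlockingWorkMs) throws InterruptedException {
        Thread.sleep(fakeBlockingWorkMs);
        return message.toLowerCase();
    }

    public static void main(String[] args) throws Exception {
        // Arbitrary pool size for the example; it stands in for the worker thread pool.
        ExecutorService workers = Executors.newFixedThreadPool(4);
        try {
            Future<String> reply = workers.submit(() -> handle("HELLO", 10));
            System.out.println(reply.get());
        } finally {
            workers.shutdown();
        }
    }
}
```

Setting the delay to zero would reproduce the current CPU-bound scenario, so both cases can share one code path.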

As usual, I love that you're so proactive and quick to react @mkouba, thanks again for taking both the time and effort for the test and for collecting data: this makes it so much easier for me to help! ❤

franz1981 commented 1 month ago

In addition, this is another low-hanging fruit I can help with:

https://github.com/eclipse-vertx/vert.x/blob/916ae9911dbb2a8cf818eee6b5390f62f37fce00/vertx-core/src/main/java/io/vertx/core/http/impl/WebSocketImplBase.java#L478-L493

But I need to check it more carefully: I'm adding this note for my future self

mkouba commented 1 month ago

I see there's a fun problem with synchronizers, which can hurt scalability and performance pretty badly (and RSS too, because "inflated" monitors increase RSS on HotSpot), i.e. io/vertx/core/http/impl/ServerWebSocketImpl.tryHandshake (and io/vertx/core/net/impl/ConnectionBase.queueForWrite as well), which protects the handshake via a synchronized guard.

Yes, I noticed this part as well.

I can create a specific vertx microbenchmark for this in vertx itself

That would be great.

1. collect profiling data only after the application has warmed up: I can see C2 frames, meaning that compilation is still going on (after ~10K or fewer invocations things will smooth out)

cpu-profile.html_02.zip

2. it looks like it is intentionally a CPU-bound computation: is that what we expect? I would, instead, add a parametrized fake blocking call (Thread::sleep(configuredFakeBlockingWork)) to perform some really blocking behaviour when we run things on the worker thread pool - this will make it more realistic.

It depends. I don't think that all callbacks with a blocking signature will execute code that would block the thread. But for sure, we need more scenarios that would cover all common use cases. Currently, we only call String.toLowerCase() 🤷 .

Thanks Franz!

mkouba commented 1 month ago

FYI I've just noticed the following sentence in the javadoc of io.vertx.core.http.impl.ServerWebSocketImpl: "This class is optimised for performance when used on the same event loop. However it can be used safely from other threads.".

And also "The internal state is protected using the synchronized keyword. If always used on the same event loop, then we benefit from biased locking which makes the overhead of synchronized near zero.".

So obviously, it's not optimized for the blocking use case ;-).

franz1981 commented 1 month ago

@mkouba Yep, and ideally this could be improved in Vertx 5, but there is still some low-hanging fruit in Vertx 4, which we can easily explore if it is worth it, i.e. https://github.com/franz1981/vert.x/commit/3ca72f8ee5aaaa3aebd06791a7a76c60c99a5223 If you want to try this or apply this commit to the right vertx branch you could give it a shot in your benchmark

What it does is fairly simple, and it is based on the analysis I performed for https://github.com/franz1981/java-puzzles/blob/583d468a58a6ecaa5e7c7c300895392638f688dd/src/main/java/red/hat/puzzles/concurrent/LockCoarsening.java#L76-L85 which is the motivation behind the vertx 5 changes in this regard.

vietj commented 1 month ago

FYI : this part in Vertx 5 has been rewritten, so this analysis does not hold for it

mkouba commented 1 month ago

If you want to try this or apply this commit to the right vertx branch you could give it a shot in your benchmark

Unfortunately, it does not seem to be an easy task to switch the vertx-core version used in Quarkus. You cannot simply change the vertx.version in the BOM because it's used for other Vert.x dependencies (vertx-web, etc.). You cannot set an explicit version in the quarkus-vertx runtime because you get dependency convergence errors.

@cescoffier @vietj Any tips on how to try this out?

mkouba commented 1 month ago

FYI : this part in Vertx 5 has been rewritten, so this analysis does not hold for it

Hey Julien, do you have some benchmarks in Vert.x to test the performance of WebSockets server/client?

franz1981 commented 1 month ago

Unfortunately, it does not seem to be an easy task to switch the vertx-core version used in Quarkus.

What I would do is cherry-pick the commit onto the right vertx tag, run mvn install, and either replace the jar in the lib of quarkus or hope that the local mvn repo will do the right thing(TM)

I have found another good improvement to fix the buffer copies too - which I can send to vertx 5 regardless

mkouba commented 1 month ago

Unfortunately, it does not seem to be an easy task to switch the vertx-core version used in Quarkus.

What I would do is cherry-pick the commit onto the right vertx tag, run mvn install, and either replace the jar in the lib of quarkus or hope that the local mvn repo will do the right thing(TM)

Ah, ofc. This worked. And quick and dirty results seem to be much better, comparable to quarkus-websockets.

franz1981 commented 1 month ago

@mkouba OK, so this seems a painless change if @vietj and @cescoffier agree and you see benefits. I spent some time analysing the weird synchronized behaviour with the vertx code pattern so, sadly, these "workarounds" can be very effective

cescoffier commented 1 month ago

Do you have a link to the commit to cherry-pick?

mkouba commented 1 month ago

Do you have a link to the commit to cherry-pick?

@cescoffier https://github.com/franz1981/vert.x/commit/3ca72f8ee5aaaa3aebd06791a7a76c60c99a5223

cescoffier commented 1 month ago

The commit looks good. It avoids entering synchronized blocks.

I'm not sure of the various assertions.

Let's see what @vietj says.

mkouba commented 1 month ago

The committee looks good. It avoids entering synchronized blocks.

I'm not sure of the various assertions.

Let's see what @vietj says.

@cescoffier What committee? 😆

franz1981 commented 1 month ago

Yep @cescoffier, the assert checks should be enabled in both the quarkus and vertx maven surefire tests to make sure the new methods are not misused, while still not impacting the hot path at runtime (asserts are fully removed)
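
As an illustration of this kind of assert-only check, a method can document and verify its locking precondition with an assert, which costs nothing when the JVM runs without -ea. The class below is hypothetical; the actual assertions in the linked commit may differ.

```java
// Hypothetical class; the actual assertions in the linked commit may differ.
final class UnsafeAccessor {

    private final Object lock = new Object();
    private int state;

    /** Must only be called while holding the lock; verified only when assertions are enabled. */
    int stateUnsafe() {
        assert Thread.holdsLock(lock) : "caller must hold the lock";
        return state;
    }

    /** Safe entry point that acquires the lock before delegating. */
    int state() {
        synchronized (lock) {
            return stateUnsafe();
        }
    }

    void increment() {
        synchronized (lock) {
            state++;
        }
    }
}
```

Running the test suites with assertions enabled would catch a caller that invokes stateUnsafe() without the lock.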

franz1981 commented 1 month ago

I have created https://github.com/franz1981/vert.x/commit/9a0f5168bec041ba66811e82867e389a96f84449 to fix the buffer problem mentioned a few comments earlier too

mkouba commented 1 month ago

FYI I'm working on a pull request to disable CDI request context activation for endpoint callbacks unless really needed, i.e. an endpoint has a @RequestScoped dependency or is secured.

quarkusio / quarkus

WebSockets Next: performance improvements #39148
