mkouba opened 8 months ago
Pull request https://github.com/quarkusio/quarkus/pull/40183 is related to this issue.
For this one, we'll need some benchmarks (executed automatically). Ideally, we'd compare WS next with quarkus-websockets and pure Vert.x.
CC @franz1981
I'm not aware of websocket benchmark suites sadly...
Me neither. We'll need to come up with something... ;-)
Let's talk this week in a call and we can think about something
NOTE: This extension contains a bunch of lambdas. We should consider rewriting those lambdas to anonymous classes.
We need to think about scenarios to test the performance. Response time is not a meaningful metric; the number of messages and connections are more sensitive in this case. (Of course, memory is important too.)
Yeah, although if you can achieve 10K msg/sec with a single outlier with 10 seconds of latency - that is something you want to know.
I have contacted the Jetty team (Simone Bordet) - they have rolled out a coordinated-omission-free, distributed (if required) load generator for websocket; let's see what we can do in Hyperfoil, or by reusing such a tool, since it is built for this exact purpose.
oh, that would be good!
And clearly it is not covering websocket, see https://github.com/jetty-project/jetty-load-generator :"(
Which means that we should prioritize supporting websockets for Hyperfoil or find a different benchmarking tool (which is coordinated-omission free - not easy)
For load tests where we don't care about coordinated-omission and throughput, we could try to use the Gatling WebSocket protocol or even a simple Vert.x client.
I used Gatling in the past.
So I started to play with a simple Vert.x client, Gatling, etc. here: https://github.com/mkouba/ws-next-perf. And it seems that at moderate load the performance of quarkus-websockets and quarkus-websockets-next is more or less the same. However, under heavy load (in my test it was 10,000 concurrent users, each sending 1,000 and receiving 1,000 messages) the performance of quarkus-websockets-next degrades significantly. I did some CPU profiling and I didn't find anything obvious in the WS next code. Apparently the biggest problem is switching to a worker thread, because the tested @OnTextMessage callback has a blocking signature. If we switch to Uni<String> (i.e. the callback is executed on the event loop) then the performance is significantly better, but still not better than the legacy extension. However, the blocking signature is probably what most users will use anyway...
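For readers unfamiliar with the dispatch difference discussed here, it can be illustrated with plain java.util.concurrent (this is only an analogy; the class and method names below are made up and this is not the actual Quarkus or Vert.x API). A callback with a blocking signature forces a hand-off from the "event loop" thread to a worker pool, while the non-blocking variant runs directly on the loop:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class DispatchSketch {
    // "Event loop": a single thread that runs handlers directly.
    static final ExecutorService eventLoop = Executors.newSingleThreadExecutor();
    // Worker pool used for callbacks with a blocking signature.
    static final ExecutorService workers = Executors.newFixedThreadPool(4);

    // Non-blocking style: the handler runs right on the event loop thread.
    static CompletableFuture<String> onEventLoop(String msg) {
        return CompletableFuture.supplyAsync(() -> msg.toLowerCase(), eventLoop);
    }

    // Blocking style: the message is handed off to a worker thread first -
    // this extra hop (plus any contention) is the cost observed in the profile.
    static CompletableFuture<String> onWorker(String msg) {
        return CompletableFuture.supplyAsync(() -> msg.toLowerCase(), workers);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(onEventLoop("HELLO").get());
        System.out.println(onWorker("WORLD").get());
        eventLoop.shutdown();
        workers.shutdown();
    }
}
```

The point is not the toLowerCase() work itself but which thread executes it and how it got there.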
@franz1981 Could you pls take a look at the attached flamegraph?
I see there's a fun problem with synchronized, which can impact scalability and perf pretty badly (and RSS, because "inflated" monitors increase RSS on Hotspot), i.e. io/vertx/core/http/impl/ServerWebSocketImpl.tryHandshake (and io/vertx/core/net/impl/ConnectionBase.queueForWrite as well), which is protecting the handshake via a synchronized guard.
You can confirm this by collecting profiling data using -e lock -t (add t as well to see which threads are competing to enter the lock). The suggestion here is to have fewer worker threads competing among each other, but they will likely still compete vs the I/O threads (or not - we need the profiling via -e lock -t to tell).
I believe the check performed in io/vertx/core/http/impl/ServerWebSocketImpl.tryHandshake can be improved via some volatile guard - avoiding the synchronized entirely - and we will fly.
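The volatile-guard idea can be sketched like this (a minimal illustration assuming a one-shot handshake state; the class and method names are made up, this is not the actual Vert.x code). A plain volatile read serves the common already-handshaked case, and the synchronized slow path is only entered before the first completion:

```java
public class HandshakeGuard {
    // Volatile flag read on the hot path; once true, no monitor is entered.
    private volatile boolean handshakeDone;

    void ensureHandshake() {
        if (handshakeDone) {
            return; // fast path: a plain volatile read, no lock inflation
        }
        synchronized (this) {
            if (!handshakeDone) { // re-check under the lock
                performHandshake();
                handshakeDone = true; // publish only after the work is done
            }
        }
    }

    boolean isDone() {
        return handshakeDone;
    }

    private void performHandshake() {
        // the actual WebSocket handshake work would go here
    }
}
```

This is the classic double-checked pattern: correctness relies on the write to the volatile field happening last inside the synchronized block.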
I can create a specific vertx microbenchmark for this in vertx itself
As usual, I love that you're so proactive and quick to react @mkouba - thanks again for taking both the time and effort for the test + collecting data: this will make it so much easier for me to help! ❤
Important note: I'm at Devoxx and I didn't yet look at the bench itself, but 2 things after looking at the data:
1. Collect profiling data after waiting a bit for the warm-up of the application to complete: I can see C2 frames, meaning that compilation is still going on (after ~10K or fewer invocations things will smooth out).
2. It looks like it is intentionally a CPU-bound computation: is that what we expect? I would, instead, add a parametrized fake blocking call (Thread::sleep(configuredFakeBlockingWork)) to perform some really blocking behaviour when we run things on the worker thread pool - this will make it more realistic. This last point is key: if users are making use of the worker thread pool, they are supposed to perform blocking operations (in the form of 10/100 ms of work each), and this will guarantee 2 effects - likely around the synchronized part, but I gotta check better: I'm adding this note for myself of the future.
In addition, this is another low-hanging fruit I can help with:
I see there's a fun problem with synchronized, which can impact scalability and perf pretty badly (and RSS, because "inflated" monitors increase RSS on Hotspot), i.e. io/vertx/core/http/impl/ServerWebSocketImpl.tryHandshake (and io/vertx/core/net/impl/ConnectionBase.queueForWrite as well), which is protecting the handshake via a synchronized guard.
Yes, I noticed this part as well.
I can create a specific vertx microbenchmark for this in vertx itself
That would be great.
1. Collect profiling data after waiting a bit for the warm-up of the application to complete: I can see C2 frames, meaning that compilation is still going on (after ~10K or fewer invocations things will smooth out).
2. It looks like it is intentionally a CPU-bound computation: is that what we expect? I would, instead, add a parametrized fake blocking call (Thread::sleep(configuredFakeBlockingWork)) to perform some really blocking behaviour when we run things on the worker thread pool - this will make it more realistic.
It depends. I don't think that all callbacks with a blocking signature will execute code that would actually block the thread. But for sure, we need more scenarios to cover all common use cases. Currently, we only call String.toLowerCase() 🤷.
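The parametrized fake blocking call suggested above could look roughly like this (an illustrative sketch only; the configuredFakeBlockingWorkMs knob and the class name are made up for the example, not taken from the benchmark repo):

```java
import java.util.concurrent.TimeUnit;

public class FakeBlockingWork {
    // Hypothetical knob mirroring the suggestion above, settable via
    // a system property, e.g. -DfakeBlockingWorkMs=50.
    static long configuredFakeBlockingWorkMs = Long.getLong("fakeBlockingWorkMs", 10);

    // Simulates a genuinely blocking callback body (e.g. a JDBC call or
    // file I/O taking 10-100 ms) instead of a CPU-bound toLowerCase().
    static String onMessage(String msg) throws InterruptedException {
        TimeUnit.MILLISECONDS.sleep(configuredFakeBlockingWorkMs);
        return msg.toLowerCase();
    }

    public static void main(String[] args) throws Exception {
        long start = System.nanoTime();
        String out = onMessage("PING");
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        System.out.println(out + " after ~" + elapsedMs + " ms");
    }
}
```

Making the blocking time a parameter lets the benchmark sweep from near-zero (CPU-bound) to 100 ms (genuinely blocking) and observe how worker-pool sizing and lock contention behave at each point.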
Thanks Franz!
FYI I've just noticed the following sentence in the javadoc of io.vertx.core.http.impl.ServerWebSocketImpl: "This class is optimised for performance when used on the same event loop. However it can be used safely from other threads.".
And also "The internal state is protected using the synchronized keyword. If always used on the same event loop, then we benefit from biased locking which makes the overhead of synchronized near zero.".
So obviously, it's not optimized for the blocking use case ;-).
@mkouba Yep, and ideally this could be improved in Vert.x 5, but there is still some low-hanging fruit on Vert.x 4 - which we can easily explore if it is worth it, i.e. https://github.com/franz1981/vert.x/commit/3ca72f8ee5aaaa3aebd06791a7a76c60c99a5223. If you want to try this, or apply this commit to the right vertx branch, you could give it a shot in your benchmark.
What it is doing is fairly simple, and is based on the analysis I've performed for https://github.com/franz1981/java-puzzles/blob/583d468a58a6ecaa5e7c7c300895392638f688dd/src/main/java/red/hat/puzzles/concurrent/LockCoarsening.java#L76-L85, which is the motivation behind the Vert.x 5 changes in this regard.
FYI : this part in Vertx 5 has been rewritten, so this analysis does not hold for it
If you want to try this or apply this commit to the right vertx branch you could give it a shot in your benchmark
Unfortunately, it does not seem to be an easy task to switch the vertx-core version used in Quarkus. You cannot simply change the vertx.version in the BOM because it's used for other Vert.x dependencies (vertx-web, etc.). And you cannot set an explicit version in the quarkus-vertx runtime because you get dependency convergence errors.
@cescoffier @vietj Any tip how to try this out?
Hey Julien, do you have some benchmarks in Vert.x to test the performance of WebSockets server/client?
Unfortunately, it does not seem to be an easy task to switch the vertx-core version used in Quarkus.
What I would do is cherry-pick the commit onto the right vertx tag, run mvn install, and either replace the jar in the lib of quarkus or hope that the local mvn repo will do the right thing(TM).
I have found another good improvement to fix the buffer copies too - which I can send to vertx 5 regardless
Ah, ofc. This worked. And quick and dirty results seem to be much better, comparable to quarkus-websockets.
@mkouba OK, so this seems a painless change if @vietj and @cescoffier agree and you see benefits. I spent some time analysing the weird synchronized behaviour with the vertx code pattern so, sadly, these "workarounds" can be very effective.
Do you have a link to the commit to cherry-pick?
@cescoffier https://github.com/franz1981/vert.x/commit/3ca72f8ee5aaaa3aebd06791a7a76c60c99a5223
The commit looks good. It avoids entering synchronized blocks.
I'm not sure of the various assertions.
Let's see what @vietj says.
The committee looks good. It avoids entering synchronized blocks.
I'm not sure of the various assertions.
Let's see what @vietj says.
@cescoffier What committee? 😆
Yep @cescoffier, the checks on asserts should be enabled in both the quarkus and vertx maven surefire tests to make sure the new methods are not misused, while still not impacting the hot path at runtime (asserts are fully removed).
I have created https://github.com/franz1981/vert.x/commit/9a0f5168bec041ba66811e82867e389a96f84449 to fix the buffer problem seen a few comments earlier, too.
FYI I'm working on a pull request to disable CDI request context activation for endpoint callbacks unless really needed, i.e. when an endpoint has a @RequestScoped dependency or is secured.
Description
Follow-up of https://github.com/quarkusio/quarkus/pull/39142.
Implementation ideas
AtomicLongFieldUpdater in the ConcurrencyLimiter: https://github.com/quarkusio/quarkus/pull/39142#discussion_r1510855917
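A minimal sketch of what the AtomicLongFieldUpdater idea can look like (illustrative only; the Limiter class and its field names are made up here and are not the actual ConcurrencyLimiter code). Compared to holding an AtomicLong instance per limiter, a static field updater over a volatile long field saves one object per instance while keeping lock-free CAS updates:

```java
import java.util.concurrent.atomic.AtomicLongFieldUpdater;

public class Limiter {
    // One shared updater for all Limiter instances; the target field
    // must be a volatile long in this class.
    private static final AtomicLongFieldUpdater<Limiter> IN_FLIGHT =
            AtomicLongFieldUpdater.newUpdater(Limiter.class, "inFlight");

    private final long limit;
    // Mutated only through the static updater above: avoids allocating
    // an extra AtomicLong object per limiter instance.
    private volatile long inFlight;

    Limiter(long limit) {
        this.limit = limit;
    }

    boolean tryAcquire() {
        while (true) {
            long current = IN_FLIGHT.get(this);
            if (current >= limit) {
                return false; // over the limit: caller should queue or reject
            }
            if (IN_FLIGHT.compareAndSet(this, current, current + 1)) {
                return true;
            }
        }
    }

    void release() {
        IN_FLIGHT.decrementAndGet(this);
    }
}
```

The CAS retry loop is the standard pattern for bounded counters: a concurrent update simply forces one more iteration instead of taking a lock.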