Closed: chevaris closed this issue 1 year ago
cc @vietj @cescoffier
More data from our benchmark. Our microservice receives POST messages (JSON) and answers back with JSON bodies.
The most significant part is that HTTP/2 requires around 18% more CPU to do the same job.
| | HTTP 1.1 | HTTP/2 |
|---|---|---|
| CPU (mcores) | 1850 | 2200 |
| Latency (msecs) | 6 | 6.2 |
| Client conns | 2080 | 20 |
NOTES:
I will try to profile the code with JFR to check whether there is any obvious bottleneck in the HTTP/2 implementation.
Not sure if you have any benchmarks aligned with my results.
JFR is complaining about a very high number of exceptions per second (around 8,000/sec). Most of them come from Vert.x/Quarkus. This happens in the code that closes the stream (once per request/response pair). JFR suggests avoiding exceptions for this, because it is more expensive. What do you think? Together with the stream-closing metric that Quarkus tags as REST and CLIENT_ERR, this could also be improved in case it is the main bottleneck.
Thanks a lot for the analysis!
We'll definitely need input from @vietj here.
I have also the JFR file if needed to get more info
I discussed this issue with @vietj. Julien will look at how we can avoid the extra flush in this case. Note that the current behavior is correct.
Thanks. My main concern is NOT correctness, it is performance. As mentioned in our previous benchmark, HTTP/2 needs 18% more CPU than HTTP 1.1 in our use case, and I was looking for differences that could explain that.
I am only guessing because I do NOT know the implementation, BUT I started a preliminary analysis and found the extra message (at least Quarkus HTTP 1.1 behaves differently). What probably has a major impact is that closing HTTP/2 streams is handled as an exceptional case (JFR complains about massive numbers of exceptions), and this is usually less performant (e.g. exceptions usually capture stack traces, which can be expensive at high volumes). Maybe I am totally wrong, BUT I wanted to share.
These exceptions do not have stacktraces, so they should be fine.
Thanks a lot Clement. Not sure if there is any way I could help. Just let me know
Maybe I could benchmark a pre-release or something similar.
hi @chevaris, regarding the exceptions concern: JFR doesn't know that the exception being raised isn't populating the stack trace...
Throwable::fillInStackTrace
isn't called in https://github.com/eclipse-vertx/vert.x/blob/2f6220a0c080cb0c76103fdd9ee5775d8898c368/src/main/java/io/vertx/core/impl/NoStackTraceThrowable.java#L20
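For illustration, here is a minimal sketch (a hypothetical class, not the actual Vert.x source) of the pattern `NoStackTraceThrowable` relies on: passing `writableStackTrace = false` to the `Throwable` constructor means `fillInStackTrace` is never invoked, so raising such an exception is cheap even at thousands per second.

```java
// Hedged sketch of the "no stack trace" exception pattern used by
// io.vertx.core.impl.NoStackTraceThrowable (this class name is made up).
// With writableStackTrace = false, Throwable::fillInStackTrace is never
// invoked, so creating and throwing the exception skips the stack walk.
class NoStackTraceException extends RuntimeException {
    NoStackTraceException(String message) {
        // enableSuppression = false, writableStackTrace = false
        super(message, null, false, false);
    }
}
```

JFR still records each throw, which may explain why the exception count looks alarming even though each individual throw is cheap, matching @franz1981's point above.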
Thanks @franz1981 for the clarification.
I saw that vertx 4.4.5 is already released and includes https://github.com/eclipse-vertx/vert.x/pull/4775 that I assume should improve HTTP2 performance. Once there is a Quarkus version using vertx 4.4.5 I can repeat the benchmark again.
Is that OK?
Yes, we are working on the integration of Vert.x 4.4.5 at this very moment. Expect something on Monday.
Out of date.
Thanks a lot for the improvement.
Currently I do NOT have access to the HW, BUT as soon as I can get it I will benchmark the microservice as I did before and provide data comparing HTTP 1.1 and HTTP/2.
@chevaris I am eagerly awaiting some fresh benchmarks now that 3.4.3 was released today.
I found this issue today after benchmarking HTTP/1 vs HTTP/2 in our Quarkus app (2.16.6.Final). It took us by surprise that HTTP/2 resulted in less throughput overall. 🤞 that 3.4.3 improves things. We'll be benchmarking again on Monday and I will report back if you have not already... if the upgrade to 3.4.3 is trivial.
Any news @cjbooms ?
I don't have a public test harness to share, but below are our internal results with v2/v3 Quarkus and HTTP 1.1/HTTP 2.
Quarkus Version | HTTP Version | rps | p50 | p95 |
---|---|---|---|---|
v3 | http2 | 240 | 73 | 100 |
v3 | http1.1 | 300 | 58 | 72 |
v2 | http2 | 335 | 52 | 64 |
v2 | http1.1 | 300 | 58 | 72 |
Clear winner is v2, http2. Not sure why, but http2 appears to have degraded in v3... Both versions of quarkus agree on http1.1 speeds.
Notes:
Thanks @cjbooms. If you could collect some flame graphs made with async-profiler for the two versions, I could quickly find out what's going on (you can use the jfrsync option too, which produces a single JFR file from which I can extract several different profiling events). Let me know if I can help you set anything up, which will dramatically reduce the time to investigate... Consider that HTTP 1.1's speed is kind of our own fault: previously it was slower than HTTP/2, but we focused our efforts on improving the Netty decoding path up through Vert.x and... that's the result :P
Yes, but it will be a while. We won't be picking up this topic again until after Cyber Week.
Sorry for taking so long to answer back.
My benchmark shows different results, aligned with the issue reported in this topic. When using HTTP/2 the Quarkus server uses a significant amount of extra CPU compared with HTTP 1.1 (approximately 15-17% more) and latencies are worse. I am intrigued by your results and why my application diverges so much when sending traffic over HTTP 1.1 vs HTTP/2 (server not restarted, JVM properly warmed up).
Quarkus version: 3.6.3, OpenJDK 17.0.9. Benchmark tool: Hyperfoil 0.24-2 (also tried Hyperfoil 0.25-2). Benchmark test: constant rate, 4000 reqs/sec (POST requests in this case). Intel Core i9 with Manjaro (6.1.68-1-MANJARO #1 SMP PREEMPT_DYNAMIC Thu Dec 14 00:46:56 UTC 2023 x86_64 GNU/Linux).
Benchmark running for 3 mins for each config (several warmup rounds).
HTTP/2 (10 connections / max 100 streams per connection), all operations answered with 2xx: mean latency 967 µs, P50 856 µs, P99 1.50 ms, Quarkus CPU usage 1.30 cores.
HTTP 1.1 (100 connections), all operations answered with 2xx: mean latency 923 µs, P50 819 µs, P99 2.49 ms, Quarkus CPU usage 1.12 cores.
Which benchmark tool are you using? Can you elaborate on the kind of operations, latencies, etc.? I have tried other configs in terms of number of connections, streams per connection, etc., and HTTP 1.1 always outperforms the HTTP/2 implementation in my benchmark.
At least in my recent experience, the Vert.x HTTP/2 stack is less efficient than HTTP 1.1. I have been using the vertx-http-proxy (https://vertx.io/docs/vertx-http-proxy/java/) module lately, and when the proxy's HTTP client uses HTTP/2 the results are also significantly worse than with HTTP 1.1 (in this case the latencies degrade heavily compared with HTTP 1.1).
Thanks,
Evaristo
It's difficult to compare the two protocols this way. Try to use a single I/O thread (and no blocking thread pool in the request path), configure the same number of physical connections for both, and measure the peak throughput for both, to see what the maximum capacity is.
You can then try increasing the number of streams, but beware: by definition this is prone to queuing effects, because the streams will always be served from the same connection on the same I/O thread.
Vert.x and Netty, without any specific configuration, round-robin assign physical connections among the available I/O threads, while streams are served from the same physical connection.
And beware (i.e. I haven't really checked what the Quarkus configuration is for the number of I/O threads that can serve HTTP/2 - that's why I suggest avoiding any quirk related to it).
Other suggestions: verify whether all the configured connections are being used, and how much, too.
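As a concrete sketch of the single-I/O-thread suggestion above (the property name is taken from the Quarkus Vert.x configuration reference; treat it as an assumption to verify against your Quarkus version):

```properties
# application.properties (sketch): force a single Vert.x event loop so that
# both HTTP 1.1 and HTTP/2 traffic is served from one I/O thread
quarkus.vertx.event-loops-pool-size=1
```

With one event loop, per-connection core assignment can no longer skew the comparison between the two protocols.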
This trivial Vert.x app (taken from https://vertx.io/docs/vertx-web-proxy/java/) uses more CPU and has worse latencies at any number of TPS with HTTP/2 than with HTTP 1.1 (at least with Hyperfoil):

```java
HttpServer backendServer = vertx.createHttpServer();

Router backendRouter = Router.router(vertx);
backendRouter.route(HttpMethod.GET, "/foo").handler(rc -> {
  rc.response()
    .putHeader("content-type", "text/html")
    .end("hello"); // placeholder body: the original response content was truncated in this report
});

backendServer.requestHandler(backendRouter).listen(7070);
```
Please try constraining the number of Vert.x cores to one and use the same number of physical connections for both protocols. If you prefer to use the default number of cores, check how the load is distributed among them (in terms of per-core CPU usage, instead of capturing the overall CPU consumption). I am not complaining about your results, but I suggest adding other observations to your test to help understand whether this is an inherent limit of the way Vert.x/Netty implement the two protocols, some "default configuration quirk" of Vert.x/Quarkus, or just the result of wrong assumptions (or both) about how they should behave under what look to be similar conditions.
And please provide the Hyperfoil YAML to replicate the test, so we can be sure we perform the same test as you.
It would be highly appreciated if you could collect profiling data using async-profiler, possibly using the -t option.
Adding @vietj in case he has something to share.
I still do NOT understand the benchmark that was referred to in this ticket to say that HTTP/2 performance is better than HTTP 1.1. I do NOT doubt that in your benchmark HTTP/2 is better, BUT it is NOT clear to me what you are testing (operations, connections, latencies, etc.) and from what angle you are saying that. Could you clarify the use case, the units in the table, etc.? I gather that you are using around 300 requests per second with latencies around 50-100 msecs. Am I right? Could you share your benchmark files and the approximate size of the responses? I assume that you are using really huge documents, or a very small amount of HW for the benchmark, to get the figures in the table.
My use case is very simple: 2 microservices communicating over HTTP REST APIs. A very simple request/response protocol (not like a browser with CSS, images, JavaScript, etc.). Requests are POSTs with very small JSON bodies, and responses are JSON around 4 KB.
Regarding your suggestions: why use the same number of connections?
I do NOT think that HTTP 1.1 with pipelining is the right choice for sending requests NOT related to each other, due to the ordering required by HTTP 1.1 pipelining (it could make sense for browsers, BUT most browsers actually use a pool of HTTP 1.1 connections). Anyhow, I tested it and the results are also better than HTTP/2.
The benchmark I am running shows that, in order to communicate with a Quarkus micro, it is more efficient to use a big enough pool of HTTP 1.1 connections than a smaller pool of fat HTTP/2 connections. I tried multiple combinations of HTTP/2 streams and numbers of connections without any success (to rule out that TCP flow control could be involved). Here, more efficient means fewer replicas of the microservice are needed to handle the same amount of load (and on top of that, latencies are better).
I already reported the results from the profiling I did (with my limited capability, and considering that I do NOT know the code) and I reported 3 things:
This is the hyperfoil file I used for HTTP/2
```yaml
name: chevaConstantRate
threads: 2
http:
  host: http://localhost:8080
  sharedConnections: 10
  allowHttp1x: false
  maxHttp2Streams: 100
ergonomics:
  # Disable stopping the scenario on 4xx or 5xx response
  autoRangeCheck: false
phases:
```
For HTTP 1.1 I replaced those settings with:

```yaml
sharedConnections: 100
allowHttp1x: true
```

I also tried HTTP 1.1 with pipelining:

```yaml
sharedConnections: 10
allowHttp1x: true
pipeliningLimit: 100
```
Summary of the results:
Quarkus version: 3.6.3, OpenJDK 17.0.9. 2 event loops (Quarkus by default uses 2 or more event loops). Benchmark tool: Hyperfoil 0.24-2 (also tried Hyperfoil 0.25-2). Benchmark test: constant rate, 4000 reqs/sec (POST requests in this case). Intel Core i9 with Manjaro (6.1.68-1-MANJARO #1 SMP PREEMPT_DYNAMIC Thu Dec 14 00:46:56 UTC 2023 x86_64 GNU/Linux).
Benchmark running for 3 mins for each config (several warmup rounds).
HTTP/2 (10 connections / max 100 streams per connection), all operations answered with 2xx: mean latency 967 µs, P50 856 µs, P99 1.50 ms, Quarkus CPU usage 1.30 cores.
HTTP 1.1 (100 connections), all operations answered with 2xx: mean latency 923 µs, P50 819 µs, P99 2.49 ms, Quarkus CPU usage 1.12 cores.
HTTP 1.1 with pipelining (10 connections / pipelining limit 100 per connection), very similar results to plain HTTP 1.1: mean latency 910 µs, P50 796 µs, P99 2.44 ms, Quarkus CPU usage 1.12 cores.
Summary: in my use case, HTTP 1.1 with pipelining is better than HTTP/2 (still NOT recommended, because a single heavy call will delay other calls), and HTTP 1.1 with a bigger pool of connections is better than HTTP/2.
Support for HTTP 1.1 pipelining in Hyperfoil is sadly broken (I have yet to fix it, given that I am a project committer), hence I suggest ignoring those results.
Related:

> I still do NOT understand the benchmark that was referred to in this ticket to say that HTTP/2 performance is better than HTTP 1.1
The results from @cjbooms seem to agree that they have degraded performance in Quarkus v3, to the point that the HTTP 1.1 performance (rps) is better than HTTP/2 (300 vs 240), which doesn't seem to disagree with your numbers: HTTP/2 isn't faster in v3. I agree anyway that the use cases could be very different and not comparable.
> Regarding your suggestions: why use the same number of connections?
Because of the way Netty handles parallelism/concurrency with streams vs physical connections, and the way head-of-line (HOL) blocking can bite the streams when a single response isn't sent in one go, causing others to be queued up. The more physical connections, the more real concurrency exists, unless Netty can chunk responses, allowing them to interleave. The reason I was asking about the number of cores, and suggesting constraining it to one, was to rule out problems with HTTP/2 physical connections not being correctly assigned to different physical cores, which grants some parallelism. HTTP 1.1 by default always does this, while with HTTP/2 I am not sure (meaning: I don't know).
Returning to the topic: I hope to have a look at your reproducer this week before Christmas and report any findings.
Please @cjbooms, could you create a new issue reporting just the comment about the HTTP/2 performance degradation compared to v2? I would like to keep these issues separate to avoid getting confused while looking at both.
> Support of pipelining for Hyperfoil in http 1.1 is sadly broken (I have to fix it yet, given that I am a project committer).
I saw you in some tickets. It is a great tool !!!!
Regarding the exception performance issue, I agree it indeed needs to be investigated, but looking at the past comments I see there is a change in Vert.x that should fix it: are you still observing the same behaviour?
Tested with Quarkus 3.32 and 3.6.3 (Both same behaviour and heavy amount of exceptions)
> It is a great tool !!!!
Thanks and happy you have used it!
> Tested with Quarkus 3.32 and 3.6.3 (Both same behaviour and heavy amount of exceptions)
I should check if the changes are in (likely) and what they were meant to solve
For a better comparison between HTTP/1 and H2, I think you should lower the maximum number of concurrent streams per connection, especially if you are using a small number of H2 connections (10); instead, increase the number of H2 connections and decrease the max number of concurrent streams, e.g. you could try 100 H2 connections with a max concurrent streams of 10.
A small number of connections (compared to the number of cores) will put more load on some cores than others; using more connections with a small max stream count tends to spread the load in a better way.
Of course, this is a recommendation for a benchmark.
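Concretely, this suggestion maps onto the Hyperfoil settings shared earlier in this thread roughly as follows (a sketch: only the two values change, the rest of the benchmark file stays as posted):

```yaml
# Sketch: more H2 connections, fewer concurrent streams per connection
http:
  host: http://localhost:8080
  sharedConnections: 100   # was 10
  allowHttp1x: false
  maxHttp2Streams: 10      # was 100
```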
@chevaris As mentioned in @vietj's comment: a fair comparison for such a synthetic case requires some test adjustments. In any case, I've investigated the difference in behavior between HTTP 1.1 and 2 for 2 types of "simple" test:
And it has shown a few low-hanging fruits, reported at https://github.com/quarkusio/quarkus/issues/37835, and others, more complex, reported at https://github.com/eclipse-vertx/vert.x/pull/5047.
Currently the vertx 4.x branch already contains https://github.com/eclipse-vertx/vert.x/commit/841d6fb23a70d42c03bad4bfb2c5941fca212ca0 and https://github.com/eclipse-vertx/vert.x/commit/48459044f218ff90f5a019849521d87ba6a260e4, which already address some evident cost due to the authority's validation, while other changes are still in progress in Netty (e.g. https://github.com/netty/netty/pull/13742) that improve header lookup performance, something that happens much more frequently than with HTTP 1.1.
Another low-hanging fruit (although still not obvious to fix) is the pseudo-header lookup and validation cost in Netty's HTTP/2, which seems to happen too much and too often in the HPACK decode path (i.e. https://github.com/netty/netty/blob/427ace4cdd86cce84e467ec0a8a81ca022ed6f72/codec-http2/src/main/java/io/netty/handler/codec/http2/HpackDecoder.java#L559-L561) and is visible in the captured profiles.
While the last 3 standing differences are:

- `io/netty/handler/codec/http2/DefaultHttp2Connection$DefaultEndpoint.createStream`
- `io/vertx/core/http/impl/Http2ServerConnection.createStream` (whose cost has already been halved by @vietj's changes mentioned above)
- `io/netty/channel/CoalescingBufferQueue.remove` (still under investigation)

This makes clear that the whole "stream" concept in HTTP/2 doesn't come for free and has its costs, especially for cases as simple as these, but clearly some of the overhead can be removed.
Generally speaking, the best we could improve directly within Vert.x (and hence, by consequence, for Quarkus) has already been done for everything that was detected as a problem.
If you have the chance to compile Vert.x 4.x and run the experiments you're used to running, you can verify that things are getting into better shape; it will take rolling a new release before the changes become visible to Quarkus, but it's a matter of time.
@chevaris @cjbooms update on this: I have further progressed in "fixing" the performance differences between HTTP 1.1 and 2 and found many other small/big changes, sent directly to Netty, e.g.
Some are already merged and others are in the process of being reviewed. Additionally, others relate to a deficiency in scaling, e.g. https://github.com/netty/netty/pull/13741
My take on HTTP/2 is that, under realistic and correct usage, it is a great protocol for reducing the required physical connections and improving network usage (thanks to HPACK caching/encoding), but in cases where:
It adds an inherent cost of managing the streams, including distributing their traffic fairly, coalescing writes and creating them in the hot path, which makes HTTP 1.1 just faster at its peak performance. This is especially true with pipelining, which in HTTP 1.1 doesn't have any infrastructure to handle concurrency, making it naturally prone to head-of-line problems, but able to maximize throughput.
This has been a surprising fact to me, but it is what it is. That said, we have addressed most of the evident (and less evident) inefficiencies we have found (and I have another couple in flight), saving wasteful work and reducing (sometimes dramatically) CPU usage, but under the conditions mentioned above its peak performance won't be as good as HTTP 1.1's.
Just adding this, but take it with a grain of salt: the overall improvement in peak CPU saving has been around 35-40% after applying all fixes to Quarkus.
Really, thanks a lot for the very detailed work on this and the support!!!!
I think you made a very good summary, and as you commented, in some cases the stream handling can be less performant than using extra connections. I actually got better results by decreasing the number of streams and using more connections, as suggested here.
I think it is really very good to see all the improvements coming, making the Vert.x/Quarkus HTTP/2 stack even better (it is already great compared with other options). The more I use it, the more I like it.
Description
BACKGROUND: I have implemented a Quarkus-based microservice that is targeted to replace a Spring Boot implementation.
The microservice receives POST (JSON) requests and answers with JSON.
LIMITATION WITH HTTP/2: We have observed that latencies when using HTTP/2 are worse than when using HTTP 1.1 (approx. 0.5 msecs more per request). CPU usage is also higher (between 5-10%). Obviously with HTTP/2 the number of connections needed to sustain the same throughput is much lower (multiplexing). This is NOT happening in the Spring (Jetty) implementation, in which HTTP/2 latencies are approximately the same as with Spring Boot HTTP 1.1.
GOAL OF THIS TICKET: The purpose of this ticket is to check why HTTP/2 latencies are worse (at least in microservices with long-living connections) compared with HTTP 1.1 and to provide a fix.
INITIAL ANALYSIS (IN CASE IT COULD HELP): After some analysis we have found a difference in Quarkus HTTP/2 compared with Quarkus HTTP 1.1 or Spring Jetty HTTP/2 that could explain the performance drop (worse latency).
We have captured packets for each implementation (images attached). This is the result: Quarkus HTTP/2 uses one extra message compared with the other implementations. Any reason for that? At least in this use case, I do NOT see the need to avoid sending headers and response data in the same packet. I can understand that streaming use cases could be different.
1.- Quarkus HTTP/2

```
Client  --- HTTP2/JSON POST HEADERS + DATA ---> Quarkus server
Quarkus --- HTTP2 HEADERS (200 OK) -----------> Client
Client  --- ACK ------------------------------>
Quarkus --- HTTP2/JSON DATA (END STREAM) -----> Client
Client  --- ACK ------------------------------>
```

2.- Quarkus HTTP/1.1

```
Client  --- HTTP1.1/JSON POST HEADERS + DATA -> Quarkus server
Quarkus --- HTTP1.1 HEADERS (200 OK) + DATA --> Client
Client  --- ACK ------------------------------>
```

3.- Spring (Jetty) HTTP/2

```
Client --- HTTP2/JSON POST HEADERS + DATA ------------> Spring server
Spring --- HTTP2 HEADERS (200 OK) + DATA (END STREAM) -> Client
Client --- ACK ---------------------------------------->
```
Reproducer to check network packages
code-with-quarkus.zip
Send traffic with curl, wrk, or Hyperfoil, and capture with Wireshark:

```shell
curl -v --http2 -d '{"name": "juan"}' -H "Content-Type: application/json" -X POST http://localhost:8080/hello
```
Implementation ideas
No response