Closed: chevaris closed this issue 1 year ago
cc @vietj @cescoffier
More data from our benchmark. Our microservice receives POST messages (JSON) and answers back with JSON bodies.
The most significant part is that HTTP/2 requires around 18% more CPU to do the same job.
| | HTTP 1.1 | HTTP/2 |
|---|---|---|
| CPU (mcores) | 1850 | 2200 |
| Latency (msecs) | 6 | 6.2 |
| Client conns | 2080 | 20 |
NOTES:
I will try to profile the code with JFR to check whether there is any obvious bottleneck in the HTTP/2 implementation.
Not sure if you have any benchmarks aligned with my results.
JFR is complaining about a very high number of exceptions per second (around 8,000/sec). Most of them come from Vert.x/Quarkus. This happens in the code that closes the stream (once per request/response pair). JFR suggests avoiding exceptions for this, because it is more expensive. What do you think? Together with the stream-closing metric that Quarkus tags as REST and CLIENT_ERR, this could also be improved in case it is the main bottleneck.
Thanks a lot for the analysis!
We'll definitely need input from @vietj here.
I have also the JFR file if needed to get more info
I discussed this issue with @vietj. Julien will look at how we can avoid the extra flush in this case. Note that the current behavior is correct.
Thanks. My main concern is NOT correctness, it is performance. As mentioned in our previous benchmark, HTTP/2 needs 18% more CPU than HTTP 1.1 in our use case, and I was looking for differences that could explain that.
I am only guessing because I do NOT know the implementation, BUT I started a preliminary analysis and found the extra message (at least Quarkus HTTP 1.1 behaves differently). What probably has a major impact is that closing HTTP/2 streams is handled as an exceptional case (JFR complains about massive numbers of exceptions), and this is usually less performant (e.g. exceptions usually capture stack traces, which can be expensive at high volumes). Maybe I am totally wrong, BUT I wanted to share.
These exceptions do not have stacktraces, so they should be fine.
Thanks a lot Clement. Not sure if there is any way I could help. Just let me know
Maybe I could benchmark a pre-release or something similar.
hi @chevaris, regarding the exceptions concern: JFR doesn't know that the exception being raised isn't populating the stack trace...
Throwable::fillInStackTrace
isn't called in https://github.com/eclipse-vertx/vert.x/blob/2f6220a0c080cb0c76103fdd9ee5775d8898c368/src/main/java/io/vertx/core/impl/NoStackTraceThrowable.java#L20
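For illustration, here is a minimal sketch (a hypothetical class, not the actual Vert.x source) of the pattern `NoStackTraceThrowable` relies on: passing `writableStackTrace = false` to the `Throwable` constructor means `fillInStackTrace` is never invoked, so raising such an exception is cheap even at thousands per second.

```java
// Hedged sketch of the "no stack trace" exception pattern used by
// io.vertx.core.impl.NoStackTraceThrowable (this class name is made up).
// With writableStackTrace = false, Throwable::fillInStackTrace is never
// invoked, so creating and throwing the exception skips the stack walk.
class NoStackTraceException extends RuntimeException {
    NoStackTraceException(String message) {
        // enableSuppression = false, writableStackTrace = false
        super(message, null, false, false);
    }
}
```

JFR still records each throw, which may explain why the exception count looks alarming even though each individual throw is cheap, matching @franz1981's point above.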
Thanks @franz1981 for the clarification.
I saw that vertx 4.4.5 is already released and includes https://github.com/eclipse-vertx/vert.x/pull/4775 that I assume should improve HTTP2 performance. Once there is a Quarkus version using vertx 4.4.5 I can repeat the benchmark again.
Is that OK?
Yes, we are working on the integration of Vert.x 4.4.5 at this very moment. Expect something on Monday.
Out of date.
Thanks a lot for the improvement.
Currently I do NOT have access to the HW, BUT as soon as I can get it I will benchmark the microservice as I did before and provide data comparing HTTP 1.1 and HTTP/2.
@chevaris I am eagerly awaiting some fresh benchmarks now that 3.4.3 was released today.
I found this issue today after benchmarking HTTP/1 vs HTTP/2 in our Quarkus app (2.16.6.Final). It took us by surprise that HTTP/2 resulted in less throughput overall. 🤞 that 3.4.3 improves things. We'll be benchmarking again on Monday and I will report back if you have not already... if the upgrade to 3.4.3 is trivial.
Any news @cjbooms ?
I don't have a public test harness to share, but below are our internal results with v2/v3 Quarkus and HTTP 1.1/HTTP 2.
Quarkus Version | HTTP Version | rps | p50 | p95 |
---|---|---|---|---|
v3 | http2 | 240 | 73 | 100 |
v3 | http1.1 | 300 | 58 | 72 |
v2 | http2 | 335 | 52 | 64 |
v2 | http1.1 | 300 | 58 | 72 |
Clear winner is v2, http2. Not sure why, but http2 appears to have degraded in v3... Both versions of quarkus agree on http1.1 speeds.
Notes:
Thanks @cjbooms. If you could collect some flame graphs made with async-profiler for the two versions, I could quickly find out what's going on (you can use the jfrsync option too, which produces a single JFR file from which I can extract several different profiling events). Let me know if I can help you set anything up, which will dramatically reduce the time to investigate... Consider that HTTP 1.1's speed is kind of our own fault: previously it was slower than HTTP/2, but we focused our efforts on improving the Netty decoding path up through Vert.x and... that's the result :P
Yes, but it will be a while. We won't be picking up this topic again until after Cyber Week.
Sorry for taking so long to answer back.
My benchmark shows different results, aligned with the issue reported in this topic. When using HTTP/2 the Quarkus server uses a significant amount of extra CPU compared with HTTP 1.1 (approximately 15-17% more) and latencies are worse. I am intrigued by your results and why my application diverges so much when sending traffic over HTTP 1.1 vs HTTP/2 (server not restarted, JVM properly warmed up).
Quarkus version: 3.6.3, OpenJDK 17.0.9. Benchmark tool: Hyperfoil 0.24-2 (also tried Hyperfoil 0.25-2). Benchmark test: constant rate, 4000 reqs/sec (POST requests in this case). Intel Core i9 with Manjaro (6.1.68-1-MANJARO #1 SMP PREEMPT_DYNAMIC Thu Dec 14 00:46:56 UTC 2023 x86_64 GNU/Linux).
Benchmark running for 3 mins for each config (several warmup rounds).
HTTP/2 (10 connections / max 100 streams per connection), all operations answered with 2xx: mean latency 967 µs, P50 856 µs, P99 1.50 ms, Quarkus CPU usage 1.30 cores.
HTTP 1.1 (100 connections), all operations answered with 2xx: mean latency 923 µs, P50 819 µs, P99 2.49 ms, Quarkus CPU usage 1.12 cores.
Which benchmark tool are you using? Can you elaborate on the kind of operations, latencies, etc.? I have tried other configs in terms of number of connections, streams per connection, etc., and HTTP 1.1 always outperforms the HTTP/2 implementation in my benchmark.
At least in my recent experience, the Vert.x HTTP/2 stack is less efficient than HTTP 1.1. I have been using the vertx-http-proxy (https://vertx.io/docs/vertx-http-proxy/java/) module lately, and when the proxy's HTTP client uses HTTP/2 the results are also significantly worse than with HTTP 1.1 (in this case the latencies degrade heavily compared with HTTP 1.1).
Thanks,
Evaristo
It's difficult to compare the two protocols this way. Try to use a single I/O thread (and no blocking thread pool in the request path), configure the same number of physical connections for both, and measure the peak throughput for both, to see what the maximum capacity is.
You can then try increasing the number of streams, but beware: by definition this is prone to queuing effects, because the streams will always be served from the same connection on the same I/O thread.
Vert.x and Netty, without any specific configuration, round-robin assign physical connections among the available I/O threads, while streams are served from the same physical connection.
And beware (i.e. I haven't really checked what the Quarkus configuration is for the number of I/O threads that can serve HTTP/2 - that's why I suggest avoiding any quirk related to it).
Other suggestions: verify whether all the configured connections are being used, and how much, too.
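As a concrete sketch of the single-I/O-thread suggestion above (the property name is taken from the Quarkus Vert.x configuration reference; treat it as an assumption to verify against your Quarkus version):

```properties
# application.properties (sketch): force a single Vert.x event loop so that
# both HTTP 1.1 and HTTP/2 traffic is served from one I/O thread
quarkus.vertx.event-loops-pool-size=1
```

With one event loop, per-connection core assignment can no longer skew the comparison between the two protocols.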
This trivial Vert.x app (taken from https://vertx.io/docs/vertx-web-proxy/java/) uses more CPU and has worse latencies at any number of TPS with HTTP/2 than with HTTP 1.1 (at least with Hyperfoil):

```java
HttpServer backendServer = vertx.createHttpServer();

Router backendRouter = Router.router(vertx);
backendRouter.route(HttpMethod.GET, "/foo").handler(rc -> {
  rc.response()
    .putHeader("content-type", "text/html")
    .end("hello"); // placeholder body: the original response content was truncated in this report
});

backendServer.requestHandler(backendRouter).listen(7070);
```
Please try constraining the number of Vert.x cores to one and use the same number of physical connections for both protocols. If you prefer to use the default number of cores, check how the load is distributed among them (in terms of per-core CPU usage, instead of capturing the overall CPU consumption). I am not complaining about your results, but I suggest adding other observations to your test to help understand whether this is an inherent limit of the way Vert.x/Netty implement the two protocols, some "default configuration quirk" of Vert.x/Quarkus, or just the result of wrong assumptions (or both) about how they should behave under what look to be similar conditions.
And please provide the Hyperfoil YAML to replicate the test, so we can be sure we perform the same test as you.
It would be highly appreciated if you could collect profiling data using async-profiler, possibly using the -t option.
Adding @vietj in case he has something to share.
I still do NOT understand the benchmark that was referred to in this ticket to say that HTTP/2 performance is better than HTTP 1.1. I do NOT doubt that in your benchmark HTTP/2 is better, BUT it is NOT clear to me what you are testing (operations, connections, latencies, etc.) and from what angle you are saying that. Could you clarify the use case, the units in the table, etc.? I gather that you are using around 300 requests per second with latencies around 50-100 msecs. Am I right? Could you share your benchmark files and the approximate size of the responses? I assume that you are using really huge documents, or a very small amount of HW for the benchmark, to get the figures in the table.
My use case is very simple: 2 microservices communicating over HTTP REST APIs. A very simple request/response protocol (not like a browser with CSS, images, JavaScript, etc.). Requests are POSTs with very small JSON bodies, and responses are JSON around 4 KB.
Regarding your suggestions: why use the same number of connections?
I do NOT think that HTTP 1.1 with pipelining is the right choice for sending requests NOT related to each other, due to the ordering required by HTTP 1.1 pipelining (it could make sense for browsers, BUT most browsers actually use a pool of HTTP 1.1 connections). Anyhow, I tested it and the results are also better than HTTP/2.
The benchmark I am running shows that, in order to communicate with a Quarkus micro, it is more efficient to use a big enough pool of HTTP 1.1 connections than a smaller pool of fat HTTP/2 connections. I tried multiple combinations of HTTP/2 streams and numbers of connections without any success (to rule out that TCP flow control could be involved). Here, more efficient means fewer replicas of the microservice are needed to handle the same amount of load (and on top of that, latencies are better).
I already reported the results from the profiling I did (with my limited capability, and considering that I do NOT know the code) and I reported 3 things:
This is the hyperfoil file I used for HTTP/2
```yaml
name: chevaConstantRate
threads: 2
http:
  host: http://localhost:8080
  sharedConnections: 10
  allowHttp1x: false
  maxHttp2Streams: 100
ergonomics:
  # Disable stopping the scenario on 4xx or 5xx response
  autoRangeCheck: false
phases:
```
For HTTP 1.1 I replaced those settings with:

```yaml
sharedConnections: 100
allowHttp1x: true
```

I also tried HTTP 1.1 with pipelining:

```yaml
sharedConnections: 10
allowHttp1x: true
pipeliningLimit: 100
```
Summary of the results:
Quarkus version: 3.6.3, OpenJDK 17.0.9. 2 event loops (Quarkus by default uses 2 or more event loops). Benchmark tool: Hyperfoil 0.24-2 (also tried Hyperfoil 0.25-2). Benchmark test: constant rate, 4000 reqs/sec (POST requests in this case). Intel Core i9 with Manjaro (6.1.68-1-MANJARO #1 SMP PREEMPT_DYNAMIC Thu Dec 14 00:46:56 UTC 2023 x86_64 GNU/Linux).
Benchmark running for 3 mins for each config (several warmup rounds).
HTTP/2 (10 connections / max 100 streams per connection), all operations answered with 2xx: mean latency 967 µs, P50 856 µs, P99 1.50 ms, Quarkus CPU usage 1.30 cores.
HTTP 1.1 (100 connections), all operations answered with 2xx: mean latency 923 µs, P50 819 µs, P99 2.49 ms, Quarkus CPU usage 1.12 cores.
HTTP 1.1 with pipelining (10 connections / pipelining limit 100 per connection), very similar results to plain HTTP 1.1: mean latency 910 µs, P50 796 µs, P99 2.44 ms, Quarkus CPU usage 1.12 cores.
Summary: in my use case, HTTP 1.1 with pipelining is better than HTTP/2 (still NOT recommended, because a single heavy call will delay other calls), and HTTP 1.1 with a bigger pool of connections is better than HTTP/2.
Support for HTTP 1.1 pipelining in Hyperfoil is sadly broken (I have yet to fix it, given that I am a project committer), hence I suggest ignoring those results.
Related:

> I still do NOT understand the benchmark that was referred to in this ticket to say that HTTP/2 performance is better than HTTP 1.1
The results from @cjbooms seem to agree that they have degraded performance in Quarkus v3, to the point that the HTTP 1.1 performance (rps) is better than HTTP/2 (300 vs 240), which doesn't seem to disagree with your numbers: HTTP/2 isn't faster in v3. I agree anyway that the use cases could be very different and not comparable.
> Regarding your suggestions: why use the same number of connections?
Because of the way Netty handles parallelism/concurrency with streams vs physical connections, and the way head-of-line (HOL) blocking can bite the streams when a single response isn't sent in one go, causing others to be queued up. The more physical connections, the more real concurrency exists, unless Netty can chunk responses, allowing them to interleave. The reason I was asking about the number of cores, and suggesting constraining it to one, was to rule out problems with HTTP/2 physical connections not being correctly assigned to different physical cores, which grants some parallelism. HTTP 1.1 by default always does this, while with HTTP/2 I am not sure (meaning: I don't know).
Returning to the topic: I hope to have a look at your reproducer this week before Christmas and report any findings.
Please @cjbooms, could you create a new issue reporting just the comment about the HTTP/2 performance degradation compared to v2? I would like to keep these issues separate to avoid getting confused while looking at both.
> Support of pipelining for Hyperfoil in http 1.1 is sadly broken (I have to fix it yet, given that I am a project committer).
I saw you in some tickets. It is a great tool !!!!
Regarding the exception performance issue, I agree it indeed needs to be investigated, but looking at the past comments I see there is a change in Vert.x that should fix it: are you still observing the same behaviour?
Tested with Quarkus 3.32 and 3.6.3 (Both same behaviour and heavy amount of exceptions)
> It is a great tool !!!!
Thanks and happy you have used it!
> Tested with Quarkus 3.32 and 3.6.3 (Both same behaviour and heavy amount of exceptions)
I should check if the changes are in (likely) and what they were meant to solve
For a better comparison between HTTP/1 and H2, I think you should lower the maximum number of concurrent streams per connection, especially if you are using a small number of H2 connections (10); instead, increase the number of H2 connections and decrease the max number of concurrent streams, e.g. you could try 100 H2 connections with a max concurrent streams of 10.
A small number of connections (compared to the number of cores) will put more load on some cores than others; using more connections with a small max stream count tends to spread the load in a better way.
Of course, this is a recommendation for a benchmark.
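Concretely, this suggestion maps onto the Hyperfoil settings shared earlier in this thread roughly as follows (a sketch: only the two values change, the rest of the benchmark file stays as posted):

```yaml
# Sketch: more H2 connections, fewer concurrent streams per connection
http:
  host: http://localhost:8080
  sharedConnections: 100   # was 10
  allowHttp1x: false
  maxHttp2Streams: 10      # was 100
```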
@chevaris As mentioned in @vietj's comment: a fair comparison for such a synthetic case requires some test adjustments. In any case, I've investigated the difference in behavior between HTTP 1.1 and 2 for 2 types of "simple" test:
And it has shown a few low-hanging fruits, reported at https://github.com/quarkusio/quarkus/issues/37835, and others, more complex, reported at https://github.com/eclipse-vertx/vert.x/pull/5047.
Currently the vertx 4.x branch already contains https://github.com/eclipse-vertx/vert.x/commit/841d6fb23a70d42c03bad4bfb2c5941fca212ca0 and https://github.com/eclipse-vertx/vert.x/commit/48459044f218ff90f5a019849521d87ba6a260e4, which already address some evident cost due to the authority's validation, while other changes are still in progress in Netty (e.g. https://github.com/netty/netty/pull/13742) that improve header lookup performance, something that happens much more frequently than with HTTP 1.1.
Another low-hanging fruit (although still not obvious to fix) is the pseudo-header lookup and validation cost in Netty's HTTP/2, which seems to happen too much and too often in the HPACK decode path (i.e. https://github.com/netty/netty/blob/427ace4cdd86cce84e467ec0a8a81ca022ed6f72/codec-http2/src/main/java/io/netty/handler/codec/http2/HpackDecoder.java#L559-L561) and is visible in the captured profiles.
While the last 3 standing differences are:

- `io/netty/handler/codec/http2/DefaultHttp2Connection$DefaultEndpoint.createStream`
- `io/vertx/core/http/impl/Http2ServerConnection.createStream` (whose cost has already been halved by @vietj's changes mentioned above)
- `io/netty/channel/CoalescingBufferQueue.remove` (still under investigation)

This makes clear that the whole "stream" concept in HTTP/2 doesn't come for free and has its costs, especially for cases as simple as these, but clearly some of the overhead can be removed.
Generally speaking, the best we could improve directly within Vert.x (and hence, by consequence, for Quarkus) has already been done for everything that was detected as a problem.
If you have the chance to compile Vert.x 4.x and run the experiments you're used to running, you can verify that things are getting into better shape; it will take rolling a new release before the changes become visible to Quarkus, but it's a matter of time.
@chevaris @cjbooms update on this: I have further progressed in "fixing" the performance differences between HTTP 1.1 and 2 and found many other small/big changes, sent directly to Netty, e.g.
Some are already merged and others are in the process of being reviewed. Additionally, others relate to a deficiency in scaling, e.g. https://github.com/netty/netty/pull/13741
My take on HTTP/2 is that, under realistic and correct usage, it is a great protocol for reducing the required physical connections and improving network usage (thanks to HPACK caching/encoding), but in cases where:
It adds an inherent cost of managing the streams, including distributing their traffic fairly, coalescing writes and creating them in the hot path, which makes HTTP 1.1 just faster at its peak performance. This is especially true with pipelining, which in HTTP 1.1 doesn't have any infrastructure to handle concurrency, making it naturally prone to head-of-line problems, but able to maximize throughput.
This has been a surprising fact to me, but it is what it is. That said, we have addressed most of the evident (and less evident) inefficiencies we have found (and I have another couple in flight), saving wasteful work and reducing (sometimes dramatically) CPU usage, but under the conditions mentioned above its peak performance won't be as good as HTTP 1.1's.
Just adding this, but take it with a grain of salt: the overall improvement in peak CPU saving has been around 35-40% after applying all fixes to Quarkus.
Really, thanks a lot for the very detailed work on this and the support!!!!
I think you made a very good summary, and as you commented, in some cases the stream handling can be less performant than using extra connections. I actually got better results by decreasing the number of streams and using more connections, as suggested here.
I think it is really very good to see all the improvements coming, making the Vert.x/Quarkus HTTP/2 stack even better (it is already great compared with other options). The more I use it, the more I like it.
Description
BACKGROUND: I have implemented a Quarkus-based microservice that is targeted to replace a Spring Boot implementation.
The microservice receives POST (JSON) requests and answers with JSON.
LIMITATION WITH HTTP/2: We have observed that latencies when using HTTP/2 are worse than when using HTTP 1.1 (approx. 0.5 msecs more per request). CPU usage is also higher (between 5-10%). Obviously with HTTP/2 the number of connections needed to sustain the same throughput is much lower (multiplexing). This is NOT happening in the Spring (Jetty) implementation, in which HTTP/2 latencies are approximately the same as with Spring Boot HTTP 1.1.
GOAL OF THIS TICKET: The purpose of this ticket is to check why HTTP/2 latencies are worse (at least in microservices with long-living connections) compared with HTTP 1.1 and to provide a fix.
INITIAL ANALYSIS (IN CASE IT COULD HELP): After some analysis we have found a difference in Quarkus HTTP/2 compared with Quarkus HTTP 1.1 or Spring Jetty HTTP/2 that could explain the performance drop (worse latency).
We have captured packets for each implementation (images attached). This is the result: Quarkus HTTP/2 uses one extra message compared with the other implementations. Any reason for that? At least in this use case, I do NOT see the need to avoid sending headers and response data in the same packet. I can understand that streaming use cases could be different.
1.- Quarkus HTTP/2

```
Client  --- HTTP2/JSON POST HEADERS + DATA ---> Quarkus server
Quarkus --- HTTP2 HEADERS (200 OK) -----------> Client
Client  --- ACK ------------------------------>
Quarkus --- HTTP2/JSON DATA (END STREAM) -----> Client
Client  --- ACK ------------------------------>
```

2.- Quarkus HTTP/1.1

```
Client  --- HTTP1.1/JSON POST HEADERS + DATA -> Quarkus server
Quarkus --- HTTP1.1 HEADERS (200 OK) + DATA --> Client
Client  --- ACK ------------------------------>
```

3.- Spring (Jetty) HTTP/2

```
Client --- HTTP2/JSON POST HEADERS + DATA ------------> Spring server
Spring --- HTTP2 HEADERS (200 OK) + DATA (END STREAM) -> Client
Client --- ACK ---------------------------------------->
```
Reproducer to check network packages
code-with-quarkus.zip
Send traffic with curl, wrk, or Hyperfoil, and capture with Wireshark:

```shell
curl -v --http2 -d '{"name": "juan"}' -H "Content-Type: application/json" -X POST http://localhost:8080/hello
```
Implementation ideas
No response