Closed · adamw closed this 7 months ago
We've recently migrated one of our services from Akka HTTP to Tapir; sharing some insights. The service itself is relatively simple: the majority of requests are GETs that fetch a single record from the underlying MySQL database, the SQL query itself is a simple SELECT by primary key from a single table, and the response size is just a few kilobytes.
Initially I migrated the service to use Armeria as Tapir's backend.
Upon deployment, the latencies changed as follows:
- Average latency increased by 5 ms
- P99 latency increased by 8-10 ms
The traffic to the service in terms of request count looks like this:
CPU usage slightly increased as well (the service consists of two replicas using 1 CPU core each, 512MB of heap):
It wasn't clear to me whether the increase in latency came from the fact that we switched to Armeria, or whether it was Tapir's overhead. So I decided to check the performance when using Akka HTTP as Tapir's backend.
So next I deployed the service version based on Tapir/Akka HTTP stack. These are p99 latencies we've got before/after this deployment:
As you can see, p99 latencies are practically the same when using Tapir with either Armeria or Akka HTTP as backends. The situation is similar for avg latencies.
Based on the above, I can conclude that the slight latency increase comes from the overhead of Tapir being a wrapper over the actual HTTP server implementation.
This slight overhead (5-10 ms) is certainly something we can live with in our case, taking into account the other benefits provided by Tapir (namely pluggable HTTP backends, out-of-the-box support for OpenAPI specs/Swagger UI, Scala 3, etc.).
Hopefully, this information might be useful for any potential performance improvements.
Personally I like the way our endpoint definitions look now after migrating from Akka HTTP to Tapir, although it took a bit of time to switch my mindset from Akka HTTP route definitions to Tapir API.
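For readers unfamiliar with Tapir, a single-record GET endpoint like the one described above might be defined roughly as follows. This is only a sketch: the `Record` model, path, and error channel are hypothetical, not taken from the actual service.

```scala
import sttp.tapir._
import sttp.tapir.generic.auto._
import sttp.tapir.json.circe._
import io.circe.generic.auto._

// Hypothetical record model - the real service's schema isn't shown in the thread
case class Record(id: Long, payload: String)

// A GET-by-primary-key endpoint, roughly matching the service described above:
// GET /records/{id}, returning the record as JSON, or a plain-text error
val getRecord: PublicEndpoint[Long, String, Record, Any] =
  endpoint.get
    .in("records" / path[Long]("id"))
    .errorOut(stringBody)
    .out(jsonBody[Record])
```

Such a definition is backend-agnostic: the same value can be interpreted by the Armeria, Akka HTTP or Netty server modules, which is what made the backend comparison above possible.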
Thank you for your work!
!!!UPDATE!!!
It eventually turned out that the latency increase was caused not by Tapir's overhead, but by an additional authorization call, backed by a SQL query, that I introduced during the migration and didn't pay attention to when comparing the performance.
@thereisnospoon Thanks a lot for the detailed investigation, very helpful! Would be definitely great to shave off some of that latency - maybe there are some opportunities around request decoding, but that's just a guess, we'll have to profile first.
Some updates on my previous comment:
It slipped my attention that when migrating from Akka HTTP to Tapir, I added an additional authorization call to the most used service endpoint, which implies an additional SQL query to MySQL. So this is the actual reason for the subtle performance degradation.
I've tried temporarily disabling the added authorization logic to check the performance, and it now looks like performance is back to the pre-migration level. For example, average query latency to the API is around 8 ms, practically the same as it was:
I will update my initial comment to indicate that the performance degradation was not caused by Tapir.
Also, in any case, it's great to know that there is practically no performance overhead from Tapir, at least in the case of this relatively simple API of ours.
@thereisnospoon Ha that's a nice surprise :) Doesn't mean we can't be even better ;) Thanks again for the use-case!
@adamw As we discussed today, we'd like to work some more on Tapir performance tests. Before specifying concrete scenarios that are important in the first place, let's make sure we agree on base goals. My proposition:
@kciesielski I think that's a very good high-level plan. One adjustment:
@adamw A proposition of scenarios/backends to start with:
Simple GET latency
Current test: concurrent users calling dummy GET endpoints. Rationale: a simple test which allows us to quickly check the overhead on processing time, with minimal Tapir involvement. No deviation exceeding ~10 milliseconds should be expected. Endpoint specifications:
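A sketch of what such a dummy endpoint could look like (the names and paths are illustrative; the definitions actually used in the perf tests may differ):

```scala
import sttp.tapir._

// A dummy GET endpoint with a single path parameter and a plain-text response,
// keeping Tapir's involvement (path decoding, output encoding) minimal
val simpleGet: PublicEndpoint[Int, Unit, String, Any] =
  endpoint.get
    .in("path" / path[Int]("n"))
    .out(stringBody)
```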
Simple raw input latency (small input)
Dummy POST endpoints with String/ByteArray input. Backends: Netty Future, Netty Cats, Netty ZIO, http4s, Pekko, vert.x. Why: see https://github.com/softwaremill/tapir/issues/3367. Netty is our candidate for a main backend, where we expect minimal overhead. We're interested in the real impact of the current approach to raw request body processing with reactive streams. For comparison, we want to run this scenario with other leading backends like http4s. Measurements:
Simple raw input latency (5MB input)
Similar to the previous test, but with larger input, to see what the overhead is of Tapir putting chunks together in some servers. Dummy POST endpoints with String/ByteArray input. Backends: Netty Future, Netty Cats, Netty ZIO, http4s, Pekko, vert.x. Why: see https://github.com/softwaremill/tapir/issues/3367. Netty is our candidate for a main backend, where we expect minimal overhead. We're interested in the real impact of the current approach to raw request body processing with reactive streams. For comparison, we want to run this scenario with other leading backends like http4s. Measurements:
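The two raw-input endpoint variants could be sketched as follows (hypothetical names; the perf-test code may define them differently). Both variants accumulate the entire request body in memory before the server logic runs, which is exactly the chunk-assembly path the tests aim to measure:

```scala
import sttp.tapir._

// Raw request body read fully into memory as a String
val postString: PublicEndpoint[String, Unit, String, Any] =
  endpoint.post.in("string").in(stringBody).out(stringBody)

// Raw request body read fully into memory as a byte array
val postBytes: PublicEndpoint[Array[Byte], Unit, String, Any] =
  endpoint.post.in("bytes").in(byteArrayBody).out(stringBody)
```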
Raw File input latency
Dummy POST endpoints with raw File input. Backends: Netty Future, Netty Cats, Netty ZIO, http4s, Pekko, vert.x. Rationale: see https://github.com/softwaremill/tapir/issues/3367. Similar to "Simple raw input latency", this test will measure how long it takes to write a file provided as raw input. Writing files is implemented using a custom reactive subscriber in the Netty Future server, while servers like http4s, netty-cats, netty-zio, etc. use streaming integrations from the respective libraries. It would be good to compare these implementations. Measurements:
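A minimal sketch of the file-input endpoint (names assumed, not from the perf-test code). With `fileBody`, the interpreter streams the request body to a file on disk, so the per-backend write path described above is what gets exercised:

```scala
import sttp.tapir._

// Raw request body streamed to a file; how the chunks are written
// (custom reactive subscriber vs. fs2/zio-streams integration) depends on the server
val postFile: PublicEndpoint[TapirFile, Unit, String, Any] =
  endpoint.post.in("file").in(fileBody).out(stringBody)
```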
Websockets latency
Based on https://github.com/kamilkloch/websocket-benchmark/tree/master. The server exposes a single /ts websocket endpoint, which emits a timestamp every 500 ms. Rationale: an inefficiency in Tapir's WS handling for http4s was measured to cause tail latency of hundreds of milliseconds. A fix has been provided by Kamil Kloch, which brought the values very close to raw http4s. We want to keep a test for other implementations and for future reference. Measurements:
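The endpoint in that benchmark can be sketched along these lines for the http4s/fs2 case (a sketch, assuming fs2 streams and cats-effect IO; the actual benchmark code may differ):

```scala
import sttp.tapir._
import sttp.capabilities.fs2.Fs2Streams
import cats.effect.IO

// GET /ts, upgraded to a websocket; both directions carry plain-text frames.
// The server logic would emit the current timestamp every 500 ms.
val wsTimestamps =
  endpoint.get
    .in("ts")
    .out(webSocketBody[String, CodecFormat.TextPlain, String, CodecFormat.TextPlain](Fs2Streams[IO]))
```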
Looks good :)
Simple GET latency -> this would also involve comparing to no-tapir setups? http4s and pekko are easy, netty - probably not so much?
Simple raw input latency -> what would be the endpoint definition - reading the entire request into memory (as a string?), or using some kind of streaming (reactive / input stream)?
Raw File input latency -> not sure if the file isn't too small to measure the overhead of reading & writing multiple chunks?
Websockets latency -> 500ms sounds like a no-brainer - if we're slower than that, then we've got a serious problem ;). Maybe some kind of continuous transmission, or a skewed ping-pong, where our server sends e.g. 100 messages, and waits for a reply, and we see how many roundtrips we manage to make?
One more test that would be good to have is one that includes various interceptors (exception, logging, metrics) - that would help resolve https://github.com/softwaremill/tapir/issues/3272
> Simple GET latency -> this would also involve comparing to no-tapir setups? http4s and pekko are easy, netty - probably not so much?
Yes, the no-tapir setups are in the current perf test set, so we should compare them. It's indeed not easy to build a comprehensive no-tapir netty server. One more server we could add is zio-http, which is netty-based.
> Simple raw input latency -> what would be the endpoint definition - reading the entire request into memory (as a string?), or using some kind of streaming (reactive / input stream)?
I thought about using a raw string or byte array, which, in the case of netty backends, will be built from a reactive stream that reads input in chunks to create the full final request body in memory. First, I'd like to test it with small input, something that's a typical raw input, like a JSON of a few hundred bytes. This would run the entire streaming machinery for one iteration, because the chunk size is 8192 bytes, and that's something that might have overhead worth comparing with other backends.
> Raw File input latency -> not sure if the file isn't too small to measure the overhead of reading & writing multiple chunks?
Chunk size is 8192 by default as well, at least for our reactive streams, fs2 and zio streams, so 512kB sounds good enough.
> Websockets: Maybe some kind of continuous transmission, or a skewed ping-pong, where our server sends e.g. 100 messages, and waits for a reply, and we see how many roundtrips we manage to make?
Sounds good.
> One more test that would be good to have, is including various interceptors
Yes, thanks for mentioning that, we definitely should add this as well.
> Yes, the no-tapir setups are in the current perf test set, so we should compare them. It's indeed not easy to build a comprehensive no-tapir netty server. One more server we could add is zio-http, which is netty-based.
Probably not much sense investing too much into a pure-netty server :) Not sure if it makes sense to compare with zio-http - it's in a lot of flux, so I'm not sure how polished it is. Maybe pekko-http will provide a good enough baseline? As a server, it's rather fast :) Otherwise, we might look at vertx - it's one of the fastest servers out there AFAIK.
> Chunk size is 8192 by default as well, at least for our reactive streams, fs2 and zio streams, so 512kB sounds good enough.
Ah ok :) So we would test with small inputs (1 chunk), and large inputs (~60 chunks) - both for file & string/byte array? Still not sure if 60 chunks will exhibit any significant overhead that might be there.
> Maybe pekko-http will provide a good enough baseline? As a server, it's rather fast :) Otherwise, we might look at vertx - it's one of the fastest servers out there AFAIK.
Sounds good. Also, good to know that vertx is fast, I'll check it out then.
> Still not sure if 60 chunks will exhibit any significant overhead that might be there.
Maybe we should use much larger files, like hundreds of chunks, but with fewer requests and a shorter time, so we don't run out of disk space in seconds? For example: 640 chunks = 5MB, 20 concurrent users, 512 requests per user, which gives 50 GB.
> Maybe we should use much larger files, like hundreds of chunks, but with fewer requests and a shorter time, so we don't run out of disk space in seconds? For example: 640 chunks = 5MB, 20 concurrent users, 512 requests per user, which gives 50 GB.
Hm... well, it's also a question: do we want to test the server under load (many concurrent requests), or the latency of a single request? I think if you want to look at the overhead of our stream processing logic, looking at a single request would be more informative - you isolate the aspect you want to test more.
This might also be true for the other tests, as Tapir code isn't really involved in concurrency (it only "adds overhead" to the sequential processing of a request into a response). One aspect that might have an impact under high load, which we won't be able to measure looking at single requests, is increased memory pressure - how much garbage we produce, and how much additional latency collecting this garbage creates.
Let's see, we could achieve this by parameterizing scenarios with the concurrent user count. Run with nUsers=1 and a lot of requests to isolate single-request processing and measure the latency footprint, or run with nUsers > 1 for cases where we want to check whether concurrency affects latency and how the memory behaves, wdyt?
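Such parameterization could be sketched as a Gatling simulation along these lines (a sketch only: the `users` system property, URL, and request counts are assumptions, not the actual perf-test code):

```scala
import io.gatling.core.Predef._
import io.gatling.http.Predef._

class SimpleGetSimulation extends Simulation {
  // nUsers = 1 isolates single-request latency; larger values exercise the
  // server under concurrency (hypothetical -Dusers=... property, defaulting to 1)
  private val nUsers: Int = Integer.getInteger("users", 1)

  private val httpProtocol = http.baseUrl("http://localhost:8080")

  // Each virtual user issues many sequential requests against the dummy endpoint
  private val scn = scenario("Simple GET")
    .repeat(1000) {
      exec(http("get").get("/path/1"))
    }

  setUp(scn.inject(atOnceUsers(nUsers))).protocols(httpProtocol)
}
```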
> Let's see, we could achieve this by parameterizing scenarios with the concurrent user count. Run with nUsers=1 and a lot of requests to isolate single-request processing and measure the latency footprint, or run with nUsers > 1 for cases where we want to check whether concurrency affects latency and how the memory behaves, wdyt?
True, if we have a test harness where we can easily parametrize these tests, run them and look at the results, why not run both :)
First part of investigation https://softwaremill.com/benchmarking-tapir-part-1/
Second part: https://softwaremill.com/benchmarking-tapir-part-2/. I'm closing this issue, as the investigation part is pretty much done.
In addition to the throughput tests we've conducted using akka-http & http4s (in perf-tests), inspect how using tapir influences memory & CPU usage, as compared to using these servers directly. Also, take a look at latency under sustained load at a fixed req/s rate. Source: https://twitter.com/aplokhotnyuk/status/1603342821594890247