Closed · adamw closed this 7 months ago
We've recently migrated one of our services from Akka HTTP to Tapir; sharing some insights. The service itself is relatively simple: the majority of requests are GETs that fetch a single record from the underlying MySQL database, the SQL query itself is a simple SELECT by primary key from a single table, and the response size is just a few kilobytes.
Initially I migrated the service to use Armeria as Tapir's backend.
Upon deployment, the latencies changed as follows:
- Average latency increased by 5 ms
- P99 latency increased by 8-10 ms
The traffic to the service in terms of request count looks like this:
CPU usage slightly increased as well (the service consists of two replicas using 1 CPU core each, 512MB of heap):
It wasn't clear to me whether the increase in latency came from the fact that we switched to Armeria, or whether it was Tapir's overhead. So I decided to check the performance when using Akka HTTP as Tapir's backend.
So next I deployed the service version based on Tapir/Akka HTTP stack. These are p99 latencies we've got before/after this deployment:
As you can see, p99 latencies are practically the same when using Tapir with either Armeria or Akka HTTP as backends. The situation is similar for avg latencies.
Based on the above, I can conclude that the slight latency increase comes from the overhead of Tapir being a wrapper over the actual HTTP server implementation.
This slight overhead (5-10 ms) is certainly something we can live with in our case, taking into account the other benefits provided by Tapir (namely pluggable HTTP backends, out-of-the-box support for OpenAPI specs/Swagger UI, Scala 3, etc.).
Hopefully, this information might be useful for any potential performance improvements.
Personally I like the way our endpoint definitions look now after migrating from Akka HTTP to Tapir, although it took a bit of time to switch my mindset from Akka HTTP route definitions to Tapir API.
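For readers unfamiliar with Tapir, a single-record GET endpoint like the one described above might be defined roughly as follows. This is only a sketch: the `Record` model, path, and error channel are hypothetical, not taken from the actual service.

```scala
import sttp.tapir._
import sttp.tapir.generic.auto._
import sttp.tapir.json.circe._
import io.circe.generic.auto._

// Hypothetical record model - the real service's schema isn't shown in the thread
case class Record(id: Long, payload: String)

// A GET-by-primary-key endpoint, roughly matching the service described above:
// GET /records/{id}, returning the record as JSON, or a plain-text error
val getRecord: PublicEndpoint[Long, String, Record, Any] =
  endpoint.get
    .in("records" / path[Long]("id"))
    .errorOut(stringBody)
    .out(jsonBody[Record])
```

Such a definition is backend-agnostic: the same value can be interpreted by the Armeria, Akka HTTP or Netty server modules, which is what made the backend comparison above possible.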
Thank you for your work!
!!!UPDATE!!!
It eventually turned out that the latency increase was caused not by Tapir's overhead, but by an additional authorization call, backed by a SQL query, that I introduced during the migration and didn't pay attention to when comparing the performance.
@thereisnospoon Thanks a lot for the detailed investigation, very helpful! Would be definitely great to shave off some of that latency - maybe there are some opportunities around request decoding, but that's just a guess, we'll have to profile first.
Some updates on my previous comment:
It slipped my attention that when migrating from Akka HTTP to Tapir, I added an additional authorization call to the most used service endpoint, which implies an additional SQL query to MySQL. So this is the actual reason for the subtle performance degradation.
I've tried temporarily disabling the added authorization logic to check the performance, and it now looks like performance is back to the pre-migration level. For example, average query latency to the API is around 8 ms, practically the same as it was:
I will update my initial comment to indicate that the performance degradation was not caused by Tapir.
Also, in any case, it's great to know that there is practically no performance overhead from Tapir, at least in the case of this relatively simple API of ours.
@thereisnospoon Ha that's a nice surprise :) Doesn't mean we can't be even better ;) Thanks again for the use-case!
@adamw As we discussed today, we'd like to work some more on Tapir performance tests. Before specifying concrete scenarios that are important in the first place, let's make sure we agree on base goals. My proposition:
@kciesielski I think that's a very good high-level plan. One adjustment:
@adamw A proposition of scenarios/backends to start with:
Simple GET latency
Current test: concurrent users calling dummy GET endpoints. Rationale: a simple test which allows us to quickly check the overhead on processing time, with minimal Tapir involvement. No deviation exceeding ~10 milliseconds should be expected. Endpoint specifications:
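A sketch of what such a dummy endpoint could look like (the names and paths are illustrative; the definitions actually used in the perf tests may differ):

```scala
import sttp.tapir._

// A dummy GET endpoint with a single path parameter and a plain-text response,
// keeping Tapir's involvement (path decoding, output encoding) minimal
val simpleGet: PublicEndpoint[Int, Unit, String, Any] =
  endpoint.get
    .in("path" / path[Int]("n"))
    .out(stringBody)
```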
Simple raw input latency (small input)
Dummy POST endpoints with String/ByteArray input. Backends: Netty Future, Netty Cats, Netty ZIO, http4s, Pekko, vert.x. Why: see https://github.com/softwaremill/tapir/issues/3367. Netty is our candidate for a main backend, where we expect minimal overhead. We're interested in the real impact of the current approach to raw request body processing with reactive streams. For comparison, we want to run this scenario with other leading backends like http4s. Measurements:
Simple raw input latency (5MB input)
Similar to the previous test, but with larger input, to see what the overhead is of Tapir putting chunks together in some servers. Dummy POST endpoints with String/ByteArray input. Backends: Netty Future, Netty Cats, Netty ZIO, http4s, Pekko, vert.x. Why: see https://github.com/softwaremill/tapir/issues/3367. Netty is our candidate for a main backend, where we expect minimal overhead. We're interested in the real impact of the current approach to raw request body processing with reactive streams. For comparison, we want to run this scenario with other leading backends like http4s. Measurements:
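The two raw-input endpoint variants could be sketched as follows (hypothetical names; the perf-test code may define them differently). Both variants accumulate the entire request body in memory before the server logic runs, which is exactly the chunk-assembly path the tests aim to measure:

```scala
import sttp.tapir._

// Raw request body read fully into memory as a String
val postString: PublicEndpoint[String, Unit, String, Any] =
  endpoint.post.in("string").in(stringBody).out(stringBody)

// Raw request body read fully into memory as a byte array
val postBytes: PublicEndpoint[Array[Byte], Unit, String, Any] =
  endpoint.post.in("bytes").in(byteArrayBody).out(stringBody)
```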
Raw File input latency
Dummy POST endpoints with raw File input. Backends: Netty Future, Netty Cats, Netty ZIO, http4s, Pekko, vert.x. Rationale: see https://github.com/softwaremill/tapir/issues/3367. Similar to "Simple raw input latency", this test will measure how long it takes to write a file provided as raw input. Writing files is implemented using a custom reactive subscriber in the Netty Future server, while servers like http4s, netty-cats, netty-zio, etc. use streaming integrations from the respective libraries. It would be good to compare these implementations. Measurements:
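A minimal sketch of the file-input endpoint (names assumed, not from the perf-test code). With `fileBody`, the interpreter streams the request body to a file on disk, so the per-backend write path described above is what gets exercised:

```scala
import sttp.tapir._

// Raw request body streamed to a file; how the chunks are written
// (custom reactive subscriber vs. fs2/zio-streams integration) depends on the server
val postFile: PublicEndpoint[TapirFile, Unit, String, Any] =
  endpoint.post.in("file").in(fileBody).out(stringBody)
```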
Websockets latency
Based on https://github.com/kamilkloch/websocket-benchmark/tree/master. The server exposes a single /ts websocket endpoint, which emits a timestamp every 500 ms. Rationale: an inefficiency in Tapir's WS handling for http4s was measured to cause tail latency of hundreds of milliseconds. A fix has been provided by Kamil Kloch, which brought the values very close to raw http4s. We want to keep a test for other implementations and for future reference. Measurements:
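The endpoint in that benchmark can be sketched along these lines for the http4s/fs2 case (a sketch, assuming fs2 streams and cats-effect IO; the actual benchmark code may differ):

```scala
import sttp.tapir._
import sttp.capabilities.fs2.Fs2Streams
import cats.effect.IO

// GET /ts, upgraded to a websocket; both directions carry plain-text frames.
// The server logic would emit the current timestamp every 500 ms.
val wsTimestamps =
  endpoint.get
    .in("ts")
    .out(webSocketBody[String, CodecFormat.TextPlain, String, CodecFormat.TextPlain](Fs2Streams[IO]))
```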
Looks good :)
Simple GET latency -> this would also involve comparing to no-tapir setups? http4s and pekko are easy, netty - probably not so much?
Simple raw input latency -> what would be the endpoint definition - reading the entire request into memory (as a string?), or using some kind of streaming (reactive / input stream)?
Raw File input latency -> not sure if the file isn't too small to measure the overhead of reading & writing multiple chunks?
Websockets latency -> 500ms sounds like a no-brainer - if we're slower than that, then we've got a serious problem ;). Maybe some kind of continuous transmission, or a skewed ping-pong, where our server sends e.g. 100 messages, and waits for a reply, and we see how many roundtrips we manage to make?
One more test that would be good to have is one that includes various interceptors (exception, logging, metrics) - that would help resolve https://github.com/softwaremill/tapir/issues/3272
> Simple GET latency -> this would also involve comparing to no-tapir setups? http4s and pekko are easy, netty - probably not so much?
Yes, the no-tapir setups are in the current perf test set, so we should compare them. It's indeed not easy to build a comprehensive no-tapir netty server. One more server we could add is zio-http, which is netty-based.
> Simple raw input latency -> what would be the endpoint definition - reading the entire request into memory (as a string?), or using some kind of streaming (reactive / input stream)?
I thought about using a raw string or byte array, which, in the case of netty backends, will be built from a reactive stream that reads input in chunks to create the full final request body in memory. First, I'd like to test it with small input, something that's a typical raw input, like a JSON of a few hundred bytes. This would run the entire streaming machinery for one iteration, because the chunk size is 8192 bytes, and that's something that might have overhead worth comparing with other backends.
> Raw File input latency -> not sure if the file isn't too small to measure the overhead of reading & writing multiple chunks?
Chunk size is 8192 by default as well, at least for our reactive streams, fs2 and zio streams, so 512kB sounds good enough.
> Websockets: Maybe some kind of continuous transmission, or a skewed ping-pong, where our server sends e.g. 100 messages, and waits for a reply, and we see how many roundtrips we manage to make?
Sounds good.
> One more test that would be good to have, is including various interceptors
Yes, thanks for mentioning that, we definitely should add this as well.
> Yes, the no-tapir setups are in the current perf test set, so we should compare them. It's indeed not easy to build a comprehensive no-tapir netty server. One more server we could add is zio-http, which is netty-based.
Probably not much sense investing too much into a pure-netty server :) Not sure if it makes sense to compare with zio-http - it's in a lot of flux, so I'm not sure how polished it is. Maybe pekko-http will provide a good enough baseline? As a server, it's rather fast :) Otherwise, we might look at vertx - it's one of the fastest servers out there AFAIK.
> Chunk size is 8192 by default as well, at least for our reactive streams, fs2 and zio streams, so 512kB sounds good enough.
Ah ok :) So we would test with small inputs (1 chunk), and large inputs (~60 chunks) - both for file & string/byte array? Still not sure if 60 chunks will exhibit any significant overhead that might be there.
> Maybe pekko-http will provide a good enough baseline? As a server, it's rather fast :) Otherwise, we might look at vertx - it's one of the fastest servers out there AFAIK.
Sounds good. Also, good to know that vertx is fast, I'll check it out then.
> Still not sure if 60 chunks will exhibit any significant overhead that might be there.
Maybe we should use much larger files, like hundreds of chunks, but with fewer requests and a shorter time, so we don't run out of disk space in seconds? For example: 640 chunks = 5MB, 20 concurrent users, 512 requests per user, which gives 50 GB.
> Maybe we should use much larger files, like hundreds of chunks, but with fewer requests and a shorter time, so we don't run out of disk space in seconds? For example: 640 chunks = 5MB, 20 concurrent users, 512 requests per user, which gives 50 GB.
Hm... well, it's also a question: do we want to test the server under load (many concurrent requests), or the latency of a single request? I think if you want to look at the overhead of our stream processing logic, looking at a single request would be more informative - you isolate the aspect you want to test more.
This might also be true for the other tests, as Tapir code isn't really involved in concurrency (it only "adds overhead" to the sequential processing of a request into a response). One aspect that might have an impact under high load, which we won't be able to measure looking at single requests, is increased memory pressure - how much garbage we produce, and how much additional latency collecting this garbage creates.
Let's see, we could achieve this by parameterizing scenarios with the concurrent user count. Run with nUsers=1 and a lot of requests to isolate single-request processing and measure the latency footprint, or run with nUsers > 1 for cases where we want to check whether concurrency affects latency and how the memory behaves, wdyt?
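Such parameterization could be sketched as a Gatling simulation along these lines (a sketch only: the `users` system property, URL, and request counts are assumptions, not the actual perf-test code):

```scala
import io.gatling.core.Predef._
import io.gatling.http.Predef._

class SimpleGetSimulation extends Simulation {
  // nUsers = 1 isolates single-request latency; larger values exercise the
  // server under concurrency (hypothetical -Dusers=... property, defaulting to 1)
  private val nUsers: Int = Integer.getInteger("users", 1)

  private val httpProtocol = http.baseUrl("http://localhost:8080")

  // Each virtual user issues many sequential requests against the dummy endpoint
  private val scn = scenario("Simple GET")
    .repeat(1000) {
      exec(http("get").get("/path/1"))
    }

  setUp(scn.inject(atOnceUsers(nUsers))).protocols(httpProtocol)
}
```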
> Let's see, we could achieve this by parameterizing scenarios with the concurrent user count. Run with nUsers=1 and a lot of requests to isolate single-request processing and measure the latency footprint, or run with nUsers > 1 for cases where we want to check whether concurrency affects latency and how the memory behaves, wdyt?
True, if we have a test harness where we can easily parametrize these tests, run them and look at the results, why not run both :)
First part of investigation https://softwaremill.com/benchmarking-tapir-part-1/
Second part: https://softwaremill.com/benchmarking-tapir-part-2/. I'm closing this issue, as the investigation part is pretty much done.
In addition to the throughput tests we've conducted using akka-http & http4s (in perf-tests), inspect how using tapir influences memory & CPU usage, as compared to using these servers directly. Also, take a look at latency under sustained load at a fixed req/s rate. Source: https://twitter.com/aplokhotnyuk/status/1603342821594890247