Closed franz1981 closed 1 year ago
/cc @FroMage (resteasy-reactive), @Sgitario (resteasy-reactive), @stuartwdouglas (resteasy-reactive)
Thanks for writing this up @franz1981.
What do you propose we do?
@geoand I'm preparing a PoC, but the ideas that come in my mind are:
ResteasyReactiveOutputStream
to hint it before Jackson will perform its stream::write
: this would enable the stream to be aware that no more data will be ever added (as the stream will be closed right after) and can make use to allocate a right-sized buffer (if within the configured outputBufferSize
)ResteasyReactiveOutputStream
is used in other cases where we don't know how many times we're going to use the stream to add more data, we loose the nice batching behaviour we have nowProbably you can help me understand how other streaming cases would use this stream to append chunks; if it can happen
I've learnt a bit more how ResteasyReactiveOutputStream
is used by BasicServerJacksonMessageBodyWriter
and UTF8JsonGenerator
: UTF8JsonGenerator
would flush its encoded content if it exceed 8000 bytes (that's the value in the second position on https://github.com/FasterXML/jackson-core/blob/31d4d85dcb25ca6699294db69591388470b9f8fc/src/main/java/com/fasterxml/jackson/core/util/BufferRecycler.java#L76), to keep on encoding.
if UTF8JsonGenerator
won't exceed such value, it would flush to the underlying output stream (ie ResteasyReactiveOutputStream
) the whole encoded content as a whole, on close (see https://github.com/FasterXML/jackson-core/blob/31d4d85dcb25ca6699294db69591388470b9f8fc/src/main/java/com/fasterxml/jackson/core/json/UTF8JsonGenerator.java#L1227).
It means that we cannot rely on the fact that BasicServerJacksonMessageBodyWriter
would issue a single flush to ResteasyReactiveOutputStream
, because it depends by Jackson's buffer capacity as well. But is true as well that if Jackson perform more then one flush, we won't waste that much bytes by sizing the first chunk by 8K (that's the supposed amount of data Jackson would flush in case of multiple flushes) vs the default stream buffer size of 8191 bytes.
Probably you can help me understand how other streaming cases would use this stream to append chunks; if it can happen
In almost all cases, we know the full object which will be written (like in the case of Jackson, where we obviously know the POJO).
However, there is also the streaming data case (where a user returns a Multi<T>
from their method) in which case we have no idea how much data there will be.
Also the user can just directly use this stream.
Something I did notice you mentioned was this: 'Performing an allocation of 8K and assuming a I/O caller thread (aka the event loop), '
This is a blocking stream, it should not really be used by IO threads as it can block them?
@stuartwdouglas Yep, my point was to stress the case where Netty is using thread local buffers and simplify the example. I believe in the event loop it shouldn't block, I have to check..
@geoand @stuartwdouglas Proposal is at https://github.com/franz1981/quarkus/tree/append_buffer
Right now I got 3 failed tests using the
All tests on AppendBuffer::withExactChunks
version (and working on fixing it)reasteasy-reactive-vertx
work with AppendBuffer::withExactChunks
as well.
The idea is to have Jackson able to use a "special" buffer that is aware that the caller is already batchy and don't try to be "smart".
The append buffer I've designed can work in 3 different ways:
AppendBuffer::withExactChunks
: where every write accumulate buffers of exact requested size, unless reaching the max capacity, then caller must flush the current batch (and that's going to create a single composite ByteBuf for Vertx/Netty) to keep on appendingAppendBuffer::withEagerChunks
is the same as it is now ie buffers are always eagerly allocated of a given chunk size and will be flushed when no space is available anymoreI would use AppendBuffer::withExactChunks
with batchy appenders (a-la Jackson) that's using already some internal buffer to accumulate writes and AppendBuffer::withEagerChunks
elsewhere (to not impact existing use cases).
It's important for AppendBuffer::withExactChunks
to work at its best that the "batchy" caller maximize itself the component added to the append buffer by NOT using small batches - it has been designed with that in mind.
I didn't yet implemented any mechanism to release components in case of errors, but I would likely implement those in the append buffer itself, for simplicity - the original code was very well designed to save leak and I see no tests (I would add some to be sure to keep this nice feature).
Feedbacks are welcome before sending a PR
Can I propose a slightly different approach?
Most uses will be a single exact write, so withExactChunks is the correct approach, and even multiple larger writes it will work fine as long as the writes are then merged into a composite buffer so there is only a single writev call.
You might also want to consider a high/low water mark approach to max buffer size. E.g. once the buffer hits 8k we flush it, but if we have a single large write we can allocate up to 16k or more because we know it will immediately be sent to Netty.
It would also be good to know what the actual optimal chunk size is these days. When I was doing Undertow I got the best performance using 16kb write calls, but that was a long time ago, and all hardware and software involved is now massivly obsolete.
So to try and sum up all those points it would look something like:
Obviously all these numbers are made up, and should be replaced with numbers from hard evidence rather than my gut feeling.
On second thoughts the 'high water/low water' approach might not be the best idea, especially if your batch size is properly calibrated. It could result in a little bit of overflow being sent in its own packet instead of being batched.
I would propose testing both withExactChunks
and withEagerChunks
on some typical json output and then try to decide which strategy fits best for the common scenario (or see if we can adapt based on a pattern).
BTW, @franz1981 you probably also want to run the tests in extensions/resteasy-reactive/quarkus-resteasy-reactive/deployment
and extensions/resteasy-reactive/quarkus-resteasy-reactive-jackson/deployment
I would really like to get something along these lines in however, because as I understand it by what Franz has said from his preliminary investigations, this will lead to non trivial gains in terms of memory usage
@stuartwdouglas
To summarize, we have here a mix of different things we would like to do right I see:
writev
vs write
buffer size ie how much small the buffers can be in order to not make writev
inefficient for the linux kernel?size
s) while used by batchy callers - and by consequence, do not waste TLAB memory for small writes that cannot be batched (HTTP 1.1 without pipelining with "small" responses)In addition, I would add the behaviour of HTTP 11 (didn't checked for HTTP 2) at https://github.com/netty/netty/blob/9f10a284712bbd7f4f8ded7711a50bfaff38f567/codec-http/src/main/java/io/netty/handler/codec/http/HttpObjectEncoder.java#L334
In short, if the previously allocated headers buffers (Netty uses a statistics to size them, based on past data points) are big enough to host the provided content (ie what quarkus is passing), it would write it there instead of passing it to the transport as it is ie potentially a composite buffer.
The point made by @stuartwdouglas is fair: right now the buffer is using a bi-modal logic where you can disable the limits in the single writes (as long as they fit into the over all total capacity offered - that's ~8K by default), but I can slightly change it in order to still have the notion of total capacity (which influence when an append
cause the caller to perform a flush -> and a syscall to write/writev, by consequence) but introduce the concept of minChunkSize
that's
used to decide what's the "minimal" (opposed to the current chunkSize
) capacity of the single chunks where to batch writes into.
If the incoming write is already bigger then minChunkSize
, it would use the exact one (till the overall capacity), while if smaller, it will still allocate minChunkSize
. to cope for "future" writes.
This should be able to work as per @stuartwdouglas suggestion and save having different buffer configurations while using Jackson vs the others (that's good).
The sole concern here (meaning that I gotta band my head on this a bit more), is that batching for a minimum of minChunkSize
will still underutilized the different TLABs available to Netty: I'll take a look better how they are constructed (in which classes of size) again. But still: if we batch we would save buffer instances as well, that will have
an on-heap impact too (and likely saving composite buffers in the super-happy path, that's great, really).
I've added a further commit to modify the behavior as suggested by @stuartwdouglas at https://github.com/franz1981/quarkus/commit/f428f937593c3b3cbaf6dbead6d71887129d7540 (same branch as before, but new commit): right now is using a magic value of 128
ie 2 cache lines, but we can choose something more appropriate in case.
I've left the option of a fully caller-driven chunk allocation version ie withExactChunks
because I still believe that in cases where we can trust a batchy caller, there's no need to try being smart, but TBH if I made the check on the last buffer capacity left cheap, it shouldn't be needed (that's why I've introduced this option too)
I am now checking how the size classes of Netty are composed and searching on https://www.kernel.org/doc/html/latest/index.html if I can find anything interesting re writev
suggested buffer size
Something else to consider if we want absolute maximum performance is that the first write will also have the headers, while subsequent writes won't (although they may have chunking data).
Does Netty handle this sort of batching optimisation? Maybe a Netty handler is actually the correct place because you can just deal with fixed size buffers without needing to know about headers etc.
My gut feeling is that you would want to batch close to the MTU size so that every writev call maps to a single packet.
Does Netty handle this sort of batching optimisation?
The netty http 1.1 encoder is smart enough to try estimating the http headers based on a statistics of the previously sent one, allocating a single buffer in one go - and if the payload (non-chunked) is small enough, can be usually written directly into it (due to padding for alignments of the Netty allocator). Hence, Netty try hard at the handler level to save having indivitual (and small) buffers to deal with. Vertx as well try hard to flush very late (but without pipelining this is sadly just a latency delay here). I can tell that we try to be batchy at any level of the stack. But without any hard constrain on the size of such batches; just "moar is better" let's say.
My gut feeling is that you would want to batch close to the MTU size so that every writev call maps to a single packet.
That's more a problem of the overall capacity then the single chunk (that's the hidden detail that can cause composite buffers to be used), which cost is mostly due to:
if we batch enough, even by exceeding the MTU, we guarantee decent packet utilization (we amortize the single TCP header packet cost by putting enough payload), even with fragmentation.
I don't want again to define a "wrong" minChunkSize that just makes allocations not efficient due to not correctly using Netty TLABs: is important to remember that missing TLAB allocations would cause using the shared arenas of the allocator and impacting (if unlucky) other threads (I/O and not) then causing a small scalability issue. This was the first reason to just use whatever was the requested required capacity and just let Netty "do its thing" - said that, I agree that we should find a proper value to save obviously bad things to happen (sub-100 B or 1K allocations). I hope next week to send a draft pr with some numbers in
Description
quarkus.resteasy-reactive.output-buffer-size
(see quarkus.resteasy-reactive.output-buffer-size's doc) seems to not just control when responses should contain theContent-Length
header and being chunked, but the initial eagerly allocated capacity of the buffer used to stream content over the connection per response as well, see https://github.com/quarkusio/quarkus/blob/f6851e3371bf3b28c394a186964b1625f8be5ce2/independent-projects/resteasy-reactive/server/vertx/src/main/java/org/jboss/resteasy/reactive/server/vertx/ResteasyReactiveOutputStream.java#L208This "eager" behavior, despite seems optimal while simple (because it allows to perform optimistic batching) can be problematic for the Netty allocator, due to how it works under the hood. Adding more notes below, to explain the effects/consequences.
NOTE: the default buffer size is
8191
byteseg
Performing an allocation of 8K and assuming a I/O caller thread (aka the event loop), in a default configured Netty scenario, is going to use the netty
PoolThreadCache
at https://github.com/netty/netty/blob/c353f4fea52559d09b3811492c92a38aa1887501/buffer/src/main/java/io/netty/buffer/PoolThreadCache.java#L299-L303 ie a so-calledsmall
direct pooled allocation.The cache itself is organized in order to have few
MemoryRegionCache
s (in an array) chosen based on the normalized required capacity; in this case it's going to use the one at thesizeIdx = 31
, obtained by (https://github.com/netty/netty/blob/eb3feb479826949090a9a55c782722ece9b42e50/buffer/src/main/java/io/netty/buffer/SizeClasses.java#L317)[SizeClasses::size2SizeIdx].Every
MemoryRegionCache
has a finite number of pooled (and thread local) buffers ieio.netty.allocator.smallCacheSize
, which is 256 by default.Such pooled and thread local instances are not filled into
MemoryRegionCache
till they are used/allocated the very first time (on demand) - it means that in the worst case; given that we always allocate 8K chunks, we risk to have a (8K * 256) memory footprint per even loop thread, while idle, if a previous run has caused them to be fully allocated (maybe because of 256 concurrent and slow connections running on the same event loop thread).In short: the problem is not the allocation cost, because, in the happy path is a cached already-happened thread-local allocation, but is a matter of idle memory footprint and bad utilization of the existing caching behavior provided by Netty: if we could use different sized allocations here, we can make uses of all the different
sizeIdx
that Netty provide, possibly reducing the idle utilization as well due to retaining just the effectively used ones (or near to what has been the effective usage). In addition, using manyMemoryRegionCache
entries, reduce the chances of un-happy paths, because we are not bounded just to a singleMemoryRegionCache
capacity (that's fairly low, as said - just 256 buffers).Implementation ideas
https://github.com/franz1981/quarkus/tree/append_buffer