ResteasyReactiveOutputStream could make better use of the Netty (direct) pooled allocator

franz1981 commented 1 year ago

Description

quarkus.resteasy-reactive.output-buffer-size (see quarkus.resteasy-reactive.output-buffer-size's doc) seems to not just control when responses should contain the Content-Length header and being chunked, but the initial eagerly allocated capacity of the buffer used to stream content over the connection per response as well, see https://github.com/quarkusio/quarkus/blob/f6851e3371bf3b28c394a186964b1625f8be5ce2/independent-projects/resteasy-reactive/server/vertx/src/main/java/org/jboss/resteasy/reactive/server/vertx/ResteasyReactiveOutputStream.java#L208

This "eager" behavior, despite seems optimal while simple (because it allows to perform optimistic batching) can be problematic for the Netty allocator, due to how it works under the hood. Adding more notes below, to explain the effects/consequences.

NOTE: the default buffer size is 8191 bytes

eg

Performing an allocation of 8K and assuming a I/O caller thread (aka the event loop), in a default configured Netty scenario, is going to use the netty PoolThreadCache at https://github.com/netty/netty/blob/c353f4fea52559d09b3811492c92a38aa1887501/buffer/src/main/java/io/netty/buffer/PoolThreadCache.java#L299-L303 ie a so-called small direct pooled allocation.

The cache itself is organized in order to have few MemoryRegionCaches (in an array) chosen based on the normalized required capacity; in this case it's going to use the one at the sizeIdx = 31, obtained by (https://github.com/netty/netty/blob/eb3feb479826949090a9a55c782722ece9b42e50/buffer/src/main/java/io/netty/buffer/SizeClasses.java#L317)[SizeClasses::size2SizeIdx].

Every MemoryRegionCache has a finite number of pooled (and thread local) buffers ie io.netty.allocator.smallCacheSize , which is 256 by default.

Such pooled and thread local instances are not filled into MemoryRegionCache till they are used/allocated the very first time (on demand) - it means that in the worst case; given that we always allocate 8K chunks, we risk to have a (8K * 256) memory footprint per even loop thread, while idle, if a previous run has caused them to be fully allocated (maybe because of 256 concurrent and slow connections running on the same event loop thread).

In short: the problem is not the allocation cost, because, in the happy path is a cached already-happened thread-local allocation, but is a matter of idle memory footprint and bad utilization of the existing caching behavior provided by Netty: if we could use different sized allocations here, we can make uses of all the different sizeIdx that Netty provide, possibly reducing the idle utilization as well due to retaining just the effectively used ones (or near to what has been the effective usage). In addition, using many MemoryRegionCache entries, reduce the chances of un-happy paths, because we are not bounded just to a single MemoryRegionCache capacity (that's fairly low, as said - just 256 buffers).

Implementation ideas

https://github.com/franz1981/quarkus/tree/append_buffer

quarkus-bot[bot] commented 1 year ago

/cc @FroMage (resteasy-reactive), @Sgitario (resteasy-reactive), @stuartwdouglas (resteasy-reactive)

geoand commented 1 year ago

Thanks for writing this up @franz1981.

What do you propose we do?

franz1981 commented 1 year ago

@geoand I'm preparing a PoC, but the ideas that come in my mind are:

given that we know when we are interested into writing and close the stream right after, we could add an API method to ResteasyReactiveOutputStream to hint it before Jackson will perform its stream::write: this would enable the stream to be aware that no more data will be ever added (as the stream will be closed right after) and can make use to allocate a right-sized buffer (if within the configured outputBufferSize )
we can think about a general purpose allocation method, but if ResteasyReactiveOutputStream is used in other cases where we don't know how many times we're going to use the stream to add more data, we loose the nice batching behaviour we have now

Probably you can help me understand how other streaming cases would use this stream to append chunks; if it can happen

franz1981 commented 1 year ago

I've learnt a bit more how ResteasyReactiveOutputStream is used by BasicServerJacksonMessageBodyWriter and UTF8JsonGenerator: UTF8JsonGenerator would flush its encoded content if it exceed 8000 bytes (that's the value in the second position on https://github.com/FasterXML/jackson-core/blob/31d4d85dcb25ca6699294db69591388470b9f8fc/src/main/java/com/fasterxml/jackson/core/util/BufferRecycler.java#L76), to keep on encoding. if UTF8JsonGenerator won't exceed such value, it would flush to the underlying output stream (ie ResteasyReactiveOutputStream) the whole encoded content as a whole, on close (see https://github.com/FasterXML/jackson-core/blob/31d4d85dcb25ca6699294db69591388470b9f8fc/src/main/java/com/fasterxml/jackson/core/json/UTF8JsonGenerator.java#L1227).

It means that we cannot rely on the fact that BasicServerJacksonMessageBodyWriter would issue a single flush to ResteasyReactiveOutputStream, because it depends by Jackson's buffer capacity as well. But is true as well that if Jackson perform more then one flush, we won't waste that much bytes by sizing the first chunk by 8K (that's the supposed amount of data Jackson would flush in case of multiple flushes) vs the default stream buffer size of 8191 bytes.

geoand commented 1 year ago

Probably you can help me understand how other streaming cases would use this stream to append chunks; if it can happen

In almost all cases, we know the full object which will be written (like in the case of Jackson, where we obviously know the POJO). However, there is also the streaming data case (where a user returns a Multi<T> from their method) in which case we have no idea how much data there will be.

stuartwdouglas commented 1 year ago

Also the user can just directly use this stream.

Something I did notice you mentioned was this: 'Performing an allocation of 8K and assuming a I/O caller thread (aka the event loop), '

This is a blocking stream, it should not really be used by IO threads as it can block them?

franz1981 commented 1 year ago

@stuartwdouglas Yep, my point was to stress the case where Netty is using thread local buffers and simplify the example. I believe in the event loop it shouldn't block, I have to check..

franz1981 commented 1 year ago

@geoand @stuartwdouglas Proposal is at https://github.com/franz1981/quarkus/tree/append_buffer

~~Right now I got 3 failed tests using the AppendBuffer::withExactChunks version (and working on fixing it)~~ All tests on reasteasy-reactive-vertx work with AppendBuffer::withExactChunks as well.

The idea is to have Jackson able to use a "special" buffer that is aware that the caller is already batchy and don't try to be "smart".

The append buffer I've designed can work in 3 different ways:

AppendBuffer::withExactChunks: where every write accumulate buffers of exact requested size, unless reaching the max capacity, then caller must flush the current batch (and that's going to create a single composite ByteBuf for Vertx/Netty) to keep on appending
AppendBuffer::withEagerChunks is the same as it is now ie buffers are always eagerly allocated of a given chunk size and will be flushed when no space is available anymore
not yet exposed yet API: overall capacity before flush is different from chunk size; eg 8K overall capacity, but accumulating eagerly allocated 1K buffer sized - unless the last one before reaching the max capacity, which can be smaller

I would use AppendBuffer::withExactChunks with batchy appenders (a-la Jackson) that's using already some internal buffer to accumulate writes and AppendBuffer::withEagerChunks elsewhere (to not impact existing use cases).

It's important for AppendBuffer::withExactChunks to work at its best that the "batchy" caller maximize itself the component added to the append buffer by NOT using small batches - it has been designed with that in mind.

I didn't yet implemented any mechanism to release components in case of errors, but I would likely implement those in the append buffer itself, for simplicity - the original code was very well designed to save leak and I see no tests (I would add some to be sure to keep this nice feature).

Feedbacks are welcome before sending a PR

stuartwdouglas commented 1 year ago

Can I propose a slightly different approach?

Use the withExactChunks approach for all writes over 256 bytes
If the write is smaller than 256 allocate the full 256 byte buffer to prevent pathologically bad performance for the write(int) method

Most uses will be a single exact write, so withExactChunks is the correct approach, and even multiple larger writes it will work fine as long as the writes are then merged into a composite buffer so there is only a single writev call.

You might also want to consider a high/low water mark approach to max buffer size. E.g. once the buffer hits 8k we flush it, but if we have a single large write we can allocate up to 16k or more because we know it will immediately be sent to Netty.

It would also be good to know what the actual optimal chunk size is these days. When I was doing Undertow I got the best performance using 16kb write calls, but that was a long time ago, and all hardware and software involved is now massivly obsolete.

So to try and sum up all those points it would look something like:

Allocate exact buffers sizes for everything over 256 bytes
For smaller writes allocate up to 256 so we are not allocating one buffer per byte (possible optimization: if they call the 'write(int)' method we assume its all single byte writes, and allocate a larger buffer)
Once we have >8k worth of buffers write them out, as long as the total allocation will be <16k
If the amount would exceed 16k then allocate a buffer that will take the payload to 16k, and the rest can be written out in a new call

Obviously all these numbers are made up, and should be replaced with numbers from hard evidence rather than my gut feeling.

stuartwdouglas commented 1 year ago

On second thoughts the 'high water/low water' approach might not be the best idea, especially if your batch size is properly calibrated. It could result in a little bit of overflow being sent in its own packet instead of being batched.

geoand commented 1 year ago

I would propose testing both withExactChunks and withEagerChunks on some typical json output and then try to decide which strategy fits best for the common scenario (or see if we can adapt based on a pattern).

BTW, @franz1981 you probably also want to run the tests in extensions/resteasy-reactive/quarkus-resteasy-reactive/deployment and extensions/resteasy-reactive/quarkus-resteasy-reactive-jackson/deployment

geoand commented 1 year ago

I would really like to get something along these lines in however, because as I understand it by what Franz has said from his preliminary investigations, this will lead to non trivial gains in terms of memory usage

franz1981 commented 1 year ago

@stuartwdouglas

To summarize, we have here a mix of different things we would like to do right I see:

be mechanical sympathetic with the networking stack re writev vs write buffer size ie how much small the buffers can be in order to not make writev inefficient for the linux kernel?
(it was my point) be mechanical sympathetic with the Netty allocator, to make it fully use the capacity of its TLABs (that are grouped in different aligned sizes) while used by batchy callers - and by consequence, do not waste TLAB memory for small writes that cannot be batched (HTTP 1.1 without pipelining with "small" responses)

In addition, I would add the behaviour of HTTP 11 (didn't checked for HTTP 2) at https://github.com/netty/netty/blob/9f10a284712bbd7f4f8ded7711a50bfaff38f567/codec-http/src/main/java/io/netty/handler/codec/http/HttpObjectEncoder.java#L334

In short, if the previously allocated headers buffers (Netty uses a statistics to size them, based on past data points) are big enough to host the provided content (ie what quarkus is passing), it would write it there instead of passing it to the transport as it is ie potentially a composite buffer.

The point made by @stuartwdouglas is fair: right now the buffer is using a bi-modal logic where you can disable the limits in the single writes (as long as they fit into the over all total capacity offered - that's ~8K by default), but I can slightly change it in order to still have the notion of total capacity (which influence when an append cause the caller to perform a flush -> and a syscall to write/writev, by consequence) but introduce the concept of minChunkSize that's used to decide what's the "minimal" (opposed to the current chunkSize) capacity of the single chunks where to batch writes into. If the incoming write is already bigger then minChunkSize, it would use the exact one (till the overall capacity), while if smaller, it will still allocate minChunkSize. to cope for "future" writes. This should be able to work as per @stuartwdouglas suggestion and save having different buffer configurations while using Jackson vs the others (that's good).

The sole concern here (meaning that I gotta band my head on this a bit more), is that batching for a minimum of minChunkSize will still underutilized the different TLABs available to Netty: I'll take a look better how they are constructed (in which classes of size) again. But still: if we batch we would save buffer instances as well, that will have an on-heap impact too (and likely saving composite buffers in the super-happy path, that's great, really).

franz1981 commented 1 year ago

I've added a further commit to modify the behavior as suggested by @stuartwdouglas at https://github.com/franz1981/quarkus/commit/f428f937593c3b3cbaf6dbead6d71887129d7540 (same branch as before, but new commit): right now is using a magic value of 128 ie 2 cache lines, but we can choose something more appropriate in case. I've left the option of a fully caller-driven chunk allocation version ie withExactChunks because I still believe that in cases where we can trust a batchy caller, there's no need to try being smart, but TBH if I made the check on the last buffer capacity left cheap, it shouldn't be needed (that's why I've introduced this option too)

I am now checking how the size classes of Netty are composed and searching on https://www.kernel.org/doc/html/latest/index.html if I can find anything interesting re writev suggested buffer size

stuartwdouglas commented 1 year ago

Something else to consider if we want absolute maximum performance is that the first write will also have the headers, while subsequent writes won't (although they may have chunking data).

Does Netty handle this sort of batching optimisation? Maybe a Netty handler is actually the correct place because you can just deal with fixed size buffers without needing to know about headers etc.

My gut feeling is that you would want to batch close to the MTU size so that every writev call maps to a single packet.

franz1981 commented 1 year ago

Does Netty handle this sort of batching optimisation?

The netty http 1.1 encoder is smart enough to try estimating the http headers based on a statistics of the previously sent one, allocating a single buffer in one go - and if the payload (non-chunked) is small enough, can be usually written directly into it (due to padding for alignments of the Netty allocator). Hence, Netty try hard at the handler level to save having indivitual (and small) buffers to deal with. Vertx as well try hard to flush very late (but without pipelining this is sadly just a latency delay here). I can tell that we try to be batchy at any level of the stack. But without any hard constrain on the size of such batches; just "moar is better" let's say.

My gut feeling is that you would want to batch close to the MTU size so that every writev call maps to a single packet.

That's more a problem of the overall capacity then the single chunk (that's the hidden detail that can cause composite buffers to be used), which cost is mostly due to:

individual ref cnt on Netty (both acquire and release after written)
pointer chasing cost on the kernel to place the buffer
smaller batches while kernel memcpy (in a batchy way) to the TCP buffer
buffer wrapper infrastructural cost

if we batch enough, even by exceeding the MTU, we guarantee decent packet utilization (we amortize the single TCP header packet cost by putting enough payload), even with fragmentation.

franz1981 commented 1 year ago

I don't want again to define a "wrong" minChunkSize that just makes allocations not efficient due to not correctly using Netty TLABs: is important to remember that missing TLAB allocations would cause using the shared arenas of the allocator and impacting (if unlucky) other threads (I/O and not) then causing a small scalability issue. This was the first reason to just use whatever was the requested required capacity and just let Netty "do its thing" - said that, I agree that we should find a proper value to save obviously bad things to happen (sub-100 B or 1K allocations). I hope next week to send a draft pr with some numbers in

quarkusio / quarkus

ResteasyReactiveOutputStream could make better use of the Netty (direct) pooled allocator #32546

Description

Implementation ideas