I like the first one for sure. Since the size estimation would be recursive, your technique should also make what @c-cube proposed in #157 redundant. (just to confirm I got this right).
Oh, yes indeed, since we'd be storing to a single output instead of one per Protobuf message.
As for the second part, basically I'd like to offer users the option to append the binary encoding to a buffer they already own (which could be a few things), instead of generating a new one to copy from. With `Bin_prot` for example, we can tell the encoder to write to a `bytes` we supply, starting at some offset. With big strings especially, not only would we generate the binary encoding without buffer copying, but it could make its way out to files or network sockets without any additional copying as well (i.e. `Lwt_bytes`).
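For readers unfamiliar with that style of API, this is the general shape being described (a generic sketch, not `Bin_prot`'s exact signatures; `write_u8` and `write_pair` are hypothetical names): the caller supplies the storage and a starting offset, and each writer returns the next free position.

```ocaml
(* Sketch: writers that append into caller-owned storage at an offset. *)
type 'a writer = Bytes.t -> pos:int -> 'a -> int  (* returns next pos *)

let write_u8 : int writer = fun buf ~pos n ->
  Bytes.set buf pos (Char.chr (n land 0xff));
  pos + 1

(* Writers compose: a whole message can land in one caller-owned buffer
   with no intermediate copies. *)
let write_pair (wa : 'a writer) (wb : 'b writer) : ('a * 'b) writer =
  fun buf ~pos (a, b) ->
    let pos = wa buf ~pos a in
    wb buf ~pos b
```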
For nested types, wouldn't that mean traversing the values several times? And performing the varint encoding several times too, to measure how many bytes are required?
This can still be a performance win in practice. I'm just pointing out the potential accidentally-quadratic behavior :)
For nested types, that's the caveat I noted (3rd paragraph of "encoding" section): I'll want to calculate the size of every message exactly once. Some kind of temporary memoization at that level, I suppose.
For varints specifically, note that determining the size doesn't involve actually encoding, but just predicting how many multiples of 7 bits you need. That should be quite fast.
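For illustration, a minimal sketch of that prediction (assuming non-negative `int` values; not the library's actual code):

```ocaml
(* Predict a varint's encoded byte count: one byte per 7-bit group. *)
let varint_size (n : int) : int =
  let rec loop acc n =
    if n < 0x80 then acc else loop (acc + 1) (n lsr 7)
  in
  loop 1 n

let () =
  assert (varint_size 127 = 1);  (* fits in one 7-bit group *)
  assert (varint_size 128 = 2);
  assert (varint_size 300 = 2)
```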
I'll definitely take hints from `Bin_prot`'s implementation when I get to that, because they rely on it heavily and thus invested significant effort in optimizing it. Key parts of it are in C, but the sizing is in OCaml, so it'll be worth checking out for inspiration.
> For nested types, that's the caveat I noted …
You can guarantee a single length calculation for nested objects by serializing them back to front and keeping track of the written bytes (roughly a reverse in-order traversal of the tree and returning the total written bytes up the stack, s.t. you know the total length of the just-written content before writing the tag).
That's not possible, since the length of a block is written before its data and is itself encoded as a Varint. This is why the current implementation encodes recursively and copies buffers at each nesting step. My goal here is to remove this copying by calculating sizes (which is fairly cheap).
You are correct iff you write the tag before the content.
The idea is that instead of writing from the front of your buffer starting with the tag, you start with the deepest-nested object writing it to the end of the buffer, then prepend its tag, passing the written size up the recursion. So you’d 1. write content, 2. write tag, 3. return written bytes incl the tag.
This is definitely possible (and works, I’ve tried it in production) but it requires either a method for approximating the buffer size you need, or a way to copy into a larger buffer if the message turns out to be too large. Also, the complexity may not be warranted here, as this is probably not used for low latency/high throughput situations.
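To make the scheme concrete, here is a minimal self-contained sketch (my illustration, not the production code mentioned above; the fixed 1024-byte buffer and the tag bytes `0x0a`/`0x12` are assumptions):

```ocaml
(* A buffer written back-to-front: [pos] is the first used byte. *)
type wbuf = { data : Bytes.t; mutable pos : int }

let prepend_string b s =
  let len = String.length s in
  b.pos <- b.pos - len;
  Bytes.blit_string s 0 b.data b.pos len

let prepend_varint b n =
  (* Collect the varint's bytes in wire order (low group first), then
     prepend them in reverse so they read correctly left-to-right. *)
  let rec wire n =
    if n < 0x80 then [ Char.chr n ]
    else Char.chr (0x80 lor (n land 0x7f)) :: wire (n lsr 7)
  in
  List.iter
    (fun c -> b.pos <- b.pos - 1; Bytes.set b.data b.pos c)
    (List.rev (wire n))

type tree = Str of string | Msg of tree list

(* 1. write content, 2. write length + tag: the length is always known,
   because the content was just written. *)
let rec encode b = function
  | Str s ->
    prepend_string b s;
    prepend_varint b (String.length s);
    prepend_varint b 0x0a (* hypothetical tag: field 1, wire type 2 *)
  | Msg children ->
    let end_pos = b.pos in
    List.iter (encode b) (List.rev children);
    prepend_varint b (end_pos - b.pos);
    prepend_varint b 0x12 (* hypothetical tag: field 2, wire type 2 *)

let () =
  let b = { data = Bytes.create 1024; pos = 1024 } in
  encode b (Msg [ Str "a"; Msg [ Str "bc" ] ]);
  let encoded = Bytes.sub b.data b.pos (1024 - b.pos) in
  ignore encoded
```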
I like this idea, although I think here it would not be ideal compared to a `Bin_prot`-style sizing pass: without knowing sizes up front, the final buffer ends up larger than the encoded data.*

* If the internal storage used was a Bigstring, a "sub" could be returned for free, which would solve that problem.
Is this really worth the complexity? Allocating a bigstring might fall short (if it's too small), causing a lot of trouble to re-encode again with a larger buffer. Computing sizes means doing a lot of the work twice, or even more (quadratically in the depth of nested messages I believe).
I think 0-copy is not worth it, especially not in OCaml. Just use a small pool of buffers for sub-messages, so as to re-use them (GC pressure is important) and move on :)
With deep structures of messages (as I will deal with), I believe the difference can be significant. Again, this idea presupposes computing sizes exactly once, not multiple times. There's no quadratic work at all here. Calculating the size of a message is cheap if done only once. I will back this up with proper benchmarking before formulating a PR, of course.
Also, FWIW, a fixed-size buffer having to be reallocated larger in @RNabel's scenario would not imply restarting the encoding from scratch. That would be wasteful. For example, `Stdlib.Buffer` reallocates the underlying storage area seamlessly as needed, and so do I in my personal `bytes` and Bigstring buffer I/O modules.
Where do you store the precomputed sizes? The types to be encoded are external, they don't have hidden fields to store their length, as far as I can tell. If you can't store the length, how do you compute it once? That's what evades me.
I haven't landed on a solution for that yet. This issue is a back-burner brainstorm for me, not something I'm implementing this week. (Things like Yojson support for Protobuf `map` are more pressing.) But rest assured that finding an elegant solution to "memoize" those sizes is a requirement, so I'll either come up with something efficient or abandon the issue.
@RNabel I tried the write-from-backward method in #169 and it's not a big improvement on my (synthetic) benchmark. Reusing sub-buffers is simpler and better.
However. I have an idea on how to do the size-computation pass efficiently. The main issue is that, for a message at (say) depth 3, its size will be computed twice: once when it's time to output it, and before that, once to compute the size of its parent (which, being at depth 1, is nested and must be length prefixed). At depth 4, the size will be computed thrice, I think, and so on. The total amount of work is quadratic in the depth of the values, which can be pathological if one has recursive messages.
Anyway, that's why we need to cache sizes if we go this way. To do so, if we standardize on the traversal order (meaning, `encode_foo` and `size_foo` enumerate message fields in the same order), we can keep a stack of sizes (a vector, really; it needs O(1) access) in the same order as the traversal. Every time we compute the size of a sub-message, we push it onto the stack and increment the current offset within that stack. Then, when it's time to actually output the message, we do the same traversal, actually emitting data; the size of the i-th sub-message can then simply be found at index i in the stack.
The order must basically be a post-order (parent messages being traversed after their children).
This is somewhat similar to how imgui and the like use the traversal order (of the tree of widgets) to assign each item a unique index, which can then be used to carry state from one iteration of the main loop to the next. Here we do exactly 2 iterations, and the state we carry is the size of sub-messages.
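A sketch of what that could look like (my own illustration, not generated code; one way to keep the second pass a plain forward cursor is to reserve each message's slot in pre-order and fill it once its children have been sized):

```ocaml
(* Growable int vector holding one size per nested message. *)
type sizes = { mutable a : int array; mutable len : int }

let reserve st =
  if st.len = Array.length st.a then begin
    let a' = Array.make (max 8 (2 * st.len)) 0 in
    Array.blit st.a 0 a' 0 st.len;
    st.a <- a'
  end;
  let i = st.len in
  st.len <- i + 1;
  i

let varint_size n =
  let rec loop acc n = if n < 0x80 then acc else loop (acc + 1) (n lsr 7) in
  loop 1 n

let add_varint buf n =
  let rec go n =
    if n < 0x80 then Buffer.add_char buf (Char.chr n)
    else begin
      Buffer.add_char buf (Char.chr (0x80 lor (n land 0x7f)));
      go (n lsr 7)
    end
  in
  go n

type msg = Leaf of string | Nested of msg list

(* Pass 1: compute every body size exactly once, memoizing as we go. *)
let rec size st = function
  | Leaf s -> 1 + varint_size (String.length s) + String.length s
  | Nested cs ->
    let slot = reserve st in
    let body = List.fold_left (fun acc c -> acc + size st c) 0 cs in
    st.a.(slot) <- body;
    1 + varint_size body + body

(* Pass 2: identical traversal order; [cur] reads sizes back, so nothing
   is recomputed. Tags 0x0a/0x12 are hypothetical. *)
let rec encode st cur buf = function
  | Leaf s ->
    add_varint buf 0x0a;
    add_varint buf (String.length s);
    Buffer.add_string buf s
  | Nested cs ->
    add_varint buf 0x12;
    add_varint buf st.a.(!cur);
    incr cur;
    List.iter (encode st cur buf) cs

let () =
  let st = { a = [||]; len = 0 } in
  let m = Nested [ Leaf "a"; Nested [ Leaf "bc" ] ] in
  let total = size st m in
  let buf = Buffer.create total in
  encode st (ref 0) buf m;
  assert (Buffer.length buf = total)
```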
@c-cube I really like that stack which relies on a deterministic traversal order. A kind of `int Buffer.t` would've been neat for this. I assume you'd do something similar to your `CCVector` (`append` and `get`)?
Yes, that's an idea. It takes 20 lines to do a little custom one for protoc. The issue though is that I think this would need more code generation (for a size function), or a mode where the encode function doesn't actually write data but just pushes sizes… so that's more complicated.
It's also not guaranteed it'll be faster than reusing sub buffers :)
I don't understand the buffer recycling idea. If there was a pool of buffers, sure, there'd be a few fewer allocations (fewer calls to `Buffer.create`), but there'd be just as much copying as we go back up the tree of messages.
There's still copying, but it doesn't mean it's slower than traversing twice and storing sizes in a side-buffer (which is also more complex, I think, because of the need for 2 different kinds of traversals). Benchmarks can be used to assess which is faster in practice.
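For what it's worth, the recycling idea amounts to something like this (a sketch of the concept, not the actual encoder code; `with_buffer` is a hypothetical helper):

```ocaml
(* A tiny pool of Buffer.t reused across nested encodings: copying still
   happens when a sub-buffer is folded into its parent, but the buffers
   themselves are no longer allocated per message. *)
let pool : Buffer.t Stack.t = Stack.create ()

let with_buffer (f : Buffer.t -> 'a) : 'a =
  let b =
    if Stack.is_empty pool then Buffer.create 64
    else begin
      let b = Stack.pop pool in
      Buffer.clear b;
      b
    end
  in
  Fun.protect ~finally:(fun () -> Stack.push b pool) (fun () -> f b)
```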
About the size function: https://github.com/gogo/protobuf/issues/559 . They seem to have settled on writing backwards, which is definitely the cleanest solution conceptually, I think.
Doesn't writing backwards require having a big enough buffer to begin with? Or do they resize periodically like `Buffer` does, except at the head instead of the tail?
Exactly, that's what the benchmark example already does: resizing like any other dynamic vector, but once you resize you copy to the end, not the beginning. So really it just requires a custom buffer that writes from the end.
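The resize rule for such an end-anchored buffer could look like this (a sketch, not the benchmark's code):

```ocaml
type wbuf = { mutable data : Bytes.t; mutable pos : int }
(* Free space is [0, pos); content occupies [pos, length data). *)

let ensure_space b n =
  if b.pos < n then begin
    let old_cap = Bytes.length b.data in
    let new_cap = max (2 * old_cap) (old_cap + n) in
    let data' = Bytes.create new_cap in
    let used = old_cap - b.pos in
    (* Copy existing content to the END of the new storage, so it stays
       contiguous at the right edge and writing can continue leftwards. *)
    Bytes.blit b.data b.pos data' (new_cap - used) used;
    b.data <- data';
    b.pos <- new_cap - used
  end
```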
The benefit of this approach, I think, is that each length is known at the moment it must be written, with no separate sizing pass. However, it's fiddly, and requires a few modifications of the generated code (not too big), so as to use combinators for lists and for fields (since we want to control that field tags/list items are emitted in the reverse order). I think it's still worth it if we want a solution that's robust at scale.
Yes, I really like that idea. I can't think of other uses for a sizing function besides actual encoding, so if we can avoid it entirely, we should. It seems obvious to me, even without a formal benchmark, that writing in reverse once would be faster than calculating the size (properly memoized) and then doing the same writing, but forwards.
My only concern is that without knowing sizes, we end up with an area of memory which is larger than what we need to return. That wouldn't be a problem with a `bigstring` (its `sub()` is "free"), but it would be with `bytes`, which would require blitting to a brand-new shorter one, or returning an `int * bytes` (starting position + data) tuple if callers can handle that. 🤔
That's always been the case, even with the current simple `Buffer.t`. Allocating the exact size is not realistic imho. The better way is to reuse the encoder (with the recently added `Encoder.clear`/`Encoder.reset` :wink:) so that you don't allocate at all in the long run.
It might be trivial to even eliminate that final alloc+blit for users who don't want it. For example, if the new "from-the-end" internal buffer used a `bigstring` internally (could make sense since we're talking about it being long-lived, although I heard I/O to those is slower?), then there could be two flavors of the encoder API:

- A "to `bytes`" function would `blit` the useful part of the internal `bigstring` into a fresh `bytes`, which we're already doing anyway, as you pointed out.
- A "to `bigstring`" function would return a `sub` directly and invalidate the internal buffer (relinquish ownership).

(Using a `bigstring` might need vendoring a few of the stubs from `Bigstringaf` though, since the `Stdlib` is still lacking those.)
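Here's roughly what those two flavors could look like (the names `to_bytes`/`to_bigstring` and the `enc` record are hypothetical, for illustration only; output is assumed right-aligned, with `pos` the first used index):

```ocaml
type bigstring =
  (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t

type enc = { mutable data : bigstring; mutable pos : int }

(* Copying flavor: blit the useful suffix into a fresh bytes. The
   char-by-char loop stands in for a blit stub the Stdlib lacks. *)
let to_bytes (e : enc) : Bytes.t =
  let len = Bigarray.Array1.dim e.data - e.pos in
  let b = Bytes.create len in
  for i = 0 to len - 1 do
    Bytes.set b i (Bigarray.Array1.get e.data (e.pos + i))
  done;
  b

(* Zero-copy flavor: return a sub of the storage; the encoder then
   relinquishes ownership and needs fresh storage before reuse. *)
let to_bigstring (e : enc) : bigstring =
  Bigarray.Array1.sub e.data e.pos (Bigarray.Array1.dim e.data - e.pos)
```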
That being said, writing in reverse to a single area will already be a massive improvement!
Or, avoid bigstrings entirely, and return `(bytes, offset)` on demand :). That also has the benefit of not being horribly slow if people don't reuse the buffers, and avoids some dependencies.
Yes, returning the data storage and the offset instead of blitting would be nice to have as an option. I'll still need to blit that into a `bigstring` for some I/O, but at least that's one copy instead of two.
It'd be fun to abstract away the storage entirely, but since we want to write in reverse that'd be very non-standard.
If you do non-blocking I/O, then yes. If you do blocking I/O, you can generally use the byte buffer + offset directly :)
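For the blocking case, a `(bytes, offset)` result pairs naturally with plain `Unix` writes (a sketch; `write_all` is a hypothetical helper):

```ocaml
(* Write the encoded suffix [off, length buf) of [buf] to [fd],
   looping because a single write may be partial. *)
let write_all (fd : Unix.file_descr) (buf : Bytes.t) (off : int) : unit =
  let rec go off len =
    if len > 0 then begin
      let n = Unix.write fd buf off len in
      go (off + n) (len - n)
    end
  in
  go off (Bytes.length buf - off)
```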
I found this very interesting document: https://perfetto.dev/docs/design-docs/protozero
There's in particular a mention of "scattered I/O", which certainly must help with nested messages:

> A key part of the ProtoZero design is supporting direct serialization on non-globally-contiguous sequences of contiguous memory regions.
With https://github.com/mransan/ocaml-protoc/pull/223 we now write back-to-front, in one go. Benchmarks show that, if one reuses the encoder, there are close to 0 allocations done in the common case.
For example on my i7-12700, here are a few benchmarks ("current" is the new master branch, "nested-bufs" is what we had until recently with cached nested buffers):
292 bytes value:
```
                                      Rate  basic-buffer  write-backward-c  nested-bufs  write-backward-noinline  write-backward  current
basic-buffer            1392231+- 84365/s            --          -18%          -18%                  -19%              -25%          -48%
write-backward-c        1693318+-  5067/s           22%            --         [-0%]                 [-1%]               -9%          -37%
nested-bufs             1697419+- 61896/s           22%          [0%]            --                 [-1%]               -9%          -37%
write-backward-noinline 1708411+-122597/s           23%          [1%]          [1%]                    --              [-8%]         -37%
write-backward          1864098+- 20754/s           34%           10%           10%                  [9%]                --          -31%
current                 2693451+- 13252/s           93%           59%           59%                   58%               44%            --
```

```
                        minor_allocs/iter  major_allocs/iter  promoted/iter
basic-buffer                      6.23 kB                0 B            2 B
current                             120 B                0 B            0 B
nested-bufs                       3.90 kB                0 B            0 B
write-backward                    3.90 kB                0 B            0 B
write-backward-noinline           5.87 kB                0 B            0 B
write-backward-c                  3.90 kB                0 B            0 B
```
3050 bytes value:
```
                                    Rate  basic-buffer  write-backward-c  nested-bufs  write-backward-noinline  write-backward  current
basic-buffer             112964+-  522/s            --          -27%          -27%                  -32%              -32%          -55%
write-backward-c         153812+- 9692/s           36%            --         [-1%]                 [-7%]             [-8%]          -38%
nested-bufs              155589+-  463/s           38%          [1%]            --                   -6%               -7%          -38%
write-backward-noinline  165114+-  631/s           46%          [7%]            6%                    --              [-1%]         -34%
write-backward           166775+- 6660/s           48%          [8%]            7%                  [1%]                --          -33%
current                  249981+- 6738/s          121%           63%           61%                   51%               50%            --
```

```
                        minor_allocs/iter  major_allocs/iter  promoted/iter
basic-buffer                     81.91 kB            4.13 kB          126 B
current                           1.24 kB                0 B            0 B
nested-bufs                      41.09 kB                0 B            8 B
write-backward                   41.09 kB                0 B            4 B
write-backward-noinline          61.87 kB                0 B            5 B
write-backward-c                 41.09 kB                0 B            4 B
```
The new encoder can write at > 700MB/s (on a very good desktop machine, granted).
I have a spare-time idea to improve encoding performance: encoding to a single buffer. This low-hanging fruit would significantly reduce allocations when dealing with trees of messages, by writing sequentially to a single `Buffer.t`.

In Jane Street's `Bin_prot`, each type has a function which predicts the binary encoding size for its current state. I think it would be quite doable to implement a kind of `…_payload_pb.size` function which `…_payload_pb.encode_…` could then use to know its own size and its nested messages' sizes, to avoid having to use temporary buffers. OCaml's memory representation of `string` and `bytes` already includes the size, which is thus very efficient to get, and it's trivial to predict the size of a Varint for field IDs, enums, ints and message sizes themselves. The main issue to watch out for there would be to take note of sizes collected during the process, to avoid calculating them more than once.

A nice-to-have addition to this could be to abstract reading and writing into a module signature instead of using `Buffer` directly. This would allow users to use other types of buffers or channels.