I like the first one for sure. Since the size estimation would be recursive, your technique should also make what @c-cube proposed in #157 redundant. (just to confirm I got this right).
Oh, yes indeed, since we'd be storing to a single output instead of one per Protobuf message.
As for the second part, basically I'd like to offer users the option to append the binary encoding to a buffer they already own (which could be a few things), instead of generating a new one to copy from. With `Bin_prot` for example, we can tell the encoder to write to a `bytes` we supply, starting at some offset. With big strings especially, not only would we generate the binary encoding without buffer copying, but it could make its way out to files or network sockets without any additional copying as well (i.e. `Lwt_bytes`).
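For readers unfamiliar with that style of API, this is the general shape being described (a generic sketch, not `Bin_prot`'s exact signatures; `write_u8` and `write_pair` are hypothetical names): the caller supplies the storage and a starting offset, and each writer returns the next free position.

```ocaml
(* Sketch: writers that append into caller-owned storage at an offset. *)
type 'a writer = Bytes.t -> pos:int -> 'a -> int  (* returns next pos *)

let write_u8 : int writer = fun buf ~pos n ->
  Bytes.set buf pos (Char.chr (n land 0xff));
  pos + 1

(* Writers compose: a whole message can land in one caller-owned buffer
   with no intermediate copies. *)
let write_pair (wa : 'a writer) (wb : 'b writer) : ('a * 'b) writer =
  fun buf ~pos (a, b) ->
    let pos = wa buf ~pos a in
    wb buf ~pos b
```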
For nested types, wouldn't that mean traversing the values several times? And performing the varint encoding several times too, to measure how many bytes are required?
This can still be a performance win in practice. I'm just pointing out the potential accidentally-quadratic behavior :)
For nested types, that's the caveat I noted (3rd paragraph of "encoding" section): I'll want to calculate the size of every message exactly once. Some kind of temporary memoization at that level, I suppose.
For varints specifically, note that determining the size doesn't involve actually encoding, but just predicting how many multiples of 7 bits you need. That should be quite fast.
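For illustration, a minimal sketch of that prediction (assuming non-negative `int` values; not the library's actual code):

```ocaml
(* Predict a varint's encoded byte count: one byte per 7-bit group. *)
let varint_size (n : int) : int =
  let rec loop acc n =
    if n < 0x80 then acc else loop (acc + 1) (n lsr 7)
  in
  loop 1 n

let () =
  assert (varint_size 127 = 1);  (* fits in one 7-bit group *)
  assert (varint_size 128 = 2);
  assert (varint_size 300 = 2)
```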
I'll definitely take hints from `Bin_prot`'s implementation when I get to that, because they rely on it heavily and thus invested significant effort in optimizing it. Key parts of it are in C, but the sizing is in OCaml, so it'll be worth checking out for inspiration.
> For nested types, that's the caveat I noted …
You can guarantee a single length calculation for nested objects by serializing them back to front and keeping track of the written bytes (roughly a reverse in-order traversal of the tree and returning the total written bytes up the stack, s.t. you know the total length of the just-written content before writing the tag).
That's not possible, since the length of a block is written before its data and is itself encoded as a Varint. This is why the current implementation encodes recursively and copies buffers at each nesting step. My goal here is to remove this copying by calculating sizes (which is fairly cheap).
You are correct iff you write the tag before the content.
The idea is that instead of writing from the front of your buffer starting with the tag, you start with the deepest-nested object writing it to the end of the buffer, then prepend its tag, passing the written size up the recursion. So you’d 1. write content, 2. write tag, 3. return written bytes incl the tag.
This is definitely possible (and works, I’ve tried it in production) but it requires either a method for approximating the buffer size you need, or a way to copy into a larger buffer if the message turns out to be too large. Also, the complexity may not be warranted here, as this is probably not used for low latency/high throughput situations.
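To make the scheme concrete, here is a minimal self-contained sketch (my illustration, not the production code mentioned above; the fixed 1024-byte buffer and the tag bytes `0x0a`/`0x12` are assumptions):

```ocaml
(* A buffer written back-to-front: [pos] is the first used byte. *)
type wbuf = { data : Bytes.t; mutable pos : int }

let prepend_string b s =
  let len = String.length s in
  b.pos <- b.pos - len;
  Bytes.blit_string s 0 b.data b.pos len

let prepend_varint b n =
  (* Collect the varint's bytes in wire order (low group first), then
     prepend them in reverse so they read correctly left-to-right. *)
  let rec wire n =
    if n < 0x80 then [ Char.chr n ]
    else Char.chr (0x80 lor (n land 0x7f)) :: wire (n lsr 7)
  in
  List.iter
    (fun c -> b.pos <- b.pos - 1; Bytes.set b.data b.pos c)
    (List.rev (wire n))

type tree = Str of string | Msg of tree list

(* 1. write content, 2. write length + tag: the length is always known,
   because the content was just written. *)
let rec encode b = function
  | Str s ->
    prepend_string b s;
    prepend_varint b (String.length s);
    prepend_varint b 0x0a (* hypothetical tag: field 1, wire type 2 *)
  | Msg children ->
    let end_pos = b.pos in
    List.iter (encode b) (List.rev children);
    prepend_varint b (end_pos - b.pos);
    prepend_varint b 0x12 (* hypothetical tag: field 2, wire type 2 *)

let () =
  let b = { data = Bytes.create 1024; pos = 1024 } in
  encode b (Msg [ Str "a"; Msg [ Str "bc" ] ]);
  let encoded = Bytes.sub b.data b.pos (1024 - b.pos) in
  ignore encoded
```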
I like this idea, although I think here it would not be ideal compared to a `Bin_prot`-style sizing pass: without knowing sizes up front, the final buffer ends up larger than the encoded data.*

* If the internal storage used was a Bigstring, a "sub" could be returned for free, which would solve that problem.
Is this really worth the complexity? Allocating a bigstring might fall short (if it's too small), causing a lot of trouble to re-encode again with a larger buffer. Computing sizes means doing a lot of the work twice, or even more (quadratically in the depth of nested messages I believe).
I think 0-copy is not worth it, especially not in OCaml. Just use a small pool of buffers for sub-messages, so as to re-use them (GC pressure is important) and move on :)
With deep structures of messages (as I will deal with), I believe the difference can be significant. Again, this idea presupposes computing sizes exactly once, not multiple times. There's no quadratic work at all here. Calculating the size of a message is cheap if done only once. I will back this up with proper benchmarking before formulating a PR, of course.
Also, FWIW, a fixed-size buffer having to be reallocated larger in @RNabel's scenario would not imply restarting the encoding from scratch. That would be wasteful. For example, `Stdlib.Buffer` reallocates the underlying storage area seamlessly as needed, and so do I in my personal `bytes` and Bigstring buffer I/O modules.
Where do you store the precomputed sizes? The types to be encoded are external, they don't have hidden fields to store their length, as far as I can tell. If you can't store the length, how do you compute it once? That's what evades me.
I haven't landed on a solution for that yet. This issue is a back-burner brainstorm for me, not something I'm implementing this week. (Things like Yojson support for Protobuf `map` are more pressing.) But rest assured that finding an elegant solution to "memoize" those sizes is a requirement, so I'll either come up with something efficient or abandon the issue.
@RNabel I tried the write-from-backward method in #169 and it's not a big improvement on my (synthetic) benchmark. Reusing sub-buffers is simpler and better.
However. I have an idea on how to do the size-computation pass efficiently. The main issue is that, for a message at (say) depth 3, its size will be computed twice: once when it's time to output it, and before that, once to compute the size of its parent (which, being at depth 1, is nested and must be length prefixed). At depth 4, the size will be computed thrice, I think, and so on. The total amount of work is quadratic in the depth of the values, which can be pathological if one has recursive messages.
Anyway, that's why we need to cache sizes if we go this way. To do so, if we standardize on the traversal order (meaning, `encode_foo` and `size_foo` enumerate message fields in the same order), we can keep a stack of sizes (a vector, really; it needs O(1) access) in the same order as the traversal. Every time we compute the size of a sub-message, we push it onto the stack and increment the current offset within that stack. Then, when it's time to actually output the message, we do the same traversal, actually emitting data; the size of the i-th sub-message can then simply be found at index i in the stack.
The order must basically be a post-order (parent messages being traversed after their children).
This is somewhat similar to how imgui and the like use the traversal order (of the tree of widgets) to assign each item a unique index, which can then be used to carry state from one iteration of the main loop to the next. Here we do exactly 2 iterations, and the state we carry is the size of sub-messages.
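A sketch of what that could look like (my own illustration, not generated code; one way to keep the second pass a plain forward cursor is to reserve each message's slot in pre-order and fill it once its children have been sized):

```ocaml
(* Growable int vector holding one size per nested message. *)
type sizes = { mutable a : int array; mutable len : int }

let reserve st =
  if st.len = Array.length st.a then begin
    let a' = Array.make (max 8 (2 * st.len)) 0 in
    Array.blit st.a 0 a' 0 st.len;
    st.a <- a'
  end;
  let i = st.len in
  st.len <- i + 1;
  i

let varint_size n =
  let rec loop acc n = if n < 0x80 then acc else loop (acc + 1) (n lsr 7) in
  loop 1 n

let add_varint buf n =
  let rec go n =
    if n < 0x80 then Buffer.add_char buf (Char.chr n)
    else begin
      Buffer.add_char buf (Char.chr (0x80 lor (n land 0x7f)));
      go (n lsr 7)
    end
  in
  go n

type msg = Leaf of string | Nested of msg list

(* Pass 1: compute every body size exactly once, memoizing as we go. *)
let rec size st = function
  | Leaf s -> 1 + varint_size (String.length s) + String.length s
  | Nested cs ->
    let slot = reserve st in
    let body = List.fold_left (fun acc c -> acc + size st c) 0 cs in
    st.a.(slot) <- body;
    1 + varint_size body + body

(* Pass 2: identical traversal order; [cur] reads sizes back, so nothing
   is recomputed. Tags 0x0a/0x12 are hypothetical. *)
let rec encode st cur buf = function
  | Leaf s ->
    add_varint buf 0x0a;
    add_varint buf (String.length s);
    Buffer.add_string buf s
  | Nested cs ->
    add_varint buf 0x12;
    add_varint buf st.a.(!cur);
    incr cur;
    List.iter (encode st cur buf) cs

let () =
  let st = { a = [||]; len = 0 } in
  let m = Nested [ Leaf "a"; Nested [ Leaf "bc" ] ] in
  let total = size st m in
  let buf = Buffer.create total in
  encode st (ref 0) buf m;
  assert (Buffer.length buf = total)
```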
@c-cube I really like that stack which relies on a deterministic traversal order. A kind of `int Buffer.t` would've been neat for this. I assume you'd do something similar to your `CCVector` (`append` and `get`)?
Yes, that's an idea. It takes 20 lines to do a little custom one for protoc. The issue though is that I think this would need more code generation (for a size function), or a mode where the encode function doesn't actually write data but just pushes sizes… so that's more complicated.
It's also not guaranteed it'll be faster than reusing sub buffers :)
I don't understand the buffer recycling idea. If there was a pool of buffers, sure, there'd be a few fewer allocations (fewer calls to `Buffer.create`), but there'd be just as much copying as we go back up the tree of messages.
There's still copying, but it doesn't mean it's slower than traversing twice and storing sizes in a side-buffer (which is also more complex, I think, because of the need for 2 different kinds of traversals). Benchmarks can be used to assess which is faster in practice.
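For what it's worth, the recycling idea amounts to something like this (a sketch of the concept, not the actual encoder code; `with_buffer` is a hypothetical helper):

```ocaml
(* A tiny pool of Buffer.t reused across nested encodings: copying still
   happens when a sub-buffer is folded into its parent, but the buffers
   themselves are no longer allocated per message. *)
let pool : Buffer.t Stack.t = Stack.create ()

let with_buffer (f : Buffer.t -> 'a) : 'a =
  let b =
    if Stack.is_empty pool then Buffer.create 64
    else begin
      let b = Stack.pop pool in
      Buffer.clear b;
      b
    end
  in
  Fun.protect ~finally:(fun () -> Stack.push b pool) (fun () -> f b)
```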
About the size function: https://github.com/gogo/protobuf/issues/559 . They seem to have settled on writing backwards, which is definitely the cleanest solution conceptually, I think.
Doesn't writing backwards require having a big enough buffer to begin with? Or do they resize periodically like `Buffer` does, except at the head instead of the tail?
Exactly, that's what the benchmark example already does: resizing like any other dynamic vector, but once you resize you copy to the end, not the beginning. So really it just requires a custom buffer that writes from the end.
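The resize rule for such an end-anchored buffer could look like this (a sketch, not the benchmark's code):

```ocaml
type wbuf = { mutable data : Bytes.t; mutable pos : int }
(* Free space is [0, pos); content occupies [pos, length data). *)

let ensure_space b n =
  if b.pos < n then begin
    let old_cap = Bytes.length b.data in
    let new_cap = max (2 * old_cap) (old_cap + n) in
    let data' = Bytes.create new_cap in
    let used = old_cap - b.pos in
    (* Copy existing content to the END of the new storage, so it stays
       contiguous at the right edge and writing can continue leftwards. *)
    Bytes.blit b.data b.pos data' (new_cap - used) used;
    b.data <- data';
    b.pos <- new_cap - used
  end
```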
The benefit of this approach, I think, is that each length is known at the moment it must be written, with no separate sizing pass. However, it's fiddly, and requires a few modifications of the generated code (not too big), so as to use combinators for lists and for fields (since we want to control that field tags/list items are emitted in the reverse order). I think it's still worth it if we want a solution that's robust at scale.
Yes, I really like that idea. I can't think of other uses for a sizing function besides actual encoding, so if we can avoid it entirely, we should. It seems obvious to me, even without a formal benchmark, that writing in reverse once would be faster than calculating the size (properly memoized) and then doing the same writing, but forwards.
My only concern is that without knowing sizes, we end up with an area of memory which is larger than what we need to return. That wouldn't be a problem with a `bigstring` (its `sub()` is "free"), but it would be with `bytes`, which would require blitting to a brand-new shorter one, or returning an `int * bytes` (starting position + data) tuple if callers can handle that. 🤔
That's always been the case, even with the current simple `Buffer.t`. Allocating the exact size is not realistic imho. The better way is to reuse the encoder (with the recently added `Encoder.clear`/`Encoder.reset` :wink:) so that you don't allocate at all in the long run.
It might be trivial to even eliminate that final alloc+blit for users who don't want it. For example, if the new "from-the-end" internal buffer used a `bigstring` internally (could make sense since we're talking about it being long-lived, although I heard I/O to those is slower?), then there could be two flavors of the encoder API:

- A "to `bytes`" function would `blit` the useful part of the internal `bigstring` into a fresh `bytes`, which we're already doing anyway, as you pointed out.
- A "to `bigstring`" function would return a `sub` directly and invalidate the internal buffer (relinquish ownership).

(Using a `bigstring` might need vendoring a few of the stubs from `Bigstringaf` though, since the `Stdlib` is still lacking those.)
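Here's roughly what those two flavors could look like (the names `to_bytes`/`to_bigstring` and the `enc` record are hypothetical, for illustration only; output is assumed right-aligned, with `pos` the first used index):

```ocaml
type bigstring =
  (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t

type enc = { mutable data : bigstring; mutable pos : int }

(* Copying flavor: blit the useful suffix into a fresh bytes. The
   char-by-char loop stands in for a blit stub the Stdlib lacks. *)
let to_bytes (e : enc) : Bytes.t =
  let len = Bigarray.Array1.dim e.data - e.pos in
  let b = Bytes.create len in
  for i = 0 to len - 1 do
    Bytes.set b i (Bigarray.Array1.get e.data (e.pos + i))
  done;
  b

(* Zero-copy flavor: return a sub of the storage; the encoder then
   relinquishes ownership and needs fresh storage before reuse. *)
let to_bigstring (e : enc) : bigstring =
  Bigarray.Array1.sub e.data e.pos (Bigarray.Array1.dim e.data - e.pos)
```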
That being said, writing in reverse to a single area will already be a massive improvement!
Or, avoid bigstrings entirely, and return `(bytes, offset)` on demand :). That also has the benefit of not being horribly slow if people don't reuse the buffers, and avoids some dependencies.
Yes, returning the data storage and the offset instead of blitting would be nice to have as an option. I'll still need to blit that into a `bigstring` for some I/O, but at least that's one copy instead of two.
It'd be fun to abstract away the storage entirely, but since we want to write in reverse that'd be very non-standard.
If you do non-blocking I/O, then yes. If you do blocking I/O, you can generally use the byte buffer + offset directly :)
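For the blocking case, a `(bytes, offset)` result pairs naturally with plain `Unix` writes (a sketch; `write_all` is a hypothetical helper):

```ocaml
(* Write the encoded suffix [off, length buf) of [buf] to [fd],
   looping because a single write may be partial. *)
let write_all (fd : Unix.file_descr) (buf : Bytes.t) (off : int) : unit =
  let rec go off len =
    if len > 0 then begin
      let n = Unix.write fd buf off len in
      go (off + n) (len - n)
    end
  in
  go off (Bytes.length buf - off)
```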
I found this very interesting document: https://perfetto.dev/docs/design-docs/protozero
There's in particular a mention of "scattered I/O", which certainly must help with nested messages:

> A key part of the ProtoZero design is supporting direct serialization on non-globally-contiguous sequences of contiguous memory regions.
With https://github.com/mransan/ocaml-protoc/pull/223 we now write back-to-front, in one go. Benchmarks show that, if one reuses the encoder, there are close to 0 allocations done in the common case.
For example on my i7-12700, here are a few benchmarks ("current" is the new master branch, "nested-bufs" is what we had until recently with cached nested buffers):
292 bytes value:
```
                                      Rate  basic-buffer  write-backward-c  nested-bufs  write-backward-noinline  write-backward  current
basic-buffer            1392231+- 84365/s            --          -18%          -18%                  -19%              -25%          -48%
write-backward-c        1693318+-  5067/s           22%            --         [-0%]                 [-1%]               -9%          -37%
nested-bufs             1697419+- 61896/s           22%          [0%]            --                 [-1%]               -9%          -37%
write-backward-noinline 1708411+-122597/s           23%          [1%]          [1%]                    --              [-8%]         -37%
write-backward          1864098+- 20754/s           34%           10%           10%                  [9%]                --          -31%
current                 2693451+- 13252/s           93%           59%           59%                   58%               44%            --
```

```
                        minor_allocs/iter  major_allocs/iter  promoted/iter
basic-buffer                      6.23 kB                0 B            2 B
current                             120 B                0 B            0 B
nested-bufs                       3.90 kB                0 B            0 B
write-backward                    3.90 kB                0 B            0 B
write-backward-noinline           5.87 kB                0 B            0 B
write-backward-c                  3.90 kB                0 B            0 B
```
3050 bytes value:
```
                                    Rate  basic-buffer  write-backward-c  nested-bufs  write-backward-noinline  write-backward  current
basic-buffer             112964+-  522/s            --          -27%          -27%                  -32%              -32%          -55%
write-backward-c         153812+- 9692/s           36%            --         [-1%]                 [-7%]             [-8%]          -38%
nested-bufs              155589+-  463/s           38%          [1%]            --                   -6%               -7%          -38%
write-backward-noinline  165114+-  631/s           46%          [7%]            6%                    --              [-1%]         -34%
write-backward           166775+- 6660/s           48%          [8%]            7%                  [1%]                --          -33%
current                  249981+- 6738/s          121%           63%           61%                   51%               50%            --
```

```
                        minor_allocs/iter  major_allocs/iter  promoted/iter
basic-buffer                     81.91 kB            4.13 kB          126 B
current                           1.24 kB                0 B            0 B
nested-bufs                      41.09 kB                0 B            8 B
write-backward                   41.09 kB                0 B            4 B
write-backward-noinline          61.87 kB                0 B            5 B
write-backward-c                 41.09 kB                0 B            4 B
```
The new encoder can write at > 700MB/s (on a very good desktop machine, granted).
I have a spare-time idea to improve encoding performance: encoding to a single buffer. This low-hanging fruit would significantly reduce allocations when dealing with trees of messages, by writing sequentially to a single `Buffer.t`.

In Jane Street's `Bin_prot`, each type has a function which predicts the binary encoding size for its current state. I think it would be quite doable to implement a kind of `…_payload_pb.size` function which `…_payload_pb.encode_…` could then use to know its own size and its nested messages' sizes, to avoid having to use temporary buffers. OCaml's memory representation of `string` and `bytes` already includes the size, which is thus very efficient to get, and it's trivial to predict the size of a Varint for field IDs, enums, ints and message sizes themselves. The main issue to watch out for there would be to take note of sizes collected during the process, to avoid calculating them more than once.

A nice-to-have addition to this could be to abstract reading and writing into a module signature instead of using `Buffer` directly. This would allow users to use other types of buffers or channels.