google / flatbuffers

FlatBuffers: Memory Efficient Serialization Library
https://flatbuffers.dev/
Apache License 2.0
23.28k stars 3.25k forks source link

[Request] Allow size and offset type to be parameterized #5471

Closed dleslie closed 2 years ago

dleslie commented 5 years ago

As of writing this issue it appears that FlatBuffers encodes most lengths, sizes and locations as Integers; there are some locations where Offsets are store as Shorts. This is a waste of bytes for many applications where the encoded data barely extends beyond the maximum value represented by a byte, let alone a 32-bit integer.

As a naive test, I took some time to replace FlatBuffers' numeric representations that use Int32 with Int16, and saved around 20 bytes on-average for every object encoded in our application. This amounts to approximately 5 to 20 percent of the total size, depending on what is being encoded.

If I may be so bold, I would suggest allowing all sizes and offsets to have their encoded type parameterized by flatc. Some applications can get away with only using byte to represent such values, whereas others may need a 32-bit integer.

aardappel commented 5 years ago

The reason it is hard-coded to 1 size is for speed: when you access these buffers, it is doing lots of quick offsets calculations behind the scenes. If there were lots of conditionals in there determining the size of the offset, or a varint say, that would significantly slow down access.

So any such changes would have to be at compile time, but would have the disadvantage of "forking" the format, a 16-bit FlatBuffers would need all tools and all code reading/writing to be in a agreement. We so far haven't done this, as it doesn't seem to be worth it.

The top 2 items here discuss a possible 64-bit FlatBuffers: https://github.com/google/flatbuffers/projects/10

Related: https://github.com/google/flatbuffers/issues/4240

dleslie commented 5 years ago

That's interesting; I did imagine it to be a compile time change, in the manner of n-byte FlatBuffers.

-------- Original Message -------- On Aug. 5, 2019, 11:56 a.m., Wouter van Oortmerssen wrote:

The reason it is hard-coded to 1 size is for speed: when you access these buffers, it is doing lots of quick offsets calculations behind the scenes. If there were lots of conditionals in there determining the size of the offset, or a varint say, that would significantly slow down access.

So any such changes would have to be at compile time, but would have the disadvantage of "forking" the format, a 16-bit FlatBuffers would need all tools and all code reading/writing to be in a agreement. We so far haven't done this, as it doesn't seem to be worth it.

The top 2 items here discuss a possible 64-bit FlatBuffers: https://github.com/google/flatbuffers/projects/10

Related: #4240

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.

llchan commented 4 years ago

Random idea: I think in the existing 32-bit-offsets world the root table is guaranteed to be at a 4-byte-aligned offset, which means the 2 lowest bits in the first byte are always zero in existing flatbuffers. If we wanted to do a 16-bit mode, we could potentially use the lowest bit to mark it as 16-bit mode, and mask that off when jumping to the root table (pretty cheap one-time mask per root access). That way a 16-bit mode reader could detect if it's being fed in a 32-bit message and fail fast. Newly-compiled 32-bit readers could also detect the converse. However, legacy 32-bit readers would not be protected from future 16-bit messages, but maybe that's a reasonable risk as long as the user is aware of it. As long as it's a conscious decision to opt into 16-bit mode I think it's reasonable to assume the developer has taken the necessary precautions to rebuild readers/writers.

This strategy doesnt help if we also want to do a 64-bit mode (or 8-bit mode), but I think 16-bit mode is the most useful of these. Someone who needs gigantic messages can add their own framing, but it's not currently possible to make flatbuffers more compact.

aardappel commented 4 years ago

@llchan I like the idea of using lower bits, since in addition to having new readers being able to detect it, the offset can maybe be set such that on older readers its an illegal buffer, e.g. it fails the verifier. Current users of FlatBuffers must already ensure that either they only consume buffers from a known (local) source, or that they use the verifier.

That said, if we're going to allow other bit-sizes, we should really also have a 64-bit strategy at the same time. I've actually been asked for that more often than 16-bit. The amount of people that would like to access in-memory data stores of some kind >2GB is not insignificant, it appears.

Possibly, 16-bit support could be done in the same way as 64-bit support is suggested in the above link: by allowing it to be specified per table/vector/field. That has the advantage the format is not being forked, and since schemas with new annotations need a recompile to take effect, it would potentially not have backwards compatible issues (if you'd declare it to be illegal on existing decls). It could also allow the same efficient situation as for 64-bit: you could have a buffer where all tables consist entirely out of 16-bit offsets, and only the root vector being 32-bit offsets, getting most of the data compression with none of the size limitation.

github-actions[bot] commented 4 years ago

This issue is stale because it has been open 6 months with no activity. Please comment or this will be closed in 14 days.

dleslie commented 4 years ago

This is still important to me. I'm maintaining a half-baked internal 8-bit offset fork, and would prefer to see this behaviour in mainline.

-------- Original Message -------- On Jun 12, 2020, 13:33, github-actions[bot] wrote:

This issue is stale because it has been open 6 months with no activity. Please comment or this will be closed in 14 days.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe.

github-actions[bot] commented 3 years ago

This issue is stale because it has been open 6 months with no activity. Please comment or this will be closed in 14 days.

dleslie commented 3 years ago

Still important to me; still maintaining a branch with 8-bit offsets, lengths and etc.

aardappel commented 3 years ago

@dleslie I don't expect 8-bit offsets to be added to the main repo any time soon. You could try making a PR with support for 8/16/64? bit offsets, and we could see how invasive that is. uoffset_t is used in a LOT of places though, so I can't imagine it being minimal, seeing how many things it interacts with.

github-actions[bot] commented 3 years ago

This issue is stale because it has been open 6 months with no activity. Please comment or this will be closed in 14 days.

dleslie commented 3 years ago

@aardappel I'm sitting on a branch that makes most of the necessary changes for 8-bit support, but it's in no way PR-worthy as it's not parameterized through CLI input.

It's working for me/us for now.

TYoungSL commented 3 years ago

Would an extension or attribute to support non-standard offsets sizes be a good idea, or is FlatBuffers2 going to pop up soon?

Rjvs commented 3 years ago

@dleslie can you make that repo available at least?

aardappel commented 3 years ago

"FlatBuffers 2" is merely a collection of ideas, not an actual project, as the issue about it hopefully makes clear.

64-bit FlatBuffers could be supported though, independent from the other features. I'd like that to go somewhere along the lines of https://github.com/google/flatbuffers/projects/10#card-14545298 and should probably be discussed in its own issue if anyone wants to take it on.

Not sure about 8-bit, that sounds terribly niche, we'd first have to convince ourselves that 16-bit is even worth the churn/testing.

TYoungSL commented 3 years ago

"FlatBuffers 2" is merely a collection of ideas, not an actual project, as the issue about it hopefully makes clear.

64-bit FlatBuffers could be supported though, independent from the other features. I'd like that to go somewhere along the lines of https://github.com/google/flatbuffers/projects/10#card-14545298 and should probably be discussed in its own issue if anyone wants to take it on.

Not sure about 8-bit, that sounds terribly niche, we'd first have to convince ourselves that 16-bit is even worth the churn/testing.

@Rjvs and I are interested in 64-bit FlatBuffers, we're kicking around the idea of either an unofficial fork or extension.

jamescourtney commented 3 years ago

Here's a quick stab at a proposal, expanding on @llchan's idea above.

The gist of it is that we use the high order bit of the initial uoffset_t to indicate the presence of metadata flags that follow. These flags describe the physical structure of the buffer and are static in both the buffer and the generated parser. This way, the parser can quickly tell if it supports the given buffer. Further, keeping the generated code free of unnecessary conditional statements for things like decoding uoffset_t will help performance (Always/Never-taken branches are cheap, but they still do take up a slot in the branch predictor and have overhead.).

In detail...

[ uoffset_t, leading bit set], Positions 0 - 3
[ flags ], Positions 4 - 5
[ file ID, optional ], Positions 6 - 9
[ soffset_t of root table ], Variable (depending on flags)

There are some interesting things we can do with those flags, especially if you want to gradually start adding some of the FlatBuffers2 features:

// Mutually exclusive.
Flag_16BitOffset = 0x00,
Flag_32BitOffset = 0x01,
Flag_64BitOffset = 0x02,

// Indicates that data is not padded for alignment (not necessarily part of this work, just an example of potential flags).
Flag_UnalignedData = 0x04,

 // reserved for future addition of more flags if needed.
Flag_MoreFlags = 0x80,

As far as how to set these flags, I can see two ways:

TYoungSL commented 3 years ago

How would the #define interoperate with the rest of the included schema?

Do models defined have offset width specified per file, such that a nested table would have to be reflected as though it were with 64-bit offsets when it is declared in a file without the #define, would they nest with special behavior, or would it apply globally? Declaring them inside of the models (tables, structs) would make more sense if there is some sort of arbitrary dichotomy between file scopes; would be better to be different per model, no more complex.

Looking at how it's written, a compiler flag would be far easier to implement.

We forked FlatBuffers to make BigBuffers to just move directly to 64-bit offset size, keeping 16-bit vtable offsets and the fbs schema.

It's incomplete, binary schema generation has issues, but flatc generation works for C++ and C#. It might work for some other languages. I am not comfortable with the pattern of the unit test implementation in the project, and there's lots of languages to adapt. Lots of cruft for a single repository. There are a lot of test cases with magic numbers for offsets.

Took a stab at making it per model with a new IDL language token (first as an attribute, then as a intentionally-not-backwards-compatible 'directive' of sorts); remove global definition of uoffset_t, soffset_t, add an offset_size field to StructDef and TableDef, in some places typedef typename std::make_unsigned(offset_t)::type uoffset_t and likewise for soffset_t, switches around offset_size (cases for 1, 2, 4, 8, default throws std::exception) to select offset_t templating from std::uintN_t offset types, make use of C++11's templated type aliasing and template Offset<T> as Offset<T,offset_t>, template<T> using Offset = Offset<T,offset_t>... good many changes to the generated C++ too. Reflection gets split up by template specialization, needs some work.

Mixed offset width is indeed a hairy beast. So much cruft. Seems like a major undertaking.

The recursive dependency on the generated flatbuffers models makes it trickier than it should be.

Just moving to 64-bit offsets seems like the easier option for the short term.

aardappel commented 3 years ago

I see that changing the offset type per buffer (effectively forking the format) is a lot easier, but I think long-term FlatBuffers would be better off by supporting 64-bit (and maybe 16) offsets natively as part of a default-32-bit format. The advantages of having a single format are pretty big. So I'd like to pursue that first, and only resort to format forks if we have evidence that it is impractical.

TYoungSL commented 3 years ago

When do you think work would start on that?

How do you propose supporting 64-bit with default-32-bit should proceed?

Is the task just up for grabs?

aardappel commented 3 years ago

@TYoungSL

When someone decides to work on it.

Like this: https://github.com/google/flatbuffers/projects/10#card-14545298 though we could discuss it in more detail.

Yes.

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 6 months with no activity. Please comment or this will be closed in 14 days.