Possible design for 64-bit sized buffer support in FlatBuffers

aardappel commented 2 years ago

"2GB should be enough for anybody!" I said some 9 years ago when first coming up with FlatBuffers. Of course, even back then OSes were already 64-bit and machines had memory for bigger buffers than that, but because of the nature of FlatBuffers (largely immutable data written linearly) it seemed that if people would store >2GB of FlatBuffer data, that would be most efficient in multiple files (packets / buffers / ...) anyway. And of course 32-bit offset made the format a whole lot more compact.

Turns out, people nowadays have actual legit use-cases for storing more than 2GB in a single consecutive FlatBuffer:

A typical use case is some form of in-memory streaming of objects / records / logs coming in over the network, which get written into a FlatBuffer. While each record is typically <2GB, it is kind of artificial to have partition this log of them at 2GB boundaries, when many server machines could accumulate much more than that before running out of memory.
Some people have binary data coming from outside of FlatBuffers that they'd like to store inside of a FlatBuffer vector, and 2GB is sometimes restrictive there too.

Now, what would seem simple is to simply fork the entirety of FlatBuffers and have new files where every offset is now 64-bit. Given the amount of language implementations that we have, and the amount of code locations that would need touching to become 32/64 agnostic this would actually be a huuuge amount of work, and unrealistic to expect to be able to implement with volunteer maintainers. Also, a fork is not desirable, for many use case a buffer that may contain 64-bit sized items but is still largely compatible with the 32-bit format is much nicer.

So, here's a possible design that aims to enable the above use cases while causing the absolute least amount of churn for FlatBuffers implementations, and making support as easy as possible. This may come at some cost of generality, but since 64-bit buffers are still a bit niche I think this is the right tradeoff.

The idea is to only have 1 new "type variant": 64-bit sized scalar vectors.

Such a vector is designated by using the vector64 attribute that may only be used on table fields. That table field will be implemented using a 64-bit offset to a vector with a 64-bit size field.

The entire rest of FlatBuffers will stay 32-bit as before.

In terms of serialization, think of any buffer having a head and a tail. The head is the traditional FlatBuffer, the tail (what you serialize first) may ONLY consist of 64-bit vectors, thru FlatBufferBuilder::CreateVector64 or similar. The builder can easily keep track that the tail is only such vectors (and assert if not), and that all the rest of the data (the head) still fits in 2GB (so all traditional offsets work). The offsets created in the tail may only be stored in vector64 fields in the head.

Now you may wonder how this helps with the first use case, where each record may have a variety of data in it (nested tables/strings..). The idea is that these will be stored as a nested_flatbuffer. This solves several problems, first it means that this record is forced to be self-contained, meaning it doesn't share strings or anything with the main buffer, which would now also need to become 64-bit, complicating the design significantly. It also allows us to restrict the supported 64-bit types to just scalar vectors.

As an example:

table Root {
  my_records:[SubWrap];
  my_gigantic_buffer:[ubyte] (vector64);
  my_normal_vector:[float];
}

table SubWrap {
  rec:[ubyte] (nested_flatbuffer: Sub, vector64);
}

table Sub {
  // may contain absolutely anything.
}

So here, my_records indicates the first use case, and needs a table to wrap the vector. That doesn't seem ideal, but since we're typically referring to large vectors, this overhead should be minimal. my_gigantic_buffer is the 2nd use case, where we simply want to store a binary blob >2GB.

To serialize, you'd first create vectors for each of these [ubyte], followed by a normal buffer storing all the references to it. Fields like my_normal_vector would need their vector to be serialized after the 64-bit vectors.

In terms of forwards/backwards compatibility, we'd modify the schema parser such that initially occurrences of vector64 errors for any language backend that doesn't have support for it, much like we've done for new union related features in the past. Since using vector64 means adding a new field, we can thus never have the situation where an old implementation accidentally tries to read a 64-bit vector offset as 32-bit.

Then, language can incrementally add support for these new 64-bit vectors, which should be relatively straightforward, e.g. for C++ this CreateVector64 function does most of the work of tracking the use of these vectors, would return a new offset type incompatible with 32-bit offsets, and serialization functions for these new fields would only accept that offset type. Similarly, when reading, we will now have a Vector64 type separate from Vector for access.

A language that can't spend the time for a full implementation but wants to be compatible with schemas that use vector64 could simply add a quick check for the vector64 attribute and then omit accessors for that field, and would thus generate code that can still access the rest of that buffer (or simply buffers that don't happens to actually store any 64-bit vectors).

Original notes on 64-bitness: https://github.com/google/flatbuffers/projects/10
Discussion in "FlatBuffers2" binary format: https://github.com/google/flatbuffers/issues/5875

WDYT?

@rw @mikkelfj @vglavnyy @mzaks @mustiikhalil @dnfield @dbaileychess @lu-wang-g @stewartmiles @alexames @paulovap @AustinSchuh @maxburke @svenk177 @jean-airoldie @krojew @iceb0y @evanw @KageKirin @llchan @schoetbi @evolutional

CasperN commented 2 years ago

This would help with #7536, but I think it seems a little "special cased" for a language feature with changes to the schema language and everything. I think it's better to provide this as a library, we can provide some BigFlatbuffer<Root, Sub> type with its own builder, and reader classes. Doing this as a library would also decrease the maintenance and implementation burden.

aardappel commented 2 years ago

@CasperN I am not sure how you'd "do it as a library", given that this needs features in the schema parser, the code generators, the runtime code, etc. It be possible to do it in the runtime only but it be very unergonomic, and work differently in every language.

And yes that linked issue is pretty much my use case #1

CasperN commented 2 years ago

I am not sure how you'd "do it as a library"

So the binary format would be

[size prefixed table][length and vector of 64bit offsets][many tables]

The first size prefixed table on the left would correspond to Root (but without the my_gigantic_buffer field) while the many tables at the end correspont to Sub. This is almost the same as the binary format that you're proposing. The difference is that the big vector is not "inside" of the root table.

Without knowing the types, a BigFlatbufferReader can find the root table, read the vector of offsets, and find each sub table. Though we should parameterize it with Root and Sub to offer typed access to the root and sub tables. This does not require changes to code-gen or the schema language.

Making a BigFlatbufferWriter is tricker and would require modifications to flatbuffer builders, but similar modifications are required in the builders no matter how we choose to implement this.

CasperN commented 2 years ago

Basically I'm suggesting adding a new kind of implicit "root", there's

the normal root (offset to the first table)
size prefixed root
size prefixed root but with 64bit vector to more tables afterwards

AustinSchuh commented 2 years ago

Which brings up the question, we use size prefixed flatbuffers quite a bit. How would those work with your proposal?

Since 2 GB means that the highest bit is available, maybe use that to signal that this is a 64bit flatbuffer inside the size prefixed flatbuffer?

aardappel commented 2 years ago

@CasperN vector of 64bit offsets with many tables afterwards was essentially my original suggestion in the linked "projects" page. But that design has many problems, in particular around avoiding sharing of vtables, and any nested/shared objects, which would generate more 64-bit offsets needed inside the many tables. It also makes it harder to force ordering in the buffer. To that end, I simplified the design to only support nested_flatbuffer which eliminates all those problems in one go.

I also don't understand the relevance of size prefixed.

@AustinSchuh yes, sadly, the buffer as a whole would need a new kind of "64-bit size prefixed".. it may be good to set the high bit, but users were already required to statically differentiate between size prefixed and not, so this new type is no different. And for the nested buffers there is no problem, since they were always sized anyway, and will now be automatically 64-bit sized.

CasperN commented 2 years ago

I agree that the many tables should be independent (no sharing between elements of the big vector) to avoid those problems. The relevance of size-prefixed is that it tells you when the “root” flatbuffer ends and when the table of offsets begin. The “size” refers to the size of the table of type Root. The byte after that is the start of the vector of offsets.

The only difference between my suggestion and the original post is the observation that if the big vector is “top level” and not inside of the table, Root, then this can be implemented with even less effort.

Actually the simpler version is to not distinguish between the root table and the sub tables. Consider the following format:

• the first 8 bytes is the uint64 length in bytes of everything • Second 8 bytes are the unit64 number of elements in the vector • Then the uint64 offsets to the nested flatbuffers • Then the nested flatbuffers

Applications can choose to make the 0th element a special “Root” table and all other elements a “Sub” table On Sep 19, 2022, 18:49 -0400, Wouter van Oortmerssen @.***>, wrote:

@CasperN vector of 64bit offsets with many tables afterwards was essentially my original suggestion in the linked "projects" page. But that design has many problems, in particular around avoiding sharing of vtables, and any nested/shared objects, which would generate more 64-bit offsets needed inside the many tables. It also makes it harder to force ordering in the buffer. To that end, I simplified the design to only support nested_flatbuffer which eliminates all those problems in one go. I also don't understand the relevance of size prefixed. @AustinSchuh yes, sadly, the buffer as a whole would need a new kind of "64-bit size prefixed".. it may be good to set the high bit, but users were already required to statically differentiate between size prefixed and not, so this new type is no different. And for the nested buffers there is no problem, since they were always sized anyway, and will now be automatically 64-bit sized. — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: @.***>

BrianHarris commented 2 years ago

We needed >2gb support and ended up doing something very similar. Our format was: [4 byte file identifier] [number of offsets] [vector of 64 bit offsets] [regular flatbuffer]

The regular flatbuffer had normal int fields with an index into the offset table.

We used the file identifier to detect if this was a plain flatbuffer or our custom format.

aardappel commented 2 years ago

@CasperN yes, but this makes it into a very "custom" format.. my original design tries to keep things as similar to existing FlatBuffers as possible, and gives some flexibility as to where references to these big objects/vectors are stored (doesn't have to be a in a single top level vector). That little flexibility comes at almost no cost.

@BrianHarris cool, yes, that is essentially a container format (PAK file? :) around smaller FlatBuffers. It differs from what I am proposing that I want to use regular FlatBuffers for the "directory" part (and also allow larger than 2GB constituent parts).

CasperN commented 2 years ago

but this makes it into a very "custom" format.. my original design tries to keep things as similar to existing FlatBuffers as possible, and gives some flexibility as to where references to these big objects/vectors are stored (doesn't have to be a in a single top level vector). That little flexibility comes at almost no cost.

I actually disagree whether the cost/flexibility trade off is worth it. I think the only flexibility benefit is the ability to access the big vector via the root table (or one of the tables you can access via the root table). It's also nice that there isn't a proliferation of "root access types" (we'd have get_root<T>, get_size_prefixed_root<T>, and now get_big_vector<T>). However those ergonomic benefits are imo not worth the cost of multi-language code generation because it's equivalent to @BrianHarris's solution with a little sugar to how the big vector is accessed.

I think the cost would be worth it if this feature comes with a roadmap for more 64bit flatbuffers features: e.g. any table can have a 64bit offset to a big vector, nested flatbuffer, string, or a big string. In that context, I can absolutely see why the codegen must be aware of 64bit offsets and we'd want a uniform design for the different use cases. However, we're not going for such a large project.

BrianHarris commented 2 years ago

I agree with Casper that this can, and probably should, be done as a library/documentation/sample code. At least until "FlatBuffers2"

One of the problems with the approach I took is writing out the data requires knowing about all the data blobs ahead of time, so we can pre-allocate the vector at the beginning of the buffer. It also requires us going back and filling in the offsets once we know the final size of the flatbuffer. This wasn't a problem for us because we never streamed this data out while writing it, so seeking on disk wasn't a concern. A nice feature of flatbuffers though is it's written depth-first, so you never have to seek backwards when writing, nor do you have to scan your source data to pre-calculate anything. Moving the vector to the end of the buffer would solve this problem though: [flatbuffer] [large data] [vector of 64 bit offsets] [number of offsets]

You could find the root table by looking at the first offset (or if "number of offsets" is 0 then just looking immediately before it).

yeswalrus commented 2 years ago

Just chiming in - I'd love to see this implemented. At 2 different companies now I've seen custom hacks added specifically to support >4gb binary blobs contained within flatbuffers streams. One just wrote a metadata header that contained the size of the blob & wrote it directly to the stream after the flatbuffers message, leaving it to the streaming code to do the right thing. The other forked flatbuffers and made it possible to enable 64bit offsets on a per-schema basis, which is wasteful, but functional.

mikkelfj commented 2 years ago

I think 64 bit should just be the same format compiled with uoffset_t and soffset_t as 64 bit offsets. It relatively trivial to do, offers complete freedom for the end user, and the file format would at most be twice as much as you need to, which probably is of less concern at those scales (although it will affect cache locality and thus performance).

Starting to add custom partially large data structures quickly gets complex both for the library and the end user. I have looked into a number of ways of doing it and never really found anything satisfying. It is similar to the jump from 8/16 bit memory banks, then x86 segmented 16 bit memory then 32-bit memory, then 64 bit memory. We never had segmented 32 bit memory but jumped straight to 64 bits (albeit CPU's don't actually address the full 64-bit range on the internal bus).

Should we add an option for 64 bit, there would be need to be a format identifier bit and I would not expect the same generated code to work with both 32- and 64-bit formats. In fact, flatcc has prepared namespaces to generate code for different formats.

I also suggest looking into StreamBuffers at the same time so at least readers are guaranteed to be able to read front to back generated buffers. This is a bigger change, but more important for larger buffers. It means you can write and transport buffers without keeping all in memory.

https://github.com/dvidelabs/flatcc/blob/master/doc/binary-format.md#streambuffers

mikkelfj commented 2 years ago

Also, I can add that I do work with a large format read only database in FlatBuffers where the index is generated in a separate FlatBuffer linking to a number of on-disk buffers each limited to 2GB and mapped into memory as needed. This works fine, and I'm not sure a single 64-bit buffer would easily replace it due to file size, cache, memory and transport concerns, but it also isn't something you build in one day.

CasperN commented 2 years ago

I think 64 bit should just be the same format compiled with uoffset_t and soffset_t as 64 bit offsets. It relatively trivial to do,

I don't think we've invested into testing this very thoroughly, unfortunately. As @aardappel noted, "Given the amount of language implementations that we have, and the amount of code locations that would need touching to become 32/64 agnostic this would actually be a huuuge amount of work, and unrealistic to expect to be able to implement"

One of the problems with the approach I took is writing out the data requires knowing about all the data blobs ahead of time, so we can pre-allocate the vector at the beginning of the buffer. It also requires us going back and filling in the offsets once we know the final size of the flatbuffer.

If we're writing logs to disk in a flatbuffer format, it'd be nice if we wrote things front-to-back so we can end the file with the vector of offsets and metadata. This probably can't be integrated in the normal flatbuffer builder though, and probably would be bad for streaming settings.

Some people have binary data coming from outside of FlatBuffers that they'd like to store inside of a FlatBuffer vector, and 2GB is sometimes restrictive there too.

@aardappel I just thought about this one. How do you expect having a large flatbuffer vector to help with this? Presumably one of the big vector's elements would be a >2GB blob of external-to-flatbuffers data?

Also, I just realized in @aardappel's original proposal, there may be more than one 64bit vector, so long as all of them are at the end. That is indeed different from the solution I and @BrianHarris were suggesting which only supports one large vector.

I think we need more consensus on canonical use cases before we can shrink the design space...

aardappel commented 2 years ago

@CasperN

if we wrote things front-to-back

Yup, totally see that.. but a difficult problem to solve and likely independent of 64-bit support.

Presumably one of the big vector's elements would be a >2GB blob of external-to-flatbuffers data?

One 64-bit vector would be such a blob (as opposed to an element), see my_gigantic_buffer in the original post.

there may be more than one 64bit vector, so long as all of them are at the end

Yes, I think that is crucial for some level of flexibility in what kind of things can be implemented with this.

mzaks commented 1 year ago

I have a suggestion for a more aggressive, but IMHO also more valuable change. Replace offsets from being i32 to a special type of LEB. This special type of LEB encodes the size of the number in the first two bits of the first/leading byte. Where:

0 -> 1byte
1 -> 2bytes
2 -> 4 bytes
3 -> 8 bytes

The decoding logic looks something like following:

let b = reader.readByte()
let byteLenght = b & 3
if byteLenght == 0 {
  offset = b >> 2
} else if byteLenght == 1 {
  offset = reader.readShort() >> 2
} else if byteLenght == 2 {
  offset = reader.readInt() >> 2
} else {
  offset = reader.readLong() >> 2
}

This LEB type let us work with padding and memory alignment, as the values are always 1, 2, 4, or 8 bytes long.

The length of buffers and strings could also be encoded with such LEB.

Same for vector of scalar values.

For vector of offsets, it would be better to use the same pattern, but shift it by another 2 bits. First two bits encode the byte size of the vector values, next two bits encode the byte size of the vector length. The offsets in the vector I stored as u8/u16/u32/u64, based on the max value in the vector.

A flag in the schema could identify if the new LEB values are used throughout the buffer and not only in some special places. This change will reduce buffer size for small buffer and will make almost infinite size buffers (62 bit pointers length and 60 bit vector size) possible.

mikkelfj commented 1 year ago

Not saying yes or no, but if you want to do LEB, you might want to follow QUIC https://www.rfc-editor.org/rfc/rfc9000.html#integer-encoding I think it is roughly as you suggest, but endian concerns might have to be addressed.

mzaks commented 1 year ago

Not saying yes or no, but if you want to do LEB, you might want to follow QUIC https://www.rfc-editor.org/rfc/rfc9000.html#integer-encoding I think it is roughly as you suggest, but endian concerns might have to be addressed.

Ah right, QUIC was my initial inspiration. I think in case of FlatBuffers where we use little endian representation explicitly, my suggestion is a bit more ergonomic. As we consume the bytes left to right, the byte size encoding should be in the first (less significant) byte. And encoding the byte size in the left side of the bit pattern of the first byte, will make the reading part of the value more cumbersome. Please let me know if I am mistaken.

aardappel commented 1 year ago

I'm a big fan of variable length encodings in general, but the problem is that pretty much all such formats, no matter how clever, as still significantly slower than a simple 32/64-bit read, to the point that it would not be suitable for FlatBuffers.

It be great for a FlatBuffers-like format that was optimized for size but still random access, but not for all offsets in FlatBuffers.

FlexBuffers of course has them, and their I tried to mitigate some of the inefficiency of decoding the payload size in every value by only doing it once for every vector, but even there the problem is that there is just no fast CPU instructions for reading a variable length integer with a dynamic size, beyond movsb on some platforms: https://github.com/google/flatbuffers/blob/master/include/flatbuffers/flexbuffers.h#L164 Would need some benchmarking to see if that can be made fast on all platforms (does ARM have something similar?).

Even the slow conditional variable size offset decoding is appropriate for FlexBuffers since it is a slower to access format anyway (field access goes through a binary search!), but not so much for FlatBuffers.

On top of that of course is that the motivation for this issue (unlike the "FlatBuffers2" thread) is to extend the existing format in a backwards compatible way, which this wouldn't do.

mzaks commented 1 year ago

Actually the reading of a variable length offset can be quite efficient as we have a simplified problem on our hands.

The runtime value can be UInt64, the buffer is aligned and padded, so decoding of the value can look something like this:

let masks = [0xff, 0xffff, 0xffffffff, 0xffffffffffffffff]
let offset = reader.readUInt64()
return (offset & masks[offset & 3]) >> 2

I also think that it would be possible to introduce the feature in a backwards compatible way. Forward compatibility is a different story though.

mikkelfj commented 1 year ago

I have tested variable length encodings in various contexts, including a variant of robin hood hashing where an offset needed to be stored. In general the performance is rather poor if it is in a critical loop.

mzaks commented 1 year ago

@mikkelfj most variable length encodings rely on loops and conditions for decoding. My proposal is branch less. It is still definitely slower than just direct four byte read, but as performance penalties go, it should be not too bad. Having smaller length and offset values will produce smaller buffer and increase data locality, which in term might speed up overall performance. That sad, it is all theoretical until someone actually implements and profiles it.

mikkelfj commented 1 year ago

but as performance penalties go, it should be not too bad.

All I say is test it. I was surprised. I scrapped an entire optimized algorithm design and implementation because it wouldn't fly. It goes from sub-ns to 10 ns or so depending.

I another test I found that nested branching is much faster than lookup if you have a choice.

mzaks commented 1 year ago

Totally agree regarding testing.

And I am also not so sure regarding lookup performance.

That sad, the mask can also be computed purely arithmetically:

mask = 0xffffffffffffffff >> ((8 - (1 << (offset & 3))) << 3)

So one can chose between branching, lookup and arithmetics.

mikkelfj commented 1 year ago

Note that I also have concerns regarding compatibility and implementation effort, but I'm focusing on the technical / performance aspect in the above, given that this is what is primarily differentiating flatbuffers from msgpack and protocol buffers.

mikkelfj commented 1 year ago

There is also a technical concern regarding unaligned access, and reading "too much ahead" which could be necessary for performance. In other contexts, when parsing, I often add some zero bytes at the end of the buffer for reading ahead, but this isn't really practical here.

In terms of encoding, I followed and participated in QUIC v1 design and the change to using a single uniform VLE representation greatly simplified many aspects compared to various bit flags in many header fields that were in the early design. But here you also have to carefully deal with buffer overrun and unaligned access.

mikkelfj commented 1 year ago

Conditional assignment and arithmetic would usually beat any alternative unless the overhead is excessive. Sequential dependency chains like a = 1 | x; b = a >> 1 | 1; c = b >> 1 | 1; ... (just random pointless logic) are typically faster than alternatives. This is used in integer log2 algorithms.

C does not have (branchless) conditional assignment, but a = x ? 1 : a; will typically optimize to a cmov like instruction.

dbaileychess commented 1 year ago

I would like to prioritize this to be the top feature to work on for 2023.

I think the ideal end state would be what @mikkelfj suggests: uoffset_t and soffset_t as 64 bit offsets, with additional types for strings/vectors size parameters to support large sizes as well. This would be the most flatbuffer-like implementation and mostly agnostic to the end user (since these are internal details). We could even explore the variable length encoding with this as @mzaks brings up, if we are concerned with buffer-bloat due to unused bits in oversized offsets. Though, I see variable encoding as a potential optimization that isn't strictly needed. This would involve the most changes to everything and would take a lot of work to do correctly.

However, that being said, @aardappel OP design seems straightforward and does avoid many complications with the nested_flatbuffers. Though this approach does involve collusion of the end user to be explicit in defining their schema for larger buffer support, as well as serializing the data in the correct order. I do like the option that non-supported languages can simply skip these 64-bit aware fields and still provide some level of compatibility.

Lastly, we have @CasperN and @BrianHarris idea of implementing this as a library, with some sort of meta-format file that manages multiple independent flatbuffers. This might have the least amount of changes necessary, but also in the least flexible and most rigid of the three proposals.

Things to think about that I am just laying out.

Every vtable has a uint16_t "size of the referring table" field that is unused by all implementations. Can we use this to encode some information we can use to dynamic switch from 32/64-bit mode?
Do we intend to make Tables support offsets larger than 2^16-1 bits? Currently the vtable uses VOffset16 to indicate the field offset relative to the table start offset.
The first 4 bytes of all flatbuffers is the UOffset32 to the root table, however, it is impossible to have this point to an offset at 0xFFFFFFFF, since Tables require their first 4 bytes be the offset to the vtable (SOffset32). That leaves the two upper bits as always set to 0 (for valid, non-malicious buffers). Can we use these bits as a way to version the flatbuffer to switch between 32/64-bit modes?

mikkelfj commented 1 year ago

The first 4 bytes of all flatbuffers is the UOffset32 to the root table, however, it is impossible to have this point to an offset at 0xFFFFFFFF, since Tables require their first 4 bytes be the offset to the vtable (SOffset32). That leaves the two upper bits as always set to 0 (for valid, non-malicious buffers). Can we use these bits as a way to version the flatbuffer to switch between 32/64-bit modes?

I think we could use this as an escape, but not for deciding the final format. I think we need an extra field for that. Also, the ambiguity regarding presence or absence of file identifiers is a bit of a nuisance. So I suggest we make that more explicit. With the extended format identifier we could have a format for the classical FB 32 bit format which would also be the default without a format identifier. Also, this would be a good place to enable StreamBuffers if they ever are to exist.

EDIT: the optional size prefix is also a point of confusion.

This leads to:

Do we intend to make Tables support offsets larger than 2^16-1 bits? Currently the vtable uses VOffset16 to indicate the field offset relative to the table start offset.

I think this shouldn't be a big problem, but large vtables could be a problem, also for CPU cache. However, if we have a format identifier, we could sneak in support for this if we really need to. I.e. I think this is an orthogonal problem we do not need to address at the same time.

Every vtable has a uint16_t "size of the referring table" field that is unused by all implementations. Can we use this to encode some information we can use to dynamic switch from 32/64-bit mode?

I have a long running idea of a) extending flatbuffers with Mix-In inheritance. There might be some optimizations that could use that field. b) Proper NULL values could be encoded with an optional bitmap that could be encoded there. In general such options (I know that proposal isn't popular, but it is an example). In general we could have flags to encode certain encoding extensions. It could also be used for dynamically extending the representation as you suggest. In either case I'm not really in favor of conflating this with 64-bit global format - both due to the complexity of it, and for performance reasons in the common case. But I do like the idea of reserving this field for flags of some sort. Again, this could be enabled via my proposed format identifier such the vtable field is only special in special FB formats.

Finally, it is worth noting that the Arrow format also exists. It uses FB for metadata but a Google Drill like mix of row/column data that is better suitable for more homogenous (but potentially sparse) data in large volumes.

aardappel commented 1 year ago

@dbaileychess I think first thing to decide is how much engineering time you and any collaborators would have for this. If that is a lot, then there are options. I just think routing support for possible 64-bit offsets thru all APIs of all languages is a tremendous engineering exercise that I personally wouldn't want to bet on.

My suggested approach marries low engineering effort with enough escape hatches for people to use the 64-bit space. Escape hatches are ugly, but remember this is a niche requirement, 99.9% of FB users really don't need more than 32-bits.

1) most implementations will write this field a 0 and don't read it, but some may write an actual size? How do any of the verifiers handle it? Anyway, I'd recommend against allocating bits in the format, first since you'd not want extra branches (even if they're cheap in code/execution, it all matters and hurts inlinability), I'd generally recommend a schema driven design.

2) vtable size seems even less pressing than 64-bitness for the whole buffer. And if you'd have a design that allows 32-bit vtables, you'd probably want in-line offsets (i.e. 1 less offset jump for strings/vectors/tables). And then you want inline int/float.. and before you know it, you have a completely different design :)

3) much like 1), really would prefer new schema feature over bit-hacks.

mikkelfj commented 1 year ago

I'd recommend against allocating bits in the format, first since you'd not want extra branches (even if they're cheap in code/execution, it all matters and hurts inlinability)

I would tend to agree, but I think could be solved by just detecting if you have a supported format or not. Branching after opening the file / buffer would not be good in general.

dbaileychess commented 1 year ago

most implementations will write this field a 0 and don't read it, but some may write an actual size? How do any of the verifiers handle it? Anyway, I'd recommend against allocating bits in the format, first since you'd not want extra branches (even if they're cheap in code/execution, it all matters and hurts inlinability), I'd generally recommend a schema driven design.

Yeah, i haven't audited the path to see if it is ever used. Was just a suggestion to throw in the wind :)

vtable size seems even less pressing than 64-bitness for the whole buffer. And if you'd have a design that allows 32-bit vtables, you'd probably want in-line offsets (i.e. 1 less offset jump for strings/vectors/tables). And then you want inline int/float.. and before you know it, you have a completely different design :)

Agreed, I don't think it is important to expand the size of vtables, I just wanted to bring it up if we had something that could tackle dynamic switching of offset sizes, as it could possible be applied to vtables as well.

much like 1), really would prefer new schema feature over bit-hacks.

Agreed.

My suggested approach marries low engineering effort with enough escape hatches for people to use the 64-bit space. Escape hatches are ugly, but remember this is a niche requirement, 99.9% of FB users really don't need more than 32-bits.

Yep, I think for the cost-benefit tradeoff, your or the library approach is the best compared to the "native" 64-bit support.

machineko commented 1 year ago

Are there any progress on implementing 64-bit size in the 2023 or is this idea still in design phase?

aardappel commented 1 year ago

This is actually in progress: https://github.com/google/flatbuffers/tree/flatbuffers-64

machineko commented 1 year ago

Nice I can only hope it'll make master soon as it would be great utility for saving bigger files 🏅

aardappel commented 1 year ago

@machineko details of your usecase?

machineko commented 1 year ago

Deep learning models serialization and transfer around the distributed system (sometimes with weights >20GB which right now is using json and <2GB using flatbuffers)

norakh commented 1 year ago

Just wanted to say big thank you to all who are working on supporting 64 bit flatbuffers! I was caught up in different projects and am now returning to the project from #7536 (hopefully I can make it public in the next 2-3 weeks!), where I already thought of design changes for my library to partition the flatbuffers into < 2 GB chunks.

However, I will try out https://github.com/google/flatbuffers/tree/flatbuffers-64 first for sure, since partitioning would complicate things for future users of my library a lot. I wanted to expose as little as possible from flatbuffers, since my library will probably be used by people with no knowledge on serialization and "low-level things". To give you a rough hint, the most popular file format to describe programs in my area is this one here: https://homes.esat.kuleuven.be/~nsmart/MPC/. I definitely want to change this :D

dbaileychess commented 1 year ago

I would like to open a invitation for people to start trying out the 64-bit work in the https://github.com/google/flatbuffers/tree/flatbuffers-64 feature branch.

There is still stuff for me to do in it, but at this time I feel like most of the features are implemented and tested to a degree. I've only tested with clang and gcc on linux, I believe the mac and windows errors are just about narrowing of types that I will address.

I don't have docs/tutorial yet, but I hope the tests and schema are straightforward to get the gist.

Only works for C++ generators at the moment.
You are require to use a different builder, but it has the same semantics.

FlatBufferBuilder64 builder;

You're required to serialize ALL the 64-bit offset field first, before adding any normal 32-bit fields.
Only supported for strings and vectors, tables are not.

dbaileychess commented 1 year ago

The main difference in the serialized buffers is the presence of 64-bit offsets (example) which were always 32-bit before.

Additionally, there is a new vector type (vector64) that uses a uint64_t sized length field (example) instead of a uint32_t (example)

dbaileychess commented 1 year ago

Everything is Green on that branch for now, and I'll try my best to keep it that way until we merge it in.

jblazquez commented 1 month ago

@dbaileychess, thanks so much for this work! I find myself in need of 64-bit offsets, but I'm running into the current limitation of vectors of tables not supporting the offset64 attribute. I found this commit of yours that appears to be lifting that restriction, but it doesn't seem like it was ever merged.

I was wondering if support for faraway vectors of tables is being considered for the future, or whether there is a different option currently that would achieve something similar?

dbaileychess commented 1 month ago

I would have to double check, but I thought I put everything in and didn't have any abandon work.

jblazquez commented 1 month ago

Looking at the commit again, I think I misunderstood: it adds support for 64-bit vectors of structs, but it seems 64-bit vectors of tables were never supported.

I believe I've been able to update my schemas to support larger buffers, taking advantage of nested_flatbuffer to split up the buffer into multiple sub-buffers, each of which is limited to 2GB but overall giving me some headroom for now. I haven't fully tested it but the schema compiler is happy with it.

I basically went from something like this:

table Root {
    foos:[Foo];
    bars:[Bar];
}

table Foo {
    x:int;
    y:int;
    z:string;
}

table Bar {
    a:int;
    b:string;
}

To this:

table Root {
    foos:[ubyte] (offset64, nested_flatbuffer: "Foos");
    bars:[ubyte] (offset64, nested_flatbuffer: "Bars");
}

table Foos {
    foos:[Foo];
}

table Foo {
    x:int;
    y:int;
    z:string (offset64);
}

table Bars {
    bars:[Bar];
}

table Bar {
    a:int;
    b:string (offset64);
}

I still can't have a massive number of foos or bars, but by moving each section of the overall buffer to its own nested buffer, and by serializing offset64 strings and vectors first so they end up at the end of the buffer, I can fit many more tables than I could before I overflow the 32-bit offsets.

mikkelfj commented 1 month ago

So this is why I argued it would be better to just have a pure 64-bit version of FlatBuffers.

google / flatbuffers

Possible design for 64-bit sized buffer support in FlatBuffers #7537