jamescourtney / FlatSharp

Fast, idiomatic C# implementation of Flatbuffers
Apache License 2.0

Mechanism to support different offset (uoffset_t) size per table #179

Closed TYoungSL closed 3 years ago

TYoungSL commented 3 years ago

Per my comment at https://github.com/jamescourtney/FlatSharp/issues/158#issuecomment-869063251 and @aardappel's plans for extending FlatBuffers to be 64-bit capable and/or creating FlatBuffers2, a per-table attribute that changes the encoding to a non-standard offset size could be a useful stopgap until official action is taken to provide support.

table LargeTable (fs_offsetSize:8) { // long sized uoffset_t
  someLargeVector:[ulong]; // can be > 1GB, > 2GB, etc.
  anotherLargeVector:[ulong];
}

table SlightlyLargerTable (fs_offsetSize:5) { // throw not supported or truncated long
  someLargeVector:[ulong];
  anotherLargeVector:[ulong];
}

table SmallTable (fs_offsetSize:2) { // short sized uoffset_t
  someTinyTable:[byte]; // should fail to encode if length > 32767ish?
}

table TinyTable (fs_offsetSize:1) { // byte sized uoffset_t
  someTinyTable:[byte]; // should fail to encode if length > 127ish?
}
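To make the "32767ish"/"127ish" comments concrete, here is an illustrative calculation (not FlatSharp code) of the largest offset each fs_offsetSize could address, assuming the effective limit is the signed maximum, as those comments suggest:

```python
# Illustrative only: maximum representable offset per width, assuming the
# effective limit is the signed maximum (per the "32767ish"/"127ish"
# comments above), not part of any FlatSharp API.
def max_signed(offset_size_bytes: int) -> int:
    """Largest positive value of a signed integer of the given byte width."""
    return 2 ** (8 * offset_size_bytes - 1) - 1

for size in (1, 2, 4, 5, 8):
    print(f"fs_offsetSize:{size} -> max offset {max_signed(size):,}")
```

The standard 4-byte case yields 2,147,483,647, which is exactly the ~2GB ceiling discussed below.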

Of course such an implementation would not be binary compatible with other implementations, but keeping it simple might allow compatibility with a future spec.

An implementation of variable-sized ints is probably not worth the additional code required at this time, but they could be done in a similar fashion.

table VarIntOffsetTable (fs_offsetVarSize) { // use LEBs instead of fixed-size uoffset_t
  someTinyTable:[ulong]; // can be arbitrarily large
}

Variable-sized ints were discussed as a proposal for FlatBuffers2. They are a flexible alternative for achieving different offset sizes, but the additional code may not be necessary, and they are unlikely to provide any real-world benefit over explicitly specified per-table offset sizing.
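For reference, a sketch of the LEB128-style encoding an fs_offsetVarSize attribute implies (standard unsigned LEB128; illustrative only, not FlatSharp code):

```python
def encode_uleb128(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # set continuation bit; more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def decode_uleb128(data: bytes) -> int:
    """Decode an unsigned LEB128 integer from the start of data."""
    result = shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # continuation bit clear: last byte
            return result
        shift += 7
    raise ValueError("truncated LEB128")
```

Offsets below 128 stay one byte while arbitrarily large offsets remain representable, which is the flexibility/code-size trade-off described above.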

jamescourtney commented 3 years ago

Thanks for the thoughtful comments. This is something I'd love to see happen, but the line for FlatSharp right now is to not do anything that causes binary incompatibility with the canonical library.

So the starting point for this needs to be a proposal in the official FlatBuffers repo. If you don't get traction there, then I think it's worth considering whether there are sane ways FlatSharp can do extensions such that people can't mix standard and extension formats.

One idea for this is a set of directive statements at the top of the file:

FlatSharp.Directive OffsetSize = 8;

Ignoring the issue of imports, the thing I like about this is that it will cause a hard break with flatc, so it won't be possible to compile a schema there if this directive is also specified.

TYoungSL commented 3 years ago

@aardappel wants to do it for FlatBuffers2:

From https://github.com/google/flatbuffers/issues/5875#issue-607087371

  1. Support 64-bit offsets from the start. They would be optional for vectors and certain other things, allowing buffers >2GB. See https://github.com/google/flatbuffers/projects/10#card-14545298

There's plenty of support, but not immediate concern. I'll create an issue for it if I can't find relevant discussion in the issue regarding specifying offset size (https://github.com/google/flatbuffers/issues/5471).

Yes, a parser-breaking extension to the FBS grammar would be a good way to prevent unintentional apparent compatibility.

I'd propose something that other implementations could more easily support generically, e.g. along the lines of directive FlatSharp OffsetSize = 8; or similar, so that the grammar has a definite start token (directive), a namespace token (FlatSharp here), and a termination token (;), in the vein of #pragma. Anything unsupported by an implementation attempting to honor directives would then fail with a better explanation than just a line number.

table LargeTable {
  directive FlatSharp OffsetSize = 8; // long sized uoffset_t
  someLargeVector:[ulong]; // can be > 1GB, > 2GB, etc.
  anotherLargeVector:[ulong];
}

In a compiler that is attempting to support directives generically;

Unsupported table directive in LargeTable: FlatSharp OffsetSize = 8

In FlatSharp if OffsetSize is unsupported;

Unsupported FlatSharp table directive in LargeTable: OffsetSize = 8

In a compiler that is not attempting support;

Unknown member directive in LargeTable, line 2

A word other than directive may be appropriate, maybe something implying requirement rather than option.
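A minimal sketch of how a compiler could recognize that directive shape and produce the error messages above (the grammar and the SUPPORTED set are hypothetical, taken from this proposal rather than any implementation):

```python
import re

# Hypothetical grammar from the proposal above:
#   directive <Namespace> <Name> = <Value> ;
DIRECTIVE_RE = re.compile(
    r"^\s*directive\s+(?P<ns>\w+)\s+(?P<name>\w+)\s*=\s*(?P<value>[^;]+);\s*$"
)

# What this (hypothetical) compiler understands.
SUPPORTED = {("FlatSharp", "OffsetSize")}

def parse_directive(line: str, table: str) -> tuple[str, str, str]:
    """Parse a table-level directive, failing with a descriptive message."""
    m = DIRECTIVE_RE.match(line)
    if not m:
        raise SyntaxError(f"Unknown member directive in {table}")
    ns, name, value = m["ns"], m["name"], m["value"].strip()
    if (ns, name) not in SUPPORTED:
        # Names the rejected directive instead of only a line number.
        raise SyntaxError(
            f"Unsupported table directive in {table}: {ns} {name} = {value}"
        )
    return ns, name, value
```

The definite start and termination tokens are what make the match cheap and the failure modes specific, per the proposal.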

jamescourtney commented 3 years ago

I'm interested to see if the FlatBuffers2 proposal has any real traction or is just a wish list. Thanks for asking. I read it over a year ago and thought "that sounds neat!", and I still think that! It really would be a meaningful improvement to the format in a lot of ways.

With respect to FlatSharp, implementing "mixed-mode" offset sizes is going to be hard. FlatSharp's implementation wants to write a single method for each CLR type, so having one method that serializes TableA with 4-byte offsets and a different method for 8-byte offsets would invalidate a lot of assumptions. I'm not sure if that's better or worse than a global offset directive.

aardappel commented 3 years ago

  I'm interested to see if the FlatBuffers2 proposal has any real traction or is just a wish list.

A wish list.

jamescourtney commented 3 years ago

  I'm interested to see if the FlatBuffers2 proposal has any real traction or is just a wish list.

  A wish list.

Appreciate the response, @aardappel!

jamescourtney commented 3 years ago

@TYoungSL, I'm really of two minds about this.

The first half says that large buffers would be useful, and adding them in a non-dangerous way (i.e., incompatible with flatc) would be a fun project.

The other half says that at the end of the day, FlatSharp is an implementation of FlatBuffers, 2GB limit and all. So while I can toe the line of the FlatBuffers format with various features (type facades, indexed vectors, etc), binary compatibility and correctness of the format are king, because much of the usefulness of the project derives from those two things.

The parts that concern me most are:

TYoungSL commented 3 years ago

If ultimately we have data structures that exceed 1-2GB, perhaps FlatBuffers was a poor choice.

jamescourtney commented 3 years ago

  If ultimately we have data structures that exceed 1-2GB, perhaps FlatBuffers was a poor choice.

Most serialization formats are going to be poor choices if your goal is 100GB! Protobuf has a hard limit of 2GB. The only one I know of that supports 2^64 is Cap'n Proto, but I've not personally used it or the C# implementation.

Based on my admittedly limited knowledge of your problem space, I think you have a few viable paths forward, but they all involve you writing some code.

TYoungSL commented 3 years ago

Cap'n Proto's C# implementations leave something to be desired.

jamescourtney commented 3 years ago

Based on the comments in https://github.com/google/flatbuffers/issues/5471, it seems like 64 bit mode might have a future sooner than the FlatBuffers2 proposal. If that comes to pass, then I'll definitely support it in FlatSharp.

jamescourtney commented 3 years ago

To clarify, I will do my best to support it. There are still some limitations in .NET around int32 that need to be thought through. For example, Span<T> and IList<T> accept only int32 indexes.

TYoungSL commented 3 years ago

Yeah, interested to see what shakes out.