Closed TYoungSL closed 3 years ago
Thanks for the thoughtful comments. This is something I'd love to see happen, but the line for FlatSharp right now is not doing anything to cause binary incompatibility with the canonical library.
So the starting point for this needs to be a proposal in the official FlatBuffer repo. If you don't get traction there, then I think it's worth considering if there are sane ways FlatSharp can do extensions such that people can't mix standard and extension formats.
One idea for this is a set of directive statements at the top of the file:

```
FlatSharp.Directive OffsetSize = 8;
```
Ignoring the issue of imports, the thing I like about this is that it will cause a hard break with `flatc`, so it won't be possible to compile a schema there if this directive is also specified.
@aardappel wants to do it for FlatBuffers2. From https://github.com/google/flatbuffers/issues/5875#issue-607087371:

> Support 64-bit offsets from the start. They would be optional for vectors and certain other things, allowing buffers >2GB. See https://github.com/google/flatbuffers/projects/10#card-14545298
There's plenty of support, but no immediate concern. I'll create an issue for it if I can't find relevant discussion in the existing issue about specifying offset size (https://github.com/google/flatbuffers/issues/5471).
Yes, a parser-breaking extension to the FBS grammar would be a good idea to break unintentional apparent compatibility.
I'd propose something that other implementations could more easily support generically, e.g. along the lines of `directive FlatSharp OffsetSize = 8;` or similar, such that the grammar can be expressed with a definite start token (`directive`), a namespace token (`FlatSharp` here), and a termination token (`;`), in the vein of `#pragma`.
Any directive unsupported by an implementation attempting to honor it would fail with a better explanation than just a line number.
```
table LargeTable {
    directive FlatSharp OffsetSize = 8; // long-sized uoffset_t
    someLargeVector:[ulong];            // can be > 1GB, > 2GB, etc.
    anotherLargeVector:[ulong];
}
```
In a compiler that attempts to support directives generically:

```
Unsupported table directive in LargeTable: FlatSharp OffsetSize = 8
```

In FlatSharp, if OffsetSize is unsupported:

```
Unsupported FlatSharp table directive in LargeTable: OffsetSize = 8
```

In a compiler that does not attempt support:

```
Unknown member directive in LargeTable, line 2
```
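To illustrate, here is a minimal sketch of how a compiler might recognize such directives generically and produce messages like the ones above (Python for brevity; the regex grammar and the set of supported directives are assumptions for illustration, not FlatSharp's actual parser):

```python
import re

# A directive has a definite start token ("directive"), a namespace token,
# a body, and a ";" terminator, optionally followed by a comment.
DIRECTIVE = re.compile(r"^directive\s+(\w+)\s+(.+?)\s*;\s*(?://.*)?$")

SUPPORTED = {("FlatSharp", "OffsetSize")}  # what this compiler understands

def check_directive(line, table, line_no):
    """Return an error message, or None if the directive is accepted."""
    m = DIRECTIVE.match(line.strip())
    if not m:
        # A compiler with no directive support can only point at the line.
        return f"Unknown member directive in {table}, line {line_no}"
    namespace, body = m.groups()
    key = body.split("=")[0].strip()
    if (namespace, key) not in SUPPORTED:
        return f"Unsupported table directive in {table}: {namespace} {body}"
    return None

print(check_directive("directive FlatSharp OffsetSize = 8;", "LargeTable", 2))  # None
print(check_directive("directive FlatSharp VectorSize = 8;", "LargeTable", 2))
```

The point of the fixed start/namespace/terminator structure is exactly this: an implementation that knows nothing about a given directive can still report *what* it failed to understand, not just *where*.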
A word other than `directive` may be appropriate; maybe something implying more requirement than option.
> I'm interested to see if the FlatBuffers2 proposal has any real traction or is just a wish list.

Thanks for asking. I read it over a year ago and thought "that sounds neat!", and I still think that! It really would be a meaningful improvement to the format in a lot of ways.
With respect to FlatSharp, implementing "mixed-mode" offset sizes is going to be hard. This is due to implementation details in FlatSharp where it wants to write a single method for each CLR Type, so if there is a method to serialize `TableA` with 4-byte offsets and a different method for 8-byte offsets, that's going to invalidate a lot of assumptions. Not sure if that's better or worse than having a global offset directive.
> I'm interested to see if the FlatBuffers2 proposal has any real traction or is just a wish list.

A wish list.
Appreciate the response, @aardappel!
@TYoungSL, I'm really of two minds about this.
The first half says that large buffers would be useful, and adding them in a non-dangerous (i.e., incompatible with `flatc`) way would be a fun project.
The other half says that at the end of the day, FlatSharp is an implementation of FlatBuffers, 2GB limit and all. So while I can toe the line of the FlatBuffers format with various features (type facades, indexed vectors, etc), binary compatibility and correctness of the format are king, because much of the usefulness of the project derives from those two things.
The parts that concern me most are:
> If ultimately we have data structures that exceed 1-2GB, perhaps FlatBuffers was a poor choice.
Most serialization formats are going to be poor choices if your goal is 100GB! Protobuf has a hard limit of 2GB. The only one I know of that supports 2^64 is Cap'n Proto, but I've not personally used it or the C# implementation.
Based on my admittedly limited knowledge of your problem space, I think you have a few viable paths forward, but they all involve you writing some code.
1. Write an indirection wrapper on top of your root FlatBuffer type. I don't think you ever said that your arrays of structs had to be contiguous, so have 100 1GB files and just use some modulo arithmetic to figure out the mapping of index -> file. Matrices might be slightly trickier, but it seems like you could divide the larger matrix into many smaller ones. You can then wrap this in an object that exposes an indexer accepting a `long` and presents a unified view to the world. Then your logic can operate mostly independently of the actual file structure of the data. I think you maintain your ability to use SIMD here as well.
2. I don't think you ever mentioned interop being a concern, so fork FlatSharp and use it as the basis for a FlatBuffer-inspired data format of your own design. This would allow you to support huge buffers natively, add first-class support for `System.Numerics` types, and other things your use cases require. I'd be happy to consult with you on some of these changes. The changes wouldn't be huge -- the troubling part for me is trying to run some flavor of mixed mode where FlatSharp supports both bigbuffer and standard modes. Either by itself is a much more straightforward configuration.
3. Consider binary encoding or Cap'n Proto.
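The indirection idea in option 1 can be sketched in a few lines (Python here for brevity; `ShardedVector`, the shard size, and the in-memory lists standing in for the 1GB files are all hypothetical):

```python
class ShardedVector:
    """Present many fixed-size shards as a single long-indexed vector."""

    def __init__(self, shards, shard_size):
        self.shards = shards          # one entry per backing file
        self.shard_size = shard_size  # elements per shard (all but the last are full)

    def __len__(self):
        return sum(len(s) for s in self.shards)

    def __getitem__(self, index):
        # Modulo arithmetic maps a global index to (file, local offset).
        shard, offset = divmod(index, self.shard_size)
        return self.shards[shard][offset]

# Small lists stand in for memory-mapped 1GB FlatBuffer files:
v = ShardedVector([[0, 1, 2], [3, 4, 5], [6, 7]], shard_size=3)
print(v[4], len(v))  # -> 4 8
```

Calling code only ever sees the wide index; which file a given element lives in stays an implementation detail of the wrapper.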
Cap'n Proto's C# implementations leave something to be desired.
Based on the comments in https://github.com/google/flatbuffers/issues/5471, it seems like 64 bit mode might have a future sooner than the FlatBuffers2 proposal. If that comes to pass, then I'll definitely support it in FlatSharp.
To clarify, I will do my best to support it. There are still some limitations in .NET around int32 that need to be thought through. For example, `Span<T>` and `IList<T>` accept only `int32` indexes.
Yeah interested to see what shakes out.
Per my https://github.com/jamescourtney/FlatSharp/issues/158#issuecomment-869063251 leading to @aardappel's plans for extending FlatBuffers to be 64-bit capable and/or creating FlatBuffers2, a per-table attribute that changes the encoding to a non-standard offset size may be a useful stopgap until official action is taken to provide support.
Of course such an implementation would not be expected to be binary compatible with other implementations, but keeping it simple might allow it to be compatible with a future spec.
An implementation of variable-sized ints would not be useful enough to justify the additional code at this time, but they could be done in a similar fashion.
They were discussed as a proposal for FlatBuffers2. They are a flexible alternative way to achieve different offset sizing, but the additional code may not be necessary to achieve the same goals and is unlikely to provide any real-world benefit over explicitly specified per-table offset sizing.
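For reference, "variable-sized ints" here means something like the LEB128-style varints Protobuf uses: small values take one byte, and larger values grow as needed. A minimal sketch of the generic technique (not any FlatBuffers2 wire format):

```python
def encode_varint(value: int) -> bytes:
    """LEB128-style: 7 payload bits per byte; high bit set means 'more follows'."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(data: bytes) -> int:
    result = shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return result

print(encode_varint(300).hex())  # -> ac02 (2 bytes instead of 4 or 8)
```

The trade-off is visible in the decoder: the reader must scan byte by byte to find the end of each int, which costs the fixed-layout random access that explicit per-table offset sizes would preserve.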