aardappel opened this issue 4 years ago
Some quick thoughts:
Ad 1. We would need to (potentially) deal with compiler/target platform problems when dealing with unaligned access. Is adding padding such a big problem at the moment?
Ad 4. This sounds like a compatibility problem while migrating schemas.
@krojew
1) is not a problem. In the most basic case you make every scalar load go through `memcpy`, which gets optimized by every compiler (tested with clang, gcc and MSVC) into a single memory load, but one that is guaranteed to work unaligned. Padding is not a huge problem, but not having it enables a lot of other features (see the rest of the list) which would normally be pointless since 32-bit alignment is so prevalent.
4) Not sure what you mean. You would opt in to 8-bit offsets. Once you put that in your schema, that table will always use 8-bit, in all languages, and never change.
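For the curious, the `memcpy` trick from 1) looks roughly like this (the helper name is mine for illustration, not part of the FlatBuffers API):

```cpp
#include <cstdint>
#include <cstring>

// Read a uint32 from a possibly unaligned address. The memcpy is defined
// behavior at any alignment, and clang/gcc/MSVC all collapse it into a
// single load instruction, so it is free on x86 and safe on strict-
// alignment targets like older ARM.
inline uint32_t read_u32_unaligned(const void* p) {
  uint32_t v;
  std::memcpy(&v, p, sizeof(v));
  return v;
}
```

The same pattern works for any scalar width, which is what makes dropping padding feasible without undefined behavior.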
random thoughts:
Overall, I like many of the suggestions, but there are too many optional and variable parameters in the proposal which would make things slow. There is a lot more branching going on, and code complexity.
Detractors to Wouter's original comment:
my inclusive comments:
I really would like to remove nested buffers; they have caused me so much headache and they haven't been properly implemented elsewhere. We could have a packed buffer format where multiple buffers can be embedded in the same file or memory block and referenced by some identifier, without actually storing one buffer inside another. It could be an 8-byte random identifier. Buffers could also be given a random initial identifier in this format. Just a thought. This means how buffers are stacked or packed is not critical as long as they can be located.
We also need a proper NULL type for representing database data.
Somehow I had persuaded myself that 9) doesn't need a format change and that it could be done by just building from the other end of the buffer and allocating space for the full message and vtable when creating the object. It would require a pretty massive API change though. Maybe I'm just hopeful.
My understanding of the premise of flatbuffers (and Cap'n'Proto, which we evaluated before picking flatbuffers) is that compression algorithms are wonderful, so don't burden the format with being clever and compact. Protobufs attempt that, and end up requiring a separate serialization/deserialization step. Sharing vtables goes against that.
Most of the rest of my feedback is at the API level. Happy to give it if there is interest.
One of the selection factors that resulted in us picking flatbuffers over other formats like protobufs was that flatbuffers doesn't use varints. If they pop up in v2, perhaps it could be an opt-in? Or maybe an attribute applied to fields?
I agree with @AustinSchuh about compression; I think flatbuffers' niche is that by default they are very efficient at runtime, even if they trade off space for that efficiency. In our use we transport flatbuffers over the wire in http that's gzipped/deflated/brotli'd, and on disk we persist them squashed with zstandard, so encoding tricks I think wouldn't really buy us much, but would hurt in our application use.
Please-oh-please add size prefixing by default.
I would almost prefer an inversion of the current required/optional field status, having fields set to be required by default unless annotated to be optional.
2. Not sure what you mean. You would opt in to 8-bit offsets. Once you put that in your schema, that table will always use 8-bit, in all languages, and never change.
If that's an explicit opt-in, then it seems fine.
- I would almost prefer an inversion of the current required/optional field status, having fields set to be required by default unless annotated to be optional.
I absolutely agree with this. I would extend this a bit further to add proper optional scalar fields. Adding bool flags for whether a scalar is present or not, as we need to right now if 0 is a legitimate value, is quite a frustrating workaround, given non-scalars have this built-in.
We have optional scalar fields today. It just isn't plumbed up to be accessible by default to the user. The offset for the scalar fields in the vtable is 0 if they aren't populated. See `CheckField<>` and `GetField<>` (for how it handles defaults) for the gory implementation details. (I've got a patch which implements `has_` in C++ if there is upstream interest.)
Protobuf started out with required being the default and concluded that it was a bad idea. https://github.com/protocolbuffers/protobuf/issues/2497 is a small part of the discussion.
We have optional scalar fields today. It just isn't plumbed up to be accessible by default to the user.
If something isn't publicly available, it doesn't exist from the user perspective. We should expose such information.
Additionally, I have the impression we're reaching the classic dilemma of speed vs size. So the first question we need to answer is: which one do we favor? For me, FB was always about performance, so if we need to keep padding or sacrifice some internal improvements for the sake of being fast, I would personally stick with that. If some improvement doesn't impact performance, then let's consider it.
- Allow different size vtable offsets. [...] you can't undo it when your table grows.
Maybe a varint?
Would it be possible to have most (all?) of these features be specified with flags at the beginning of a payload? I know that that gets into "framing format" territory, but providing a set of feature flags at the beginning of a table (or at the beginning of an entire buffer) could allow a lot of flexibility at little cost.
It would probably just be an optional initial table at the beginning, containing metadata.
My 5 cents.
My personal view on the following:
- Construct the buffer forwards (rather than backwards, as all current implementations do). This simplifies a lot of code and buffer management. I really like it; at least for Swift, we would be able to handle things much more smoothly than in the current implementation.
@maxburke
I would almost prefer an inversion of the current required/optional field status, having fields set to be required by default unless annotated to be optional.
@krojew
I absolutely agree with this. I would extend this a bit further to add a proper optional scalar fields. Adding bool flags if a scalar is present or not, as we need to right now if 0 is a legitimate value, is a quite frustrating workaround, given non-scalars have this built-in.
This is an amazing idea, actually. It made me wonder whether we could actually remove vtables completely, since all the fields are going to be required, and optional ones, as mentioned above, can be represented with a bool. This would allow us to remove the vtable and use the generated table layout that's already predefined.
Example: when trying to encode the monster object, the following vtable bytes are produced:

```
4, 0, 0, 0, 0, 8, 0, 12, 0, 16, 20, 0, 24, 0, 28, 32, 36, 40, 44, 48, 0, 0, 0, 52, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 56, 0, 60, 64, 68, 0, 0, 0, 72, 0, 76, 0, 0, 80
```

-> Monster.name, offset.

When the same vtable would be written a second time, it could instead be removed and replaced with a reference to the previously encoded one (e.g. `245, 245, 250, 0`). Now all we have to do is relate it to the predefined table we have, where we can simply reference the number of the table instead.
Regarding the optional/required table field status mentioned by @maxburke: I think changing the current behaviour is a bad idea in regards to backward and forward compatibility. Marking fields as non-optional is a "lie" we make for API convenience. There are no guarantees that a field is present, as we have no control over who created the buffer, given your system is not a closed one. When we define a field as required, we commit to never being able to deprecate that field. We also can't mark a new field as required unless we can guarantee that no older clients, which don't know this field exists, will send buffers to newer clients which require this field to exist.
Speaking of forwards and backwards compatibility, enums are another blind spot, specifically the way enums are converted to JSON. I think in order to grant proper backwards and forwards compatibility, enums need to always be represented as numbers, even though text looks nicer in JSON; text breaks with new cases and also if a case has to be renamed.
I absolutely agree with this. I would extend this a bit further to add a proper optional scalar fields. Adding bool flags if a scalar is present or not, as we need to right now if 0 is a legitimate value, is a quite frustrating workaround, given non-scalars have this built-in.
There are two options for how it can be solved in the current FlatBuffers implementation, e.g. by forcing serialization of the default `0` (I think most libraries have code for that). I guess for the new version this kind of feature could be addressed in a more straightforward way, by being able to define a table scalar field as optional in the schema directly.
Switching to required by default does not break future compatibility any more than the current attribute. On the other hand, we gain possible optimizations and more intuitive schemas. Today some fields are truly optional and require annotating with an attribute to make them required, while others cannot be annotated as such, and it's up to the users to guess whether they're optional or not.
Switching to required by default does not break future compatibility any more than the current attribute.
There are two ways one can design the default behaviour:
Making fields optional by default is the safety-first approach. You can add the `required` keyword to the schema any time you decide it is good for you. Switching from required to optional, however, is potentially dangerous in the long run.
On the other hand we gain possible optimizations and a more intuitive schemas.
Optimizations from a technical (performance) perspective? I don't see them, but I am open to suggestions.
Today some fields are truly optional and require annotating with an attribute to make required, while other cannot be annotated as such and it's up to the users to guess if they're optional or not.
When in doubt, always check for `null`. Specifically if you use FlatBuffers for communication: if I know your server expects a required field, I can send you buffers with null and crash your servers. This is a super easy DDoS attack angle.
Making fields optional by default is the safety-first approach. You can add the `required` keyword to the schema any time you decide it is good for you. Switching from required to optional, however, is potentially dangerous in the long run.
This will be a bit anecdotal, but every time I have been introducing FB to new people, optional by default was quite surprising to them. Also, experience shows people tend to create schemas corresponding to their data model, so if something is required now, it gets marked as required. I believe switching to required by default will be the intuitive way to go. We'll never have complete safety or be future proof unless everything is optional always, which is not the goal.
Optimizations from a technical (performance) perspective? I don't see them, but I am open to suggestions.
Required things can always be stored inline. This is quite good for performance.
When in doubt, always check for `null`. Specifically if you use FlatBuffers for communication: if I know your server expects a required field, I can send you buffers with null and crash your servers. This is a super easy DDoS attack angle.
Scalar types don't have a notion of null. That's another thing I would love to see — take a look at `Option` in Rust. Also, verifying a buffer is a separate subject, so let's not mix it in.
Side note, is anyone working on a verifier for Rust, like in cfb?
This will be a bit anecdotal, but every time I have been introducing FB to new people, optional by default was quite surprising to them.
Ok, let's go with anecdotal :) In 2015 I worked on a city builder game where we stored user progress in FlatBuffers. The game was developed in Unity3D, but the town map itself was an isometric view, so a building position could be identified with a `Position` table which carries `x` and `y` as grid coordinates. In 2016 game designers introduced hills on the map, so a `Position` carrying just `x` and `y` was no longer sufficient. We had to introduce a `z` field. We introduced it and everything worked perfectly smoothly. Imagine we had introduced `z` as a required field. What would happen? Probably nothing in development, as in development you mostly start from a fresh state, or have an admin tool to populate the city automatically. But when we shipped the change, all the new versions of the game would crash on start. Why? Because the stored game state has a `Position` without the `z` field, and the `z` field is now required. This would be a small disaster, as it was a mobile game, and deploying a fix on iOS can take days. So none of the existing players can play the game for days; you get a short-term impact of revenues going down, and a possible long-term impact of people installing another game of the same genre and abandoning your game altogether.
What I am trying to visualise with this colourful example is that with required being the default, the evolution of a schema becomes a minefield, which only people with experience will be able to avoid. To be honest with you, when we switched from `Position(x, y)` to `Position(x, y, z)` we would have stepped on that mine as well. It was our luck that we were protected by the sensible default behaviour of FlatBuffers. Because you are absolutely correct:
... people tend to create schemas corresponding to their data model, so if something is required now, it gets marked as required
Also regarding this:
We'll never have complete safety or be future proof unless everything is optional always, which is not the goal.
I am not sure why removing required altogether is not the goal? Protocol Buffers version 3 did it, and I personally would vote for doing it in FlatBuffers too. :)
Anyways, I think I wrote enough. The decision lies with @aardappel.
@mzaks I think there's a misunderstanding somewhere, because your example is quite wrong in this case. First of all, adding a new field to a table (required or not) will not break existing code (assuming the usual schema evolution guidelines). Second, there's domain data known to be required, and that gets marked as required, which is perfectly ok. Arguing that everything should be optional is over-defensive and will only lead to the frustration of not being able to express the data model properly in messages (with the additional burden of fact-checking every piece of data). Third, saying that someone did something is not a valid technical argument :)
That being said, my opinion on optional data for a future protocol, if it ever happens, is:
I agree strongly with @rw that using a format specifier as part of the header would definitely be a great way to evolve the format.
Lots of great comments on this thread, here's my thoughts to add to the mix:
The current implementation constructs these buffers backwards, since that significantly reduces the amount of bookkeeping and simplifies the construction API.
Is that no longer the case?
- Support for indexed tables, where a flatbuffer table is annotated with "key" and "value" annotations. This sounded like a niche use case, but I believe it's an important one and likely to be occupied by something else if not addressed.
I think 0-terminated strings are not an issue. Removing C-style string getters would solve it.
- Remove all padding.
It is possible if `bit_cast<T>` is used in C++ wherever scalars are accessed. It would put the C++ implementation on an equal footing with other languages that can't use reinterpret_cast.
It would be better to assume that alignment (padding) of a field may be a random number in the range [0..N] (not a power of 2). Internal implementation should not depend on a user-defined alignment that can be specified for every field in a schema.
- Allow different size vtable offsets.
- Allow 8 and 16 bit size fields on strings and vectors, currently they're always 32.
For vtable offsets and types, ASN.1 or variable-length encoding can be used. It may be a good idea to make the length of an fbs message aligned to 4 or 8, to pre-fetch data without extra checking. This is a simple 1:2, 1:4, or 1:8 decoder:

```cpp
auto x = load_uint32(bytes); // it can be uint16 or uint64
auto offset = (((0u - (x & 1u)) & x) | (x & 0xFFu)) >> 1u; // [0x00-0x7F] or [0-0x7FFFFFFF]
bool is_1 = !(x & 1u);
```
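A runnable version of that decoder, under my own encoding assumptions (the low bit of the leading byte selects 1-byte vs 4-byte width; the buffer is assumed padded so the 4-byte load never runs past the end):

```cpp
#include <cstdint>
#include <cstring>

// Sketch (function name is mine): decode a 1:4 variable-width offset.
// Leading bit 0 -> a one-byte value in [0, 0x7F];
// leading bit 1 -> a four-byte little-endian value in [0, 0x7FFFFFFF].
// Four bytes are always loaded, avoiding a branch on the width.
uint32_t decode_1to4(const uint8_t* bytes, bool* was_one_byte) {
  uint32_t x;
  std::memcpy(&x, bytes, sizeof(x));  // unaligned-safe load
  *was_one_byte = !(x & 1u);
  return (((0u - (x & 1u)) & x) | (x & 0xFFu)) >> 1u;
}
```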
FlexBuffers: a fast FlexBuffer with writing to pre-allocated memory (without any memory allocations) can be useful as a fast logging core.
@mikkelfj not sure why you refer to some of these as "slow": like I said, these are all intended to be codegen-ed (or become a template/macro arg) so will all be maximally fast, certainly no slower than any existing feature. None of these features are intended to have dynamic checks.
Not sure what you mean by strong JSON integration. JSON is text and not random access, so serves an entirely different use case than FlexBuffers. I am not sure what it would even mean to use JSON in this case, other than to store a string.
A new format could mean a new way to do `file_identifier`; I'd be open to that. It could be variable length.
"drop nested tables" .. what does that mean?
@AustinSchuh forward building would be a big format change, since now children typically come before parents, and unsigned child offsets would be flipped to always point downwards in memory instead. Retaining the feature that these offsets always point one way and never can form a cycle would be good, I think.
A lot of people use FlatBuffers in cases where sparse/random access, or use with `mmap`, are important, and those don't work with an external compression pass. I personally think there's a lot of value in storing things more compactly in memory that is directly usable. Or at least, that is what FlatBuffers specializes in. None of these more compressed representations should be any slower (in fact, faster) than the current representation; they mostly just complicate codegen.
@maxburke the varints would be very much optional, as they are definitely slower to read. They would be a type, so you'd explicitly write either `int` (as right now) or `varint` for a field.
See above about compression and efficiency.
Default required would be problematic for an evolving format. Protobuf came to the opposite conclusion after many years of experience, and FlatBuffers went along with that.
@AustinSchuh
The offset for the scalar fields in the vtable is 0 if they aren't populated
Yes, and that means that the value is equal to the default, not that the value is not present. So can't be used to test for this purpose.
I agree that not being able to differentiate this has been something many users would have wanted. On the other hand being able to access scalars "blindly" without checking for presence, and the storage savings from defaults is not something I would want to miss.
@rw
Maybe a varint?
That would make the distinction dynamic, and costly. Besides, we need to index into this table, which makes varints useless unless all the same size. The idea is that this is a static, codegen feature, associated with a particular type of table.
Would it be possible to have most (all?) of these features be specified with flags at the beginning of a payload?
No, for the same reason. It's extremely important FlatBuffers stays fast, so can't rely on dynamic checks for different encodings. Besides, the idea is to specify them per type, not per buffer.
@mzaks Unaligned accesses will crash on ARM if the C/C++ (or Swift) code was generated assuming alignment, since unlike x86 they are different instructions. But it is pretty easy to portably generate unaligned instructions (from C/C++ at least).
I wouldn't want to add more branching to vtable access, so this can't be a dynamic flag.
I already suggested being able to specify the type of the size field of a string, so varint would be a great option there :) It could even be the default, given that 99% of strings are probably < 128 bytes, and reading a 1 byte LEB is very fast.
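To illustrate the cost, a LEB-style length read could look like this (the helper name is mine, not an existing FlatBuffers API): for the common case of strings under 128 bytes it touches a single byte and never loops.

```cpp
#include <cstddef>
#include <cstdint>

// Sketch: LEB128-style string length. Lengths < 128 cost one byte with
// the continuation bit clear; longer lengths spill into extra bytes,
// 7 value bits per byte, least-significant group first.
uint32_t read_leb_len(const uint8_t* p, size_t* consumed) {
  uint32_t len = 0;
  int shift = 0;
  size_t n = 0;
  uint8_t b;
  do {
    b = p[n++];
    len |= uint32_t(b & 0x7Fu) << shift;
    shift += 7;
  } while (b & 0x80u);
  *consumed = n;
  return len;
}
```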
We can allow cycles if someone invents a verifier that can deal with cycles and uses no memory :) I personally do not care for cycles but that is a discussion we can have if it ever comes to this.
I guess if the imaginary time comes when we get to make a V2 of FlatBuffers, also revving FlexBuffers at the same time would make a lot of sense. I haven't thought too much about what I would change there.
@mustiikhalil if you wanted to remove vtables and make everything required, then your ideal format already exists: it's called Cap'n Proto! ;)
I recommend we do not go too far into the "required vs optional" default discussion, because in the end this is an API/schema feature, not a binary format feature, assuming you want to keep vtables.
@stewartmiles
Edit: added blank numbers, since github markdown renumbers these to be contiguous if not present??
@adsharma We already have this? You can annotate a field as `key` (and the other fields are effectively the value) and then use it with binary search lookups if you place these in a vector.
@vglavnyy `std::bit_cast` takes a `const T&`, which you cannot legally construct an unaligned version of, so it doesn't seem that useful? To be fully correct, we'd have to go via `std::memcpy`, which all compilers luckily can optimize into a single memory load.
@rw
Would it be possible to have most (all?) of these features be specified with flags at the beginning of a payload?
No, for the same reason. It's extremely important FlatBuffers stays fast, so can't rely on dynamic checks for different encodings. Besides, the idea is to specify them per type, not per buffer.
I still think this is worth looking into more. The reason is that this could open up a FlatBuffers ecosystem, beyond just one new FlatBuffers version. Maybe there's a way to have a compile-time fast path for AoT feature flag specifications, alongside a slightly-dynamic mode of operation, too.
@AustinSchuh
The offset for the scalar fields in the vtable is 0 if they aren't populated
Yes, and that means that the value is equal to the default, not that the value is not present. So can't be used to test for this purpose.
I agree that not being able to differentiate this has been something many users would have wanted. On the other hand being able to access scalars "blindly" without checking for presence, and the storage savings from defaults is not something I would want to miss.
I forgot to mention, we run with `ForceDefaults(true)` everywhere. The presence of that option, support for per-field metadata in the schema, and zero-copy reads were the reasons we chose FlatBuffers. Once you do that, `has` works as expected. There is also nothing saying that `has` and returning the default when unset are exclusive. FlatBuffers today returns the default if the field isn't populated.
@aardappel
We can allow cycles if someone invents a verifier that can deal with cycles and uses no memory
First thing would be to enhance flatc to detect recursive table definitions. I should be able to do that; I've already done it for my code generators. The verification and table build API can stay as is if there are no recursions in the definition. If a cycle is detected, flatc can check if a special attribute (e.g. `allow_cycles`) is present in the schema. If it is not present, the generated code stays the same as if there were no cycles. Verification is still fast; building a buffer with the standard API does not support building a cyclic graph anyways. But if the user explicitly marked the schema with `allow_cycles`, then flatc produces an additional builder API for a second-pass building step. This implies that we can add a dummy table reference in the first pass and replace it with the actual table reference in the second pass.
Verification can be done in two steps as well. In the first pass we do the fast and established verification. However, if a reference value points backwards and the user enabled `allow_cycles`, we keep the backwards references in memory and check in the second pass that they point to actual tables.
This would follow the premise: keep fast things fast, and slow things possible, if necessary.
@adsharma We already have this? You can annotate a field as `key` (and the other fields are effectively the value) and then use it with binary search lookups if you place these in a vector.
I don't think the current code is good enough to support a database with multiple key fields and being able to construct a valid flatbuffer with zero copy from a key+value read from a store.
Previous work from 2016:
https://drive.google.com/open?id=0BwPRLHxpLD-adVM2UVNFMVFCUm8 https://github.com/adsharma/flatbuffers/commits/byteorder2
it would be awesome to have more scalar types available:
- `uint128`: used e.g. in UUIDs
- `uint160`: used e.g. in Ethereum for blockchain addresses
- `uint256`: used e.g. in Ethereum for all numbers
- `decimal128`: decimal floating point (https://en.wikipedia.org/wiki/Decimal128_floating-point_format), used e.g. in accounting/databases

it would be awesome to have more scalar types available:
This can be done with fixed length arrays in the current format. However, a typedef in the schema would be helpful.
I think having a uuid as a built-in type would be useful. It's so popular, I think having it, even as a sugar for an array, will benefit users.
That's what a typedef will do for you. Meanwhile it can be wrapped in a struct.
This can be done with fixed length arrays in the current format.
can fixed length array types be used the same as built-in scalars (say `uint64`)? would be great, because if so, yes, a built-in typedef would fully solve this ..
another source of difference might be the binding code generators producing different bits for fixed size arrays vs scalars?
can fixed length array types be used the same as built-in scalars (say `uint64`)? would be great, because if so, yes, a built-in typedef would fully solve this ..
No, but they can be used for things like UUIDs or crypto values. A typedef solution would be able to define uint64 to mean something else and preserve the underlying int type. Regardless, this has nothing to do with FB2; it can easily be added to current FB if so desired. You could also have a struct with just a uint64 member to simulate a typedef, but that isn't very convenient.
@AustinSchuh but force defaults potentially wastes a lot of space, so I don't think it's a particularly good solution. In particular, calls like `Create`, which serialize all fields, assume default values are not serialized.
@mzaks yes, as opt-in with some kind of `allow_cycles` it could work, if there's enough demand for such a thing. We'd need signed offsets in general then, though, or generate code to interpret them as signed with `allow_cycles` on.. that will complicate the library code a lot.
@adsharma multiple keys would also require multiple sort orders to be efficient, or indices separate from the vector containing the values.. the current solution is simple because it works almost transparently on existing vectors.
If you wanted multiple sort orders, you could do this today by storing multiple vectors. But yeah, we'd need some schema support to indicate which key is used with which vector.
@oberstet @mikkelfj @krojew
I'd say a `struct` makes for a pretty good typedef already: `struct UUID { v:[uint:4]; }`
We could consider a native `typedef`, but I am not sure what it would add. The problem with `typedef` in C/C++ is that you typically don't want to allow arbitrary mixing of the new type with the type it's based on, which is exactly what `struct` already prevents.
Of course we have no way to `typedef` strings/vectors currently, so who knows, maybe that would be useful.
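For concreteness, the struct-as-typedef pattern in schema form (the `Account` table and its fields are a hypothetical example, not from any real schema):

```
// Hypothetical .fbs schema: wrapping the fixed-size array in a struct
// gives an opaque, typedef-like type that can't be accidentally mixed
// with plain [uint:4] arrays or other 16-byte values.
struct UUID { v:[uint:4]; }

table Account {
  id: UUID;       // stored inline, 16 bytes
  name: string;
}
```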
If I were to use flatbuffers instead of SQL DDL to represent a database, we would have one flatbuffer for the primary key + data and one flatbuffer for each index, where the key is a sequence of fields making up the secondary fields and the value is the primary key (typically an id).
This is a powerful use case for copy elimination on fast devices such as NVMe or even in-memory.
FlatBuffer's binary format has been set in stone for 6.5 years now, because we value binary forwards/backwards compatibility highly, and because we have a large investment in 15? or so language implementations / parsers etc. that would not be easy to redo.
So "V2" in the sense of a new format that breaks backwards compatibility may never happen. But there is definitely a list of issues with the existing format that if a new format were to ever happen, would be nice to address. I realized I never made such a list. It would be nice to at least fantasize what such a format could look like :)
Please comment what you would like to see. Note that this list is purely for things that would change the binary encoding, or larger additions to the binary encoding. Anything that can be solved with code / new APIs outside of the binary format does not belong on this list.
[...] `string_view` recently, and both have been using `size_t` arguments for a long time rather than relying on `strlen`. Other languages don't use it. For passing to super-old C APIs that expect 0-termination, either swap the terminating byte temporarily while passing that string, or copy.
[...] `"a"` goes from 12 bytes (2 vtable + 4 offset + 4 size + 1 string + 1 terminator) to 3 bytes (1 vtable + 1 size + 1 string). Of course this is very inflexible and special purpose, but it gives users more options for compact data storage. Again, like all format variation above, this comes at no runtime cost, just some codegen complexity.
[...] a `varint` type for fields. They could be added to the existing format but make a lot more sense in a system with no alignment.

@rw @mikkelfj @vglavnyy @mzaks @mustiikhalil @dnfield @dbaileychess @lu-wang-g @stewartmiles @alexames @paulovap @AustinSchuh @maxburke @svenk177 @jean-airoldie @krojew @iceb0y @evanw @KageKirin @llchan @schoetbi @evolutional