aardappel opened this issue 4 years ago
Some quick thoughts:
Ad 1. We would need to (potentially) deal with compiler/target platform problems when dealing with unaligned access. Is adding padding such a big problem at the moment?
Ad 4. This sounds like a compatibility problem while migrating schemas.
@krojew
1) is not a problem. In the most basic case you make every scalar load go through `memcpy`, which gets optimized by every compiler (tested with clang, gcc and MSVC) into a single memory load, but one that is guaranteed to work unaligned. Padding is not a huge problem, but not having it enables a lot of other features (see the rest of the list) which would normally be pointless since 32-bit alignment is so prevalent.
4) Not sure what you mean. You would opt in to 8-bit offsets. Once you put that in your schema, that table will always use 8-bit, in all languages, and never change.
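For the curious, the `memcpy` trick from 1) looks roughly like this (the helper name is mine for illustration, not part of the FlatBuffers API):

```cpp
#include <cstdint>
#include <cstring>

// Read a uint32 from a possibly unaligned address. The memcpy is defined
// behavior at any alignment, and clang/gcc/MSVC all collapse it into a
// single load instruction, so it is free on x86 and safe on strict-
// alignment targets like older ARM.
inline uint32_t read_u32_unaligned(const void* p) {
  uint32_t v;
  std::memcpy(&v, p, sizeof(v));
  return v;
}
```

The same pattern works for any scalar width, which is what makes dropping padding feasible without undefined behavior.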
random thoughts:
Overall, I like many of the suggestions, but there are too many optional and variable parameters in the proposal which would make things slow. There is a lot more branching going on, and code complexity.
Detractors to Wouter's original comment:
my inclusive comments:
I really would like to remove nested buffers; they have caused me so much headache and they haven't been properly implemented elsewhere. We could have a packed buffer format where multiple buffers can be embedded in the same file or memory block and referenced by some identifier, without actually storing one buffer inside another. It could be an 8-byte random identifier. Buffers could also be given a random initial identifier in this format. Just a thought. This means how buffers are stacked or packed is not critical as long as they can be located.
We also need a proper NULL type for representing database data.
Somehow I had persuaded myself that 9) doesn't need a format change and that it could be done by just building from the other end of the buffer and allocating space for the full message and vtable when creating the object. It would require a pretty massive API change though. Maybe I'm just hopeful.
My understanding of the premise of flatbuffers (and Cap'n'Proto, which we evaluated before picking flatbuffers) is that compression algorithms are wonderful, so don't burden the format with being clever and compact. Protobufs attempt that, and end up requiring a separate serialization/deserialization step. Sharing vtables goes against that.
Most of the rest of my feedback is at the API level. Happy to give it if there is interest.
One of the selection factors that resulted in us picking flatbuffers over other formats like protobufs was that flatbuffers doesn't use varints. If they pop up in v2, perhaps it could be an opt-in? Or maybe an attribute applied to fields?
I agree with @AustinSchuh about compression; I think flatbuffers' niche is that by default they are very efficient at runtime, even if they trade off space for that efficiency. In our use we transport flatbuffers over the wire in http that's gzipped/deflated/brotli'd, and on disk we persist them squashed with zstandard, so encoding tricks I think wouldn't really buy us much, but would hurt in our application use.
Please-oh-please add size prefixing by default.
I would almost prefer an inversion of the current required/optional field status, having fields set to be required by default unless annotated to be optional.
2. Not sure what you mean. You would opt in to 8-bit offsets. Once you put that in your schema, that table will always use 8-bit, in all languages, and never change.
If that's an explicit opt-in, then it seems fine.
- I would almost prefer an inversion of the current required/optional field status, having fields set to be required by default unless annotated to be optional.
I absolutely agree with this. I would extend this a bit further to add proper optional scalar fields. Adding bool flags for whether a scalar is present or not, as we need to right now if 0 is a legitimate value, is quite a frustrating workaround, given non-scalars have this built-in.
We have optional scalar fields today. It just isn't plumbed up to be accessible by default to the user. The offset for the scalar fields in the vtable is 0 if they aren't populated. See `CheckField<>` and `GetField<>` (for how it handles defaults) for the gory implementation details. (I've got a patch which implements `has_` in C++ if there is upstream interest.)
Protobuf started out with required being the default and concluded that it was a bad idea. https://github.com/protocolbuffers/protobuf/issues/2497 is a small part of the discussion.
We have optional scalar fields today. It just isn't plumbed up to be accessible by default to the user.
If something isn't publicly available, it doesn't exist from the user perspective. We should expose such information.
Additionally, I have the impression we're reaching the classic dilemma of speed vs size. So the first question we need to answer is: which one do we favor? For me, FB was always about performance, so if we need to keep padding or sacrifice some internal improvements for the sake of being fast, I would personally stick with that. If some improvement doesn't impact performance, then let's consider it.
- Allow different size vtable offsets. [...] you can't undo it when your table grows.
Maybe a varint?
Would it be possible to have most (all?) of these features be specified with flags at the beginning of a payload? I know that that gets into "framing format" territory, but providing a set of feature flags at the beginning of a table (or at the beginning of an entire buffer) could allow a lot of flexibility at little cost.
It would probably just be an optional initial table at the beginning, containing metadata.
My 5 cents.
My personal view on the following:
- Construct the buffer forwards (rather than backwards, as all current implementations do). This simplifies a lot of code and buffer management. I really like it; at least for Swift, we would be able to handle things much more smoothly than in the current implementation.
@maxburke
I would almost prefer an inversion of the current required/optional field status, having fields set to be required by default unless annotated to be optional.
@krojew
I absolutely agree with this. I would extend this a bit further to add a proper optional scalar fields. Adding bool flags if a scalar is present or not, as we need to right now if 0 is a legitimate value, is a quite frustrating workaround, given non-scalars have this built-in.
This is an amazing idea, actually. It made me wonder whether we could actually remove vtables completely, since all the fields are going to be required, and optional ones, as mentioned above, can be represented with a bool. This would allow us to remove the vtable and use the generated table layout that's already predefined.
Example: when trying to encode the monster object, the following vtable bytes are produced:

```
4, 0, 0, 0, 0, 8, 0, 12, 0, 16, 20, 0, 24, 0, 28, 32, 36, 40, 44, 48, 0, 0, 0, 52, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 56, 0, 60, 64, 68, 0, 0, 0, 72, 0, 76, 0, 0, 80
```

-> Monster.name, offset.

When the same vtable would be written a second time, it could instead be removed and replaced with a reference to the previously encoded one (e.g. `245, 245, 250, 0`). Now all we have to do is relate it to the predefined table we have, where we can simply reference the number of the table instead.
Regarding the optional/required table field status mentioned by @maxburke: I think changing the current behaviour is a bad idea in regards to backward and forward compatibility. Marking fields as non-optional is a "lie" we make for API convenience. There are no guarantees that a field is present, as we have no control over who created the buffer, given your system is not a closed one. When we define a field as required, we commit to never being able to deprecate that field. We also can't mark a new field as required unless we can guarantee that no older clients, which don't know this field exists, will send buffers to newer clients which require this field to exist.
Speaking of forwards and backwards compatibility, enums are another blind spot, specifically the way enums are converted to JSON. I think in order to grant proper backwards and forwards compatibility, enums need to always be represented as numbers, even though text looks nicer in JSON; text breaks with new cases and also if a case has to be renamed.
I absolutely agree with this. I would extend this a bit further to add a proper optional scalar fields. Adding bool flags if a scalar is present or not, as we need to right now if 0 is a legitimate value, is a quite frustrating workaround, given non-scalars have this built-in.
There are two options for how it can be solved in the current FlatBuffers implementation, e.g. by forcing serialization of the default `0` (I think most libraries have code for that). I guess for the new version this kind of feature could be addressed in a more straightforward way, by being able to define a table scalar field as optional in the schema directly.
Switching to required by default does not break future compatibility any more than the current attribute. On the other hand, we gain possible optimizations and more intuitive schemas. Today some fields are truly optional and require annotating with an attribute to make them required, while others cannot be annotated as such, and it's up to the users to guess whether they're optional or not.
Switching to required by default does not break future compatibility any more than the current attribute.
There are two ways one can design the default behaviour:
Making fields optional by default is the safety-first approach. You can add the `required` keyword to the schema any time you decide it is good for you. Switching from required to optional, however, is potentially dangerous in the long run.
On the other hand we gain possible optimizations and a more intuitive schemas.
Optimizations from a technical (performance) perspective? I don't see them, but I am open to suggestions.
Today some fields are truly optional and require annotating with an attribute to make required, while other cannot be annotated as such and it's up to the users to guess if they're optional or not.
When in doubt, always check for `null`. Specifically if you use FlatBuffers for communication: if I know your server expects a required field, I can send you buffers with null and crash your servers. This is a super easy DDoS attack angle.
Making fields optional by default is the safety-first approach. You can add the `required` keyword to the schema any time you decide it is good for you. Switching from required to optional, however, is potentially dangerous in the long run.
This will be a bit anecdotal, but every time I have been introducing FB to new people, optional by default was quite surprising to them. Also, experience shows people tend to create schemas corresponding to their data model, so if something is required now, it gets marked as required. I believe switching to required by default will be the intuitive way to go. We'll never have complete safety or be future proof unless everything is optional always, which is not the goal.
Optimizations from a technical (performance) perspective? I don't see them, but I am open to suggestions.
Required things can always be stored inline. This is quite good for performance.
When in doubt, always check for `null`. Specifically if you use FlatBuffers for communication: if I know your server expects a required field, I can send you buffers with null and crash your servers. This is a super easy DDoS attack angle.
Scalar types don't have a notion of null. That's another thing I would love to see — take a look at `Option` in Rust. Also, verifying a buffer is a separate subject, so let's not mix it in.
Side note, is anyone working on a verifier for Rust, like in cfb?
This will be a bit anecdotal, but every time I have been introducing FB to new people, optional by default was quite surprising to them.
Ok, let's go with anecdotal :) In 2015 I worked on a city builder game where we stored user progress in FlatBuffers. The game was developed in Unity3D, but the town map itself was an isometric view, so a building position could be identified with a `Position` table which carries `x` and `y` as grid coordinates. In 2016 game designers introduced hills on the map, so a `Position` carrying just `x` and `y` was no longer sufficient. We had to introduce a `z` field. We introduced it and everything worked perfectly smoothly. Imagine we had introduced `z` as a required field. What would happen? Probably nothing in development, as in development you mostly start from a fresh state, or have an admin tool to populate the city automatically. But when we shipped the change, all the new versions of the game would crash on start. Why? Because the stored game state has a `Position` without the `z` field, and the `z` field is now required. This would be a small disaster, as it was a mobile game, and deploying a fix on iOS can take days. So none of the existing players can play the game for days; you get a short-term impact of revenues going down, and a possible long-term impact of people installing another game of the same genre and abandoning your game altogether.
What I am trying to visualise with this colourful example is that with required being the default, the evolution of a schema becomes a minefield, which only people with experience will be able to avoid. To be honest with you, when we switched from `Position(x, y)` to `Position(x, y, z)` we would have stepped on that mine as well. It was our luck that we were protected by the sensible default behaviour of FlatBuffers. Because you are absolutely correct:
... people tend to create schemas corresponding to their data model, so if something is required now, it gets marked as required
Also regarding this:
We'll never have complete safety or be future proof unless everything is optional always, which is not the goal.
I am not sure why removing required altogether is not the goal? Protocol Buffers version 3 did it, and I personally would vote for doing it in FlatBuffers too. :)
Anyways, I think I wrote enough. The decision lies with @aardappel.
@mzaks I think there's a misunderstanding somewhere, because your example is quite wrong in this case. First of all, adding a new field to a table (required or not) will not break existing code (assuming the usual schema evolution guidelines). Second, there's domain data known to be required, and that gets marked as required, which is perfectly ok. Arguing that everything should be optional is over-defensive and will only lead to the frustration of not being able to express the data model properly in messages (with the additional burden of fact-checking every piece of data). Third, saying that someone did something is not a valid technical argument :)
That being said, my opinion on optional data for a future protocol, if it ever happens, is:
I agree strongly with @rw that using a format specifier as part of the header would definitely be a great way to evolve the format.
Lots of great comments on this thread, here's my thoughts to add to the mix:
The current implementation constructs these buffers backwards, since that significantly reduces the amount of bookkeeping and simplifies the construction API.
Is that no longer the case?
- Support for indexed tables, where a flatbuffer table is annotated with "key" and "value" annotations. This sounded like a niche use case, but I believe it's an important one and likely to be occupied by something else if not addressed.
I think 0-terminated strings are not an issue. Removing C-style string getters would solve it.
- Remove all padding.
It is possible if `bit_cast<T>` is used in C++ wherever scalars are accessed. It would put the C++ implementation on an equal footing with other languages that can't use reinterpret_cast.
It would be better to assume that alignment (padding) of a field may be a random number in the range [0..N] (not a power of 2). Internal implementation should not depend on a user-defined alignment that can be specified for every field in a schema.
- Allow different size vtable offsets.
- Allow 8 and 16 bit size fields on strings and vectors, currently they're always 32.
For vtable offsets and types, ASN.1 or variable-length encoding can be used. It may be a good idea to make the length of an fbs message aligned to 4 or 8, to pre-fetch data without extra checking. This is a simple 1:2, 1:4, or 1:8 decoder:

```cpp
auto x = load_uint32(bytes); // it can be uint16 or uint64
auto offset = (((0u - (x & 1u)) & x) | (x & 0xFFu)) >> 1u; // [0x00-0x7F] or [0-0x7FFFFFFF]
bool is_1 = !(x & 1u);
```
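A runnable version of that decoder, under my own encoding assumptions (the low bit of the leading byte selects 1-byte vs 4-byte width; the buffer is assumed padded so the 4-byte load never runs past the end):

```cpp
#include <cstdint>
#include <cstring>

// Sketch (function name is mine): decode a 1:4 variable-width offset.
// Leading bit 0 -> a one-byte value in [0, 0x7F];
// leading bit 1 -> a four-byte little-endian value in [0, 0x7FFFFFFF].
// Four bytes are always loaded, avoiding a branch on the width.
uint32_t decode_1to4(const uint8_t* bytes, bool* was_one_byte) {
  uint32_t x;
  std::memcpy(&x, bytes, sizeof(x));  // unaligned-safe load
  *was_one_byte = !(x & 1u);
  return (((0u - (x & 1u)) & x) | (x & 0xFFu)) >> 1u;
}
```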
FlexBuffers: a fast FlexBuffer with writing to pre-allocated memory (without any memory allocations) can be useful as a fast logging core.
@mikkelfj not sure why you refer to some of these as "slow": like I said, these are all intended to be codegen-ed (or become a template/macro arg) so will all be maximally fast, certainly no slower than any existing feature. None of these features are intended to have dynamic checks.
Not sure what you mean by strong JSON integration. JSON is text and not random access, so serves an entirely different use case than FlexBuffers. I am not sure what it would even mean to use JSON in this case, other than to store a string.
A new format could mean a new way to do `file_identifier`; I'd be open to that. It could be variable length.
"drop nested tables" .. what does that mean?
@AustinSchuh forward building would be a big format change, since now children typically come before parents, and unsigned child offsets would be flipped to always point downwards in memory instead. Retaining the feature that these offsets always point one way and never can form a cycle would be good, I think.
A lot of people use FlatBuffers in cases where sparse/random access, or use with `mmap`, are important, and those don't work with an external compression pass. I personally think there's a lot of value in storing things more compactly in memory that is directly usable. Or at least, that is what FlatBuffers specializes in. None of these more compressed representations should be any slower (in fact, faster) than the current representation; they mostly just complicate codegen.
@maxburke the varints would be very much optional, as they are definitely slower to read. They would be a type, so you'd explicitly write either `int` (as right now) or `varint` for a field.
See above about compression and efficiency.
Default required would be problematic for an evolving format. Protobuf came to the opposite conclusion after many years of experience, and FlatBuffers went along with that.
@AustinSchuh
The offset for the scalar fields in the vtable is 0 if they aren't populated
Yes, and that means that the value is equal to the default, not that the value is not present. So can't be used to test for this purpose.
I agree that not being able to differentiate this has been something many users would have wanted. On the other hand being able to access scalars "blindly" without checking for presence, and the storage savings from defaults is not something I would want to miss.
@rw
Maybe a varint?
That would make the distinction dynamic, and costly. Besides, we need to index into this table, which makes varints useless unless all the same size. The idea is that this is a static, codegen feature, associated with a particular type of table.
Would it be possible to have most (all?) of these features be specified with flags at the beginning of a payload?
No, for the same reason. It's extremely important FlatBuffers stays fast, so can't rely on dynamic checks for different encodings. Besides, the idea is to specify them per type, not per buffer.
@mzaks Unaligned accesses will crash on ARM if the C/C++ (or Swift) code was generated assuming alignment, since unlike x86 they are different instructions. But it is pretty easy to portably generate unaligned instructions (from C/C++ at least).
I wouldn't want to add more branching to vtable access, so this can't be a dynamic flag.
I already suggested being able to specify the type of the size field of a string, so varint would be a great option there :) It could even be the default, given that 99% of strings are probably < 128 bytes, and reading a 1 byte LEB is very fast.
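To illustrate the cost, a LEB-style length read could look like this (the helper name is mine, not an existing FlatBuffers API): for the common case of strings under 128 bytes it touches a single byte and never loops.

```cpp
#include <cstddef>
#include <cstdint>

// Sketch: LEB128-style string length. Lengths < 128 cost one byte with
// the continuation bit clear; longer lengths spill into extra bytes,
// 7 value bits per byte, least-significant group first.
uint32_t read_leb_len(const uint8_t* p, size_t* consumed) {
  uint32_t len = 0;
  int shift = 0;
  size_t n = 0;
  uint8_t b;
  do {
    b = p[n++];
    len |= uint32_t(b & 0x7Fu) << shift;
    shift += 7;
  } while (b & 0x80u);
  *consumed = n;
  return len;
}
```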
We can allow cycles if someone invents a verifier that can deal with cycles and uses no memory :) I personally do not care for cycles but that is a discussion we can have if it ever comes to this.
I guess if the imaginary time comes when we get to make a V2 of FlatBuffers, also revving FlexBuffers at the same time would make a lot of sense. I haven't thought too much about what I would change there.
@mustiikhalil if you wanted to remove vtables and make everything required, then your ideal format already exists: it's called Cap'n Proto! ;)
I recommend we do not go too far into the "required vs optional" default discussion, because in the end this is an API/schema feature, not a binary format feature, assuming you want to keep vtables.
@stewartmiles
Edit: added blank numbers, since github markdown renumbers these to be contiguous if not present??
@adsharma We already have this? You can annotate a field as `key` (and the other fields are effectively the value) and then use it with binary search lookups if you place these in a vector.
@vglavnyy `std::bit_cast` takes a `const T&`, which you cannot legally construct an unaligned version of, so it doesn't seem that useful? To be fully correct, we'd have to go via `std::memcpy`, which all compilers luckily can optimize into a single memory load.
@rw
Would it be possible to have most (all?) of these features be specified with flags at the beginning of a payload?
No, for the same reason. It's extremely important FlatBuffers stays fast, so can't rely on dynamic checks for different encodings. Besides, the idea is to specify them per type, not per buffer.
I still think this is worth looking into more. The reason is that this could open up a FlatBuffers ecosystem, beyond just one new FlatBuffers version. Maybe there's a way to have a compile-time fast path for AoT feature flag specifications, alongside a slightly-dynamic mode of operation, too.
@AustinSchuh
The offset for the scalar fields in the vtable is 0 if they aren't populated
Yes, and that means that the value is equal to the default, not that the value is not present. So can't be used to test for this purpose.
I agree that not being able to differentiate this has been something many users would have wanted. On the other hand being able to access scalars "blindly" without checking for presence, and the storage savings from defaults is not something I would want to miss.
I forgot to mention, we run with `ForceDefaults(true)` everywhere. The presence of that option, support for per-field metadata in the schema, and zero-copy reads were the reasons we chose FlatBuffers. Once you do that, `has` works as expected. There is also nothing saying that `has` and returning the default when unset are exclusive. FlatBuffers today returns the default if the field isn't populated.
@aardappel
We can allow cycles if someone invents a verifier that can deal with cycles and uses no memory
First thing would be to enhance flatc to detect recursive table definitions. I should be able to do that; I've already done it for my code generators. The verification and table build API can stay as is if there are no recursions in the definition. If a cycle is detected, flatc can check if a special attribute (e.g. `allow_cycles`) is present in the schema. If it is not present, the generated code stays the same as if there were no cycles. Verification is still fast; building a buffer with the standard API does not support building a cyclic graph anyways. But if the user explicitly marked the schema with `allow_cycles`, then flatc produces an additional builder API for a second-pass building step. This implies that we can add a dummy table reference in the first pass and replace it with the actual table reference in the second pass.
Verification can be done in two steps as well. In the first pass we do the fast and established verification. However, if a reference value points backwards and the user enabled `allow_cycles`, we keep the backwards references in memory and check in the second pass that they point to actual tables.
This would follow the premise: keep fast things fast, and slow things possible, if necessary.
@adsharma We already have this? You can annotate a field as `key` (and the other fields are effectively the value) and then use it with binary search lookups if you place these in a vector.
I don't think the current code is good enough to support a database with multiple key fields and being able to construct a valid flatbuffer with zero copy from a key+value read from a store.
Previous work from 2016:
https://drive.google.com/open?id=0BwPRLHxpLD-adVM2UVNFMVFCUm8 https://github.com/adsharma/flatbuffers/commits/byteorder2
it would be awesome to have more scalar types available:
- `uint128`: used e.g. in UUIDs
- `uint160`: used e.g. in Ethereum for blockchain addresses
- `uint256`: used e.g. in Ethereum for all numbers
- `decimal128`: decimal floating point (https://en.wikipedia.org/wiki/Decimal128_floating-point_format), used e.g. in accounting/databases

it would be awesome to have more scalar types available:
This can be done with fixed length arrays in the current format. However, a typedef in the schema would be helpful.
I think having a uuid as a built-in type would be useful. It's so popular, I think having it, even as a sugar for an array, will benefit users.
That's what a typedef will do for you. Meanwhile it can be wrapped in a struct.
This can be done with fixed length arrays in the current format.
can fixed length array types be used the same as built-in scalars (say `uint64`)? would be great, because if so, yes, a built-in typedef would fully solve this ..
another source of difference might be the binding code generators producing different bits for fixed size arrays vs scalars?
can fixed length array types be used the same as built-in scalars (say `uint64`)? would be great, because if so, yes, a built-in typedef would fully solve this ..
No, but they can be used for things like UUIDs or crypto values. A typedef solution would be able to define uint64 to mean something else and preserve the underlying int type. Regardless, this has nothing to do with FB2; it can easily be added to current FB if so desired. You could also have a struct with just a uint64 member to simulate a typedef, but that isn't very convenient.
@AustinSchuh but force defaults potentially wastes a lot of space, so I don't think it's a particularly good solution. In particular, calls like `Create`, which serialize all fields, assume default values are not serialized.
@mzaks yes, as opt-in with some kind of `allow_cycles` it could work, if there's enough demand for such a thing. We'd need signed offsets in general then, though, or generate code to interpret them as signed with `allow_cycles` on.. that will complicate the library code a lot.
@adsharma multiple keys would also require multiple sort orders to be efficient, or indices separate from the vector containing the values.. the current solution is simple because it works almost transparently on existing vectors.
If you wanted multiple sort orders, you could do this today by storing multiple vectors. But yeah, we'd need some schema support to indicate which key is used with which vector.
@oberstet @mikkelfj @krojew
I'd say a `struct` makes for a pretty good typedef already: `struct UUID { v:[uint:4]; }`
We could consider a native `typedef`, but I am not sure what it would add. The problem with `typedef` in C/C++ is that you typically don't want to allow arbitrary mixing of the new type with the type it's based on, which is exactly what `struct` already prevents.
Of course we have no way to `typedef` strings/vectors currently, so who knows, maybe that would be useful.
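For concreteness, the struct-as-typedef pattern in schema form (the `Account` table and its fields are a hypothetical example, not from any real schema):

```
// Hypothetical .fbs schema: wrapping the fixed-size array in a struct
// gives an opaque, typedef-like type that can't be accidentally mixed
// with plain [uint:4] arrays or other 16-byte values.
struct UUID { v:[uint:4]; }

table Account {
  id: UUID;       // stored inline, 16 bytes
  name: string;
}
```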
If I were to use flatbuffers instead of SQL DDL to represent a database, we would have one flatbuffer for the primary key + data and one flatbuffer for each index, where the key is a sequence of fields making up the secondary fields and the value is the primary key (typically an id).
This is a powerful use case for copy elimination on fast devices such as NVMe or even in-memory.
FlatBuffer's binary format has been set in stone for 6.5 years now, because we value binary forwards/backwards compatibility highly, and because we have a large investment in 15? or so language implementations / parsers etc. that would not be easy to redo.
So "V2" in the sense of a new format that breaks backwards compatibility may never happen. But there is definitely a list of issues with the existing format that if a new format were to ever happen, would be nice to address. I realized I never made such a list. It would be nice to at least fantasize what such a format could look like :)
Please comment what you would like to see. Note that this list is purely for things that would change the binary encoding, or larger additions to the binary encoding. Anything that can be solved with code / new APIs outside of the binary format does not belong on this list.
[...] `string_view` recently, and both have been using `size_t` arguments for a long time rather than relying on `strlen`. Other languages don't use it. For passing to super-old C APIs that expect 0-termination, either swap the terminating byte temporarily while passing that string, or copy.
[...] `"a"` goes from 12 bytes (2 vtable + 4 offset + 4 size + 1 string + 1 terminator) to 3 bytes (1 vtable + 1 size + 1 string). Of course this is very inflexible and special purpose, but it gives users more options for compact data storage. Again, like all format variation above, this comes at no runtime cost, just some codegen complexity.
[...] a `varint` type for fields. They could be added to the existing format but make a lot more sense in a system with no alignment.

@rw @mikkelfj @vglavnyy @mzaks @mustiikhalil @dnfield @dbaileychess @lu-wang-g @stewartmiles @alexames @paulovap @AustinSchuh @maxburke @svenk177 @jean-airoldie @krojew @iceb0y @evanw @KageKirin @llchan @schoetbi @evolutional