google / flatbuffers

FlatBuffers: Memory Efficient Serialization Library
https://flatbuffers.dev/
Apache License 2.0
23.4k stars 3.25k forks source link

Feature: Increased support for defaults in tables #6053

Open CasperN opened 4 years ago

CasperN commented 4 years ago

Motivation

This feature requests schema-specifiable defaults for all fields in tables. After this feature and #6014 are implemented, schema writers will have full control over the semantics of non-presence for all fields in tables :) Table fields may be optional (and non presence maps to None) or not (and non-presence maps to some default value).

As was the case with #6014, we should only do this if we have sufficient buy-in to get all major languages to add these features in a small time window, which means we won't get around to implementing this until after we're done with #6014. But there's no harm discussing this now.

High Level

Schema writers may specify:

Generated readers must return the specified default if a field is non-present. Generated builders must leave the field non-present if asked to write a default value. The generated type must not be optional.

Example Schema:

https://github.com/google/flatbuffers/blob/master/tests/more_defaults.fbs

enum ABC: int { A, B, C }

table MoreDefaults {
  ints: [int] = [];
  floats: [float] = [     ];
  empty_string: string = "";
  some_string: string = "some";
  abcs: [ABC] = [];
  bools: [bool] = [];
}

When reading this table

TODO

task owner done
Parser support @caspern #6421
Rust support @caspern #6421
C++ support @vglavnyy ?
Java support @paulovap ?
kotlin support @paulovap ?
Swift support @mustiikhalil #6461
lobster support @aardappel ?
C support @mikkelfj ?
C# support @dbaileychess ?
TS/JS support @krojew ?
Documentation updates @CasperN
CasperN commented 4 years ago

@mikkelfj, I think you were very interested in a feature like this?

mustiikhalil commented 4 years ago

@CasperN this is super interesting! I wonder how would this look like,

table M {
meaning_of_life: int = 42;
}
table Root {
main_table: M = default; // what should we allow the user to adjust here?!
}

because from my point of view, the default tables shouldn't be encoded with the buffer, so the default values provided in the actual schema should be taken instead. var mainTable: M { let offset = table.getOffset(); return offset == 0 ? MainTable(ZeroTable()) : table.getM() }

Strings can simply be created in place, however structs are going to be a bit tricky in swift since the create function and the read struct are a bit different. so we might need to pass some type of flag that indicates if the object is created by the user or it's a default: var lan: Double { return is_default ? 49.9 : struct.readLan() }

mikkelfj commented 4 years ago

@CasperN yes I am, but I'm also bit swamped in projects. Maybe use JSON for defaults rather than hacking up a schema format.

You may also be interested in: https://github.com/dvidelabs/flatcc/blob/master/doc/binary-format.md#mixins

CasperN commented 4 years ago

@mustiikhalil,

because from my point of view, the default tables shouldn't be encoded with the buffer, so the default values provided in the actual schema should be taken instead.

I agree. If the root is non-present, then the returned object acts like an empty table. In your example, if root is an empty flatbuffer, root.main_table().meaning_of_life() == 42.

structs are going to be a bit tricky in swift since the create function and the read struct are a bit different

Yea, I agree. Figuring how to not write default structs will probably be annoying if they're read and written with different types.

@mikkelfj,

Maybe use JSON for defaults rather than hacking up a schema format.

Good idea. Using JSON to specify default structs seems to make sense. Our parser already can understand JSON. Also, #6014 uses null which is a json keyword. Perhaps then, default empty vectors and empty tables can be specified with = [] and = {} respectively.

You may also be interested in: https://github.com/dvidelabs/flatcc/blob/master/doc/binary-format.md#mixins

Interesting, I think mixins are like associated constants which is a bit different from "a default interpretation of non-presence."

mustiikhalil commented 4 years ago

@mikkelfj but don't forget that most languages don't have their own JSON parser to parse a schema, and it would be more convenient if we stick to the flatbuffer objects that we already generated.

CasperN commented 4 years ago

don't forget that most languages don't have their own JSON parser to parse a schema, and it would be more convenient if we stick to the flatbuffer objects that we already generated.

I think JSON-style-default-structs need only be seen in C++, in the Parser's FieldDef. The code generators will translate the json into a native struct (or err trying) so the given language won't need a JSON parser.

mikkelfj commented 4 years ago

@mustiikhalil as @CasperN mentions, it is only needed in the C++ (and C) schema parser. It is actually worse in C because C's JSON support is code generated for a schema, but it is fairly easy to parse JSON slow and memory in-efficient. In fact one of the selling points of FlatBuffers is that you can have human friendly JSON as input.

Note that the C++ schema parser already allows for JSON objects following the schema for data encoding. I disgree with this and think it is mixing two separate concerns. The C flatcc compiler does not support these JSON objects. However, a JSON initializer is different.

mikkelfj commented 4 years ago

As to empty table / vector default syntax using = {} / [], that quickly becomes cumbersome. I'd prefer if they by default were default, but could revert to old behaviour with a command line switch, or alternatively using the null keyword, or a really_null notation.

aardappel commented 4 years ago

@CasperN generally agree that it would be nice to have this functionality, but even moreso than #6014 this is "expensive" in that it needs to be supported in all languages, and its quite a few different kinds of default structures to support. So personally, I'll be looking towards the progress in #6014 to see if this has a chance of succeeding, and I don't think we should start implementing this one until #6014 has reached.. lets say.. half the languages at least? :)

A concern for C++ (and maybe C and other low level languages) is that so far, any pointer returned from the FlatBuffers API points inside of the buffer, to data owned by the buffer, which will now be broken. For example, the C++ verifier will need to be adapted to understand this, and maybe other code. Not sure how the mutation API will deal with this feature, as the same API call can now return shared, static objects. @vglavnyy @mikkelfj

If we're going to have defaults, in particular for strings, then ideally we should also detect when value are equal to the default value and not serialize them, like we do for scalars. This could be an expensive feature however, and again interact badly with code that expects to be able to mutate in place.

I would say this feature should be limited to fields, i.e. places where we can specify defaults in the schema. Thus we can't have a default value root. Which is probably good.

@mikkelfj I don't agree that we want to move to the default for (reference) defaults being anything other than null. As I am trying to describe above, these default values are not necessarily that natural, so should be opted into carefully, not wholesale.

Of all the proposed possible defaults, I'd say that string defaults are very useful, empty vector default somewhat useful, empty table, union, and populated struct not that useful compared to their complexity. So I'd personally be ok not implementing it for all types.

mikkelfj commented 4 years ago

I agree that we should not rush into implementing this feature, but it is worth discussing.

I share the concern about pointer of buffer, and i have thought a bit about it. There should be a few known globals that are easy to check against.

As to not writing tables and strings that match a default value, I think that is too expensive, especially for nested tables, possibly not even fully resolvable when you think about shared objects.

Not doing default removal should also remove some concerns about returning defaults by default.

Populated structs and tables could be handled if they are removed when writing, but they are more useful on conjunction with mix-ins because it allows you to create complex worlds fairly easily.

vglavnyy commented 4 years ago

I think it will be hard to implement default tables (vectors of tables with vectors of tables). It will conflict with mutable API and will complicate JSON parsing/generation. As an alternative, it is possible to use a generated object-API for default tables. For example, I use a custom smart pointer that creates a default table if it is null. String defaults in schema declaration are interesting for using with an obj-API or with a reflection API. For example, the mini-reflection API would return an accessor to declared attributes.

aardappel commented 4 years ago

@mikkelfj maybe it is expensive in the general case, but I think omitting values that are equal to default should at least be possible for implementations that want to do so, especially for strings. Doing a string compare against a default value is not that expensive, since most such compares will fail within the first few bytes (or on the length), and it only has to be done for fields that have a string default specified (which has to specifically be opted into).

What I don't want is such a (space) optimisation to be impossible, because someone is deriving additional information from the string being present inside the buffer or not :)

I'd be fine with implementations choosing not to do such comparisons for tables/structs/non-empty vectors.. in fact I am not sure we even should have such default values in a first iteration of this feature, as @vglavnyy says.

@vglavnyy creating an empty table on the fly makes sense in the object API maybe, but not in the base API?

I'd prefer to only have default values for which we can create fully static POD data to refer to in C/C++.

aardappel commented 4 years ago

Also a note on languages like Java (and maybe Go, C# etc) where a string accessor either returns a new string object, or a ByteBuffer that refers to the original UTF-8 string data. When a string has a default, that second kind of accessor would now have to create (or cache somehow) a new underlying byte array, which may be less efficient or use more memory than intended, and complicates our code generation.

So yes, not a trivial feature :)

CasperN commented 4 years ago

I think it will be hard to implement default tables (vectors of tables with vectors of tables).

Correct me if I'm wrong, but can't a pointer into a static array of zeros (inside the normal accessor) be used as empty vectors and tables? When you try to deref that as an offset, you won't go anywhere. When you look for a length, it'll be zero. You can't mutate table fields or vector elements because its "empty".

mikkelfj commented 4 years ago

@CasperN A static array of (just a few) zeroes is trivial in C/C++, but not necessarily in other languages. In C/C++ you have the potential for conflicts when linking separate projects, or you have risk having separate arrays that are somewhat unique, but not entirely. I'm not against this at all, I am just saying that there are some concerns. In Javascript you have object identity, separate from equivalence, so you need a global unique object, but it shouldn't be too hard to achieve, although, as for C/C++, you might run into multi-project conflicts.

@aardappel I did indeed think a bit about the string case and I agree that sometimes you may want to omit default strings, perhaps even structs when they are properly zero-padded, but it should be an option. If you carefully avoid writing defaults, then doing these checks can be significantly slower, even if not "that slow", because if branching, function calls, missed optimisation opportunities etc.. Maybe a check for zero length would be OK, but more complex defaults would require opt-in.

Especially when parsing JSON, default removal is relevant, even for tables, where you can assume there are no shared objects. @vglavnyy It is not necessarily (more) difficult to handle this in JSON. In C code generator, there are already fast paths, fallbacks, and checks for scalars - and it is a fair bit ugly in the code generator. But you are right in the sense that it is more work because the parser implements is own builder logic on top of a generic typeless builder library.

But, if you start going there, it is only a short step towards deduplication in general, so the question is, how much overhead and complexity do you want to spend without going all the way? Of course deduplication is not trivial for tables and vectors. In C, the clone operation provides a custom hash table to manage lifetimes, and I think C++ has something similar in the object API which I am not familiar with.

CasperN commented 4 years ago

@mikkelfj

A static array of (just a few) zeroes is trivial in C/C++, but not necessarily in other languages. In C/C++ you have the potential for conflicts when linking separate projects, or you have risk having separate arrays that are somewhat unique, but not entire

Not to go too deep "into the weeds" but I'm pretty sure every language has some notion of a module level global constant. For C++, we can put it in flatbuffers.h. I'm not concerned about uniqueness of the zeros array, its probably 8 bytes or less. Compilers duplicate far more than that when inlining functions. Also, we (probably?) shouldn't be using pointer equality anyway, but that's another discussion.

As to not writing tables and strings that match a default value, I think that is too expensive, especially for nested tables, possibly not even fully resolvable when you think about shared objects... deduplication is not trivial for tables and vectors.

I think these concerns are ameliorated if we stuck to only allowing empty vectors and tables as defaults.

That doesn't seem too difficult to me.

but more complex defaults would require opt-in.

My 2c is that default-deduplication should be opt-out and we shouldn't implement the option until someone asks for it.

@aardappel

Also a note on languages like Java (and maybe Go, C# etc) where a string accessor either returns a new string object, or a ByteBuffer that refers to the original UTF-8 string data. When a string has a default, that second kind of accessor would now have to create (or cache somehow) a new underlying byte array, which may be less efficient or use more memory than intended...

Is it so hard to get an immutable or copy-on-write pointer to a static string in those languages? 😨

and complicates our code generation.

Yea, can't argue with that. I'm asking for a branch everytime codegen makes a field to do different behavior based on if the field is allowed to be null.

aardappel commented 4 years ago

@CasperN

I think these concerns are ameliorated if we stuck to only allowing empty vectors and tables as defaults.

I don't see the usefulness of "empty tables", frankly.

Is it so hard to get an immutable or copy-on-write pointer to a static string in those languages?

Yes, they have no such concept.

CasperN commented 4 years ago

Another reason against default tables is recursion. If we had default tables, then x.infinity().infinity().infinity().infinity() etc would return a value... which is weird.

This is already a thing with the (required) annotation. table Infinity { i: Infinity (required); } does compile by flatc but can't be constructed and will lead to a stack overflow or something if unpacked with the object API.

Okay, it seems like the only directions are

aardappel commented 4 years ago

Yup, would agree it be best to start with those 2. Maybe structs.

aardappel commented 4 years ago

We could consider a design where a default value must be stored once in the buffer, if used. That of course would make buffers bigger, but it would greatly simplify the implementation of this feature for all these languages that don't have pointers, and where even an empty vector would require another ByteBuffer to be allocated to work with the current accessor API.

In fact, if we store them once in the buffer, then reader APIs for all these languages will work unmodified (!) Writer APIs will complicate, because now we must track when a default was used, and worse, we'd have to do something sneaky, and possibly serialize these default values in between fields (which the current writer API forbids), and which makes vtables less likely shared.

So maybe not that great, but it is an idea.

Another thought: if a non-scalar type has a default, then it is essentially "required" not in the sense that the field must be set, but in the sense that the field will never be null, so the accessor should be of a non-null type if the language supports it.

CasperN commented 4 years ago

We could consider a design where a default value must be stored once in the buffer, if used

Is this essentially making the field as required and ForceDefault? That does have benefits for the mutation API.

if a non-scalar type has a default, then it is essentially "required" not in the sense that the field must be set, but in the sense that the field will never be null, so the accessor should be of a non-null type if the language supports it.

I agree that the accessor shouldn't be optionally typed.

Imo, this should be handled on the reader's side with "non-presence => default", mostly because I'm not a huge fan required fields and "non presence => crash" semantics.

aardappel commented 4 years ago

No, I was suggesting to store the value just once, so that wouldn't work well with the mutation api.

CasperN commented 4 years ago

So I recently found out that in some languages (e.g. swift), non-present vectors already map to empty vectors. I think we'll just have to live with this inconsistency, because making things consistent is backwards incompatible. The only remaining direction seems to be default values for strings.

aardappel commented 4 years ago

Yes, the intended semantics for FlatBuffers is that there is a difference between empty and null, but it is really up to each language API to implement that. If that is not possible of very inconvenient/inefficient, so be it.

CasperN commented 4 years ago

One way to deal with this in a backwards compatible, but more complex, way is to have three options:

(this feels like the type of compromise that will make everyone unhappy)

aardappel commented 4 years ago

The simpler way is for languages that model an empty vector have an additional method to determine if the thing is really present or not. The fact that the API ignores non-presence is a matter of convenience, but in some use cases it may be necessary to know the difference.

And I really don't want to formally endorse "it depends on the language" at the schema level.

CasperN commented 4 years ago

Yup, would agree it be best to start with those 2. Maybe structs.

Let's skip that one for now.


I think #6014 is advanced enough that we can have some hope of doing this thing. The current proposal is:

The behavior for these default-y fields matches that of scalars:

For Java, Swift and co who already made non-present and empty vectors look very similar, please ensure there's an IsPresent option to distinguish presence for the existing optional vectors (or negotiate a backwards incompatible change of types with your users).

Can I get a 👍 from the various participants before any work is started in this direction? @aardappel @mikkelfj @mustiikhalil @vglavnyy

mustiikhalil commented 4 years ago

@CasperN so that would mean that strings, and arrays would always return their default type, unless we specify that the field is optional. In swift, we already have optional strings, I just need to return the default instead of nil. Regarding vectors, we can always default them to an empty vector, and add the isPresent: Bool field in general. However, should we modify the count to return nil? example myVectorCount: Int32? { if !seen { return nil }; return x } // defaults to zero if the vector is not found

The string change won't effect the users of the swift library, but the array might change a little bit.

CasperN commented 4 years ago

so that would mean that strings, and arrays would always return their default type, unless we specify that the field is optional

No, if unspecified they will follow the existing behavior of being optional. This is unfortunately inconsistent with scalars which are default-y unless otherwise specified, but such is life.

In swift, we already have optional strings, I just need to return the default instead of nil

Yes, and presumably the return type should be String instead of String? since nil will not be used

Regarding vectors, we can always default them to an empty vector, and add the isPresent: Bool field in general. However, should we modify the count to return nil?

I wouldn't change it, but that's up to you.

aardappel commented 4 years ago

I would still suggest to push the scalar optionals to more languages before starting this (and get rid of the 2 schema), but generally I am ok with this going forward as @CasperN specified.

To be very clear, any currently existing non-scalar with no default value is as if you had written = null. In addition, we have (required) which says that null will never be returned, and some languages bake this into their API, but this doesn't affect the default value, which is still null.

A string/vector with a (as proposed) non-null default is NOT (required) (since it may indeed be not present on the wire). This is counter-intuitive, but will work best with the way required is interpreted currently, for example in the verifier. A language may choose however to generate a non-null accessor if the field either has a default OR is required. Specifying (required) on a field that has a default should probably be an error in the schema compiler, as it doesn't help any and may lead people to believe it does something different from merely specifying the default.

This is all indeed different from scalars, and one thing we should definitely do once both of these features are in, is to VERY clearly document this, in Schema.md and Tutorial.md

Their native types should be non-optional

not sure what that means, needs more context.

CasperN commented 4 years ago

Their native types should be non-optional

not sure what that means, needs more context.

Sure, let me use Rust as an example. The schema table Monster { name: string = "bob"; } should generate this accessor:

impl Monster<'a> {
  // accessor for default-y strings (and also required strings).
  // Note the type is not `Option<_>`.
  fn name(&self) -> &'a str;  
  // As opposed to the accessor type for optional strings:
  // fn name(&self) -> Option<&'a str>;
}

A string/vector with a (as proposed) non-null default is NOT (required) ... Specifying (required) on a field that has a default should probably be an error in the schema compiler.

Yep, I totally agree. As proposed, there are 3 mutually exclusive responses to non-presence of a value on the wire:

Minor note: Given this definition, it actually makes sense to enable required scalars, but I'm unilaterally declaring that out of scope for this issue.

aardappel commented 4 years ago

Required: non-presence causes an error (hopefully detected in the verifier, and not via segfault).

Actually, non-presence needs to be prevented by the writer, as most readers do not have verifiers. Java for example has no concept of a non-null reference, and will just return the same nullable type for all. If a required value was not written it will just cause a null pointer exception down the line. Even though we have no way of enforcing this on the reader side, we want to promise that any field that is required can be read without checking for null and will not cause a null pointer exception.

CasperN commented 3 years ago

Hi all, I updated the Parser and did the Rust implementation for reference. This is now ready for codegen changes in other languages!

I updated the top comment for tracking assuming the same people will do the same languages as in #6014, but please comment if you'd rather do a different language, need help, etc. Thank you everyone!

mustiikhalil commented 3 years ago

This should land on swift with the following PR #6461. In swift we limited the default arrays to scalars (excluding enums)

CasperN commented 3 years ago

@aardappel @krojew @mikkelfj @vglavnyy @dbaileychess Are we still interested in supporting this in Lobster, JS/TS, C, C++, Csharp, etc? Feel free to ask someone else to take over if you're too busy -- though I probably can only help with C++

mikkelfj commented 3 years ago

@CasperN I have no idea what the considered feature set is now, and what has been implemented in Rust?

CasperN commented 3 years ago

@mikkelfj I updated the original comment with an example schema and some bullet points. Does that help clarify the feature set to you?

mikkelfj commented 3 years ago

OK. I cannot commit to this short term, but seems reasonable. However, syntax for long strings might be needed. Linebreaks, continuatios. Also, structs are desireable. I'm not sure if flatc supports fixed length (zero-padded) strings in structs, but flatcc does. This can be very useful for things like device configuration.

aardappel commented 3 years ago

@CasperN yes, we eventually want this supported in all languages!

@vglavnyy can you do the C++ version? If not, Caspar could do it. @bjornharrtell can maybe help out with JS/TS if @krojew is too busy? Hoping @dbaileychess can do C# I will do Lobster of course, but there's no hurry there.

vglavnyy commented 3 years ago

I can, but only in September or October.

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 6 months with no activity. Please comment or this will be closed in 14 days.

mjmahone commented 1 week ago

Is there any progress on C++ support? Especially for empty default vectors?

I noticed I can just add | IDLOptions::kCpp; to https://github.com/google/flatbuffers/blob/master/src/idl_parser.cpp#L2697 and the code that gets generated appears to be correct, at least for empty default vectors. I'm not sure where I'll run into issues, but I'm curious what's still missing?