Feature RFC: Flatbuffers support for Optional types.

CasperN commented 4 years ago

Note: This top comment is heavily edited. It was kept up with the current state until this issue closed.

Motivation

Flatbuffers should allow users to choose to use optional types. There has been some interest in distinguishing between default values (which are not stored in the binary) and some notion of None which the user controls.

Here are some links to previous interest in this idea:

Currently, a user can control a field's presence in the binary by specifying "force_defaults" and checking "IsFieldPresent" which is a bit of a hack. This proposal should define proper Flatbuffers optional types, which should be a better way of doing this. Use of this feature is only advisable for new fields, since changing default values is in general backwards-incompatible.

How do we represent this in the schema file?

We will specify it like so

table Monster { mana: int = null; }

This visually implies that optional types are at odds with default values and is "consistent" since the value to the right of the equals sign is what we interpret non-presence to mean.

Change to Schema Grammar:

field_decl = ident : type [ = (scalar | null) ] metadata ;

~We can add a field tag, e.g. "optional" or "no_default", that triggers this behavior. Hopefully no one is using those tags. Maybe we can make it specifiable to flatc, an "--optional-field-keyword-tag" flag, just in case people are using it and can't stop.~

How do we represent this in the binary?

We are going with option (A).

(A) Non-Presence means None

Instead of omitting zero-like values, the generated code must store them. Non-presence for optional fields no longer means "whatever the default is," now it means None. You can interpret it as "the default value is None". This also means we cannot specify both specify a non-null default and mark the field as optional.

Pros:

This seems more intuitive.
It aligns with the "force_defaults" + "IsFieldPresent" hacky manual approximation of this feature.
If Nones are more common than zero-likes then this will have smaller binaries.

Cons:

@aardappel thinks this is harder to implement

"making presence an indicator would require we pass this special field status down to the field construction code to override the current val == default check, which means slowdown, codegen and runtime changes in all languages.. whereas my "least likely to be used default" trick requires no changes"

~(B) Some Sentinel value means None~

In this scenario, zero-like values are still not stored. Instead we choose some "sentinel" value which we interpret to be None (e.g. int can use int_min and float can use some kind of Nan).

Pros:

@aardappel thinks this is easier to implement

"it requires the schema parser to set default values for you, and no changes anywhere else"
If zero-likes are more common than None then this will have smaller binaries

Cons:

Someone might want to use the sentinel value (Hyrum's law).
- This can be mitigated by publishing the sentinels and letting users decide whether they need the sentinels.
This probably won't work for fields representing raw bits.

How do we represent this in every language API?

We'll need to change the type signature of all generated code (building/reading/mutating/object/etc) around the optional type to signal its optional-ness. I think we should use the language's local standard for optional types. Suggestions:

Python: Optional[T].
Rust: Option<T>.
C++17 has std::optional<T> but its not obvious what to use for earlier versions. T* would work.
Java: Optional shows up in Java 1.8 and triggers autoboxing, so idk :/

The exact generated-API for a language should be discussed in the PR implementing this feature in that language.

Out of scope

(I'll add links if you make issues for these feature requests)

Syntactic types
Default values for strings and tables

TODO

task	owner	done
Change flatc to support schemas with optional types and cause an error if they're used in unsupported languages	@CasperN	#6026 ✅
Implement optional type API in C++	@vglavnyy	#6155 ✅
Implement optional type API in Java	@paulovap	#6212 ✅
Implement optional type API in Rust	@CasperN	#6034 ✅
Implement optional type API in Swift	@mustiikhalil	#6038 ✅
Implement optional type API in lobster	@aardappel	✅
Implement optional type API in Kotlin	@paulovap	#6115 ✅
Implement optional type API in Python	@rw?
Implement optional type API in Go	@rw?
Implement optional type API in C	@mikkelfj	✅
Implement optional type API in C#	@dbaileychess	#6217 ✅
Implement optional type API in Typescript/javacscript	@krojew	#6215 ✅
Php, Dart, etc ...	?
Update documentation to advertise this feature	@cneo	#6270 ✅

[edits]

added todo list
added points from the discussion to each section.
added out of scope section
Decision: go with (A) and use = null syntax in schema file (cross out alternatives)
Updated TODO list, finished parser, Rust in progress
Change to schema grammar, link to swift PR, Note at top
Added more languages to the TODO
Lobster support 🦞
Kotlin and C support
Java, C#, TS/JS support and docs, issue closed, no longer editing.

mikkelfj commented 4 years ago

The =null syntax will complicate the parser.

This is my absolute smallest concern. At least for flatcc it would be probably be easier than dig through attributes. And from what I have worked with the flatc parser, I doubt that would be a major headache, and it is not multiplied by 10 languages.

CasperN commented 4 years ago

Alright, we have Parser support and an example in Rust. I think we're ready to start implementing the codegen changes for our many languages. Thank you, @mustiikhalil for getting a head start for Swift in #6038.

mikkelfj commented 4 years ago

@CasperN can you please summarise the specs of the feature, including the language changes.?

CasperN commented 4 years ago

@mikkelfj, I tried to do so in the top comment, but let me know if this is clearer.

Changes:

The schema language grammar change is only in the scalar definition: scalar = boolean_constant | integer_constant | float_constant | null
Field declarations with = null are marked as optional by the parser
This is an error at code generation time unless:
- The field is in a table
- The field's type is a scalar type
- This feature is implemented for all the languages being generated
Languages are expected to generate code for these fields with their local variant of optional types (instead of just the scalar type).
- Non presence of optional scalar fields in a flatbuffer binary must be read into a value outside the scalar, like None/NULL/nil.
- Serializing a type with None/NULL/nil must be represented as non-presence (vtable pointer set to 0).

mikkelfj commented 4 years ago

Thanks, your top comment was fine, but a lot has happened since then, unless you edited it.

As to grammar, you probably meant:

field_decl = ident : type [ = scalar | null ] metadata ; Otherwise null also happens as an attribute, which of course it could, but we never discussed that, AFAIR.

CasperN commented 4 years ago

Thanks, your top comment was fine, but a lot has happened since then, unless you edited it.

Yep, I'm trying to keep the top comment up to date with the most current state

mikkelfj commented 4 years ago

I think it is worth allowing = null on tables and strings as well, just to indicate the semantics, and for consistency. It may be important for schema translations. For example a string may become a NOT NULL in a SQL schema like https://www.sqlservertutorial.net/sql-server-basics/sql-server-create-schema/ It may also have significance in protocol buffer schema translations, although I haven't looked into that recently.

Sometimes I work on projects that only, or in part only, use FlatBuffers to generate JSON because FlatBuffers provides convenient encoding and decoding and schema management..

mikkelfj commented 4 years ago

Also, as I see you copied my schema suggestion: I made a syntax error, should have been:

field_decl = ident : type [ = ( scalar | null ) ] metadata ;

Note the parentheses and literal null.

mikkelfj commented 4 years ago

If tables and strings can be null, the C API could adopt the following semantics:

if string is not present and it has not been assigned a default null value, the field will return an empty string instead of a null pointer.

if a table is not present and it has not been assigned a default null value, the field will return an empty table instead of a null pointer. If the table has required fields a warning will be issued and the field will return a null pointer instead of an empty table. Empty tables can still return field default values. On a second thought, this will probably be a future feature with a command line switch, until then null is returned as before.

These semantics allow for future string default values and table default values.

As for structs, if they have a default null value, they will be converted into a union of None, and the struct, but the struct will be stored inline in the table when present and the table fields will remain the same, i.e. no special type field for the struct union.

Scalars will remain their current accessors, but also get a union accessor like structs. If the ordinary field accessor is used instead of the union accessor, a zero value is returned when absent, or a null for structs.

Note that C has a <table>_<field>(_get | _union | _type ) syntax where _union, and _type are only available for unions so far, but could be added for all fields with a null default value. The _get suffix can be omitted unless a special command line option is given to avoid name conflicts. The union returns a struct with the type and value fields.

mustiikhalil commented 4 years ago

@CasperN the swift implementation is completed, I added it for both tables and object-API

aardappel commented 4 years ago

@mikkelfj currently for tables and strings, = null is implied, since it is the only default possible. If and when we add defaults other than that for these types, we can add specifying this default. Right now, that would only be confusing, since a string field with = null would seem to imply that other fields do not have this default, which isn't the case.

CasperN commented 4 years ago

Bump @paulovap, @vglavnyy for Java and C++

vglavnyy commented 4 years ago

@CasperN, I will look this at the week.

CasperN commented 4 years ago

@rw can you do this for go and/or python?

aardappel commented 4 years ago

Support in.. Lobster ;) https://github.com/google/flatbuffers/commit/77f966f89fcf76fa67485adf941ac6e90d28029c

So Lobster has built-in support for "nillable" types, but sadly not for scalars. So the accessors for optional scalars instead return a second boolean to indicate if the value was present. Not great, but workable.

CasperN commented 4 years ago

@vglavnyy, how's the C++ implementation going? If you haven's started / don't have time, I can do it too.

vglavnyy commented 4 years ago

@CasperN I apologize for the delay (I was extremely busy). It would be great if you could do it.

I've examined C++ code generator to make PR. I don't understand how to Implement it with raw-pointers as proposed @aardappel:

const int32_t* get_maybe_i32() const { ... }

It is impossible because a pointed value is stored as little-endian but a host can be a big-endian. With C++ we have alternatives:

Generate scalar in a struct (boxing) and return a pointer to generated struct (alignment issue on re-cast).
Add flatbuffres::optional<T>.
Use the std::optional<T> or an user-defined optional (abseil and etc).

I prefer std::optional or a user-defined optional type. This is the easiest and portable way to implement it.

@mikkelfj Do you have a plan to implement support of optional is fltacc?

aardappel commented 4 years ago

It is impossible because a pointed value is stored as little-endian but a host can be a big-endian.

Argh, you're right, that's a shame, I liked the simplicity of this idea.

I still like the idea of a pointer, since that's how optionality is encoded in the rest of the API (for struct/table/string/vector). All of them have the nullptr value.

To keep it the same here, we'd return a pointer to a flatbuffers::Scalar<T> which has a private T and an accessor that can do the byteswap. We could provide an overloaded operator* for easy access.

I like that better than optional for consistency, but optional has the advantage that it will be more familiar to C++ users, and in C++17 we can use the native one.

vglavnyy commented 4 years ago

@aardappel Could you look at #6092 for discussion?

mikkelfj commented 4 years ago

@mikkelfj Do you have a plan to implement support of optional is fltacc?

Yes flatcc needs this feature, but it is not a top priority right now. Others are welcome to step, after discussing how to implement.

mikkelfj commented 4 years ago

How is this represented in the binary bfbs schema?

mikkelfj commented 4 years ago

What happens if a union is made optional?

aardappel commented 4 years ago

@mikkelfj yes, this should probably be added as a bool to Field in https://github.com/google/flatbuffers/blob/master/reflection/reflection.fbs since it's a built-in attribute. @CasperN can you add this?

The current feature is only for scalars, so does not apply to unions. Unions already have optionality built-in thru its NONE value, and do not have a default value.

mikkelfj commented 4 years ago

The plan for a FlatCC implementation for optionals are as follows:

All scalar table fields marked as optional will behave exactly as before, except they will get 0 as the default value when absent, and they will get a new accessor method with the _option() suffix, instead of _get(), which returns a struct field of two values <is_null, value>, where is_null is the negated is_present() method, and the value is the same as the get() method (thus 0 if is_null is true).

In this way fast and simple access is maintained for ordinary use cases while it is possible to efficiently retrieve an optional type in a struct return value.

The builder removes the force_add method from optional fields because force_add is always in effect. There is currently no support for writing an option struct: either the value is written or it is not. This is not fully aligned with how unions behave, since you can add a union value as a struct of <type, value> with the type NONE in a single add method call, but it could be added later if needed.

JSON parsing and printing have global flags to force default values to be printed or stored. These flags will be ignored for optional fields so an absent field in text results in a absent field in a buffer and vice versa with no default value processing.

FlatCC JSON does not parse or print null values. Originally I thought this was needed for optional values, I then realised that it is a separate issue and the null could be equally support for other field types. Which is a can of worms, and have a separate branch with a new JSON parser design, so I didn't want to complicate things further before that gets merged.

mikkelfj commented 4 years ago

OK, I have added a branch named optional to flatcc with support for optional scalars, it is ready to be merged and is only missing bfbs generation, pending reflection schema update. EDIT: I also added the nullable attribute to the bfbs schema.

https://github.com/dvidelabs/flatcc/tree/optional

There is one tricky special case: enums that do not have a 0 element because these require an initialiser among the enum elements, which of course is not possible with null. This is not a problem if you do not need the default value, but since FlatCC maintains the numeric _get() accessor along with _option() and becaue _option().value has a defined meaning even when _option().is_null is true, flatcc opts to define the default numeric value to 0 for optional enums that are missing a zero element.

I added some extra OptionalFactor enum fields to the test case:

https://github.com/dvidelabs/flatcc/blob/optional/test/optional_scalars_test/optional_scalars_test.fbs https://github.com/dvidelabs/flatcc/blob/optional/test/optional_scalars_test/optional_scalars_test.c

mikkelfj commented 4 years ago

Note that the my above discussion on default values for nullable enums also apply to the binary schema export where the field needs a numerical default value, which I set to 0 for enums even when there is no 0 element, when the default is null.

aardappel commented 4 years ago

I'd say that if is_null is true, then value should really never be accessed. It is good to define it as always being 0 for consistency (I did the same in the Lobster implementation, and we should do it for all languages where the value of an optional is readable even when not present).

But it shouldn't affect things like enums, the fact that the enum is "out of range" is not important for a value that you're not suppose to use.

There may be cases where a user blindly wants to use value without checking is_null, and it being 0 is useful to them, but I am struggling to come up with examples.

mikkelfj commented 3 years ago

Now that reflection.fbs has been updated, flatcc has merged optional scalar feature to master.

jamescourtney commented 3 years ago

What is the intended behavior when using optional scalars as a key for a sorted vector? Has this been considered? I realize scalars are not frequently used with sorted vectors, but it did come to my mind.

vglavnyy commented 3 years ago

@jamescourtney I think that scalar key and optional should be mutually exclusive features. At least until the known issues with key aren't solved (#6099, #5928).

jamescourtney commented 3 years ago

@vglavnyy -- that was my take on the issue as well, and it sounds reasonable enough. That behavior needs to be codified in the compiler though.

mikkelfj commented 3 years ago

Making keys mutually exclusive with optional could make it difficult to convert a schema to optional later, and it make it difficult to represent certain database models. Instead it could be defined that null keys sort as 0.

mzaks commented 3 years ago

What about just absence of value in the "map" So say we have an array like this:

[
{
  "name": "A",
  "timeOfBirth": 123,
  "timeOfDeath": 345,
},
{
  "name": "B",
  "timeOfBirth": 120,
  "timeOfDeath": null,
},
{
  "name": "C",
  "timeOfBirth": 140,
  "timeOfDeath": 250,
}
]

When key is timeOfBirth the map / sorted array looks like this:

[
{
  "name": "B",
  "timeOfBirth": 120,
  "timeOfDeath": null,
},
{
  "name": "A",
  "timeOfBirth": 123,
  "timeOfDeath": 345,
},
{
  "name": "C",
  "timeOfBirth": 140,
  "timeOfDeath": 250,
}
]

when key is timeOfDeath it's:

[
{
  "name": "C",
  "timeOfBirth": 140,
  "timeOfDeath": 250,
},
{
  "name": "A",
  "timeOfBirth": 123,
  "timeOfDeath": 345,
}
]

mikkelfj commented 3 years ago

That is also an option. It would be a lot more work to implement in flatcc, and overlaps with a filtering concept that we do not currently have.

mzaks commented 3 years ago

It would be a lot more work to implement in flatcc, and overlaps with a filtering concept that we do not currently have.

Do I understand it correctly that the first part of the sentence is a negative and the second part a positive? 😀

mikkelfj commented 3 years ago

More neutral. There is the risk that we add some logic that would not fit a more general filtering work, and concretely for flatcc, if there need to be filtering complexity, it better be in a given framework.

The flatcc implementation is a LISP like nest of macros that ultimately rely on accessors doing their job. Making this work with optionals is fairly difficult. However, if a null keys is always filterered, it suffice to check for is_present() on a field. Bu then you would filter out strings thet are null as well, which is probably a good thing - maybe it crashes today?

mzaks commented 3 years ago

It depends how null is handled during comparison. It can be handled as smallest value, the sort doesn't have to be stable (I guess) and also there is anyways an issue if equal keys are present. OR does the API return a list/iterator of values back?

mikkelfj commented 3 years ago

dlatcc accessors are functions taking a table argument. For scalars flatcc returns 0 on null, for strings it returns null on null. Optional scalars also have an option accessor which is a function returning a struct with a value and a is_null field. The heavily parameterized sort logic cannot trivially switch over to selectively use a different access method for optionals. Nor can it trivially filter out data since it performs an inplace heap sort on the existing array, although it could in principle reduce the vector size of the result, leaving garbage at tail.

By treating scalar null as 0, flatcc works out of the box. But as I said, I'm not sure that strings work equally well. However, since strings are handled somewhat specially in the logic, it would be possible to test for null and treat as the smallest string, if that doesn't happen already.

mikkelfj commented 3 years ago

Note that flatcc sorts in-place in an existing buffer.

CasperN commented 3 years ago

What is the intended behavior when using optional scalars as a key for a sorted vector? Has this been considered? I realize scalars are not frequently used with sorted vectors, but it did come to my mind.

Strings are already optional and can be used as keys, so we should share the same behavior. It looks like when annotated with (key), the string becomes a required field, hence key and optional are mutually exclusive features

vglavnyy commented 3 years ago

Making keys mutually exclusive with optional could make it difficult to convert a schema to optional later,

At least until the known issues with key aren't solved (#6099, #5928).

That was about C++ implementation.

By treating scalar null as 0, flatcc works out of the box.

Optional scalars don't have defaults, so there are no reasons to use zero as a default value. In my mind, a non-presented optional key should have the lowest sorting order even less than the most negative value of that scalar type.

jamescourtney commented 3 years ago

Strings are already optional and can be used as keys, so we should share the same behavior. It looks like when annotated with (key), the string becomes a required field, hence key and optional are mutually exclusive features

Agree with this. The main difference between this and strings, which are implicitly optional, is that this can be caught at compile time. In reality, this is probably a corner case inside of a corner case, but it's still worth coming to consensus on whatever behavior is expected.

The subtle difference here between null and a normal default value is that the library can choose to redundantly include a default value in the payload (which my library does for compatibility reasons to address #5833 ), whereas null has no binary expression.

In my mind, a non-presented optional key should have the lowest sorting order even less than the most negative value of that scalar type.

If we do end up allowing these to be keys, then this makes sense to me. Defaulting to 0 or otherwise would create situations where null keys were interspersed with 0 keys, and the sort wouldn't appear to be correct. It would also break binary search.

mzaks commented 3 years ago

AFAIU (key) feature does not work properly when there are duplicate key values, because then it is a multimap and not a map. So it does not matter if multiple values are defaulting to the same key or null can be considered as key, the map is broken. Or rather it returns one of the values for a key and not all of the values. And this is IMHO a bigger issue. Or rather we hit this issue with all solutions except for the one where we filter out the null values. But then we should also filter out the default values. So I think wes should address the map / multimap issue first.

mikkelfj commented 3 years ago

FWIW: flatcc uses heap-sort in-place in an existing buffer so the sort is not stable if there are duplicate key values.

jamescourtney commented 3 years ago

AFAIU (key) feature does not work properly when there are duplicate key values, because then it is a multimap and not a map.

Are you referring to flexbuffer maps? I don't think FlatBuffers ever specifies that keys need be unique; key is just a hint saying "sort by this" -- if the user wants to treat it as a map without duplicates they are of course free to do so.

My point is that null and 0 are logically distinct values, so coalescing null into 0 for sorting purposes means that a binary search for null may return the element with 0 as the key and vice versa. We may as well not allow optionals to be keys in this case.

mzaks commented 3 years ago

What I mean is following. The API which user uses to extract a value for key, returns one object or null. As explained for example in this documentation:

... you can use the ByKey accessor to access elements of the vector, e.g.: monster.TestarrayoftablesByKey("Frodo") in C#, which returns an object of the corresponding table type, or null if not found...

https://github.com/google/flatbuffers/blob/6cea45dcd3ad84c94c0062d0211736bba53188f2/docs/source/CsharpUsage.md#storing-dictionaries-in-a-flatbuffer

So if you have duplicated keys in a sorted vector eg: [5, 6, 7, 7, 7, 7, 7, 9, 11, 12, 15]. Binary search will return the first 7 it will find. And as a user you will think that there is only one value with a key 7. Which is wrong. This problem is present in FlatBuffers (key) feature and in FlexBuffers. IMHO the issue in FlatBuffers is bigger because the (key) property was not necessarily designed to be a unique value, where a key in FlexBuffers Map implies that it is a unique key. E.g. you can have the similar problem with JSON {"name": "Maxim", "age": 39, "name": Alex} Which name will be taken "Maxim" or "Alex" is AFAIK undefined behaviour.

My point is that null and 0 are logically distinct values, so coalescing null into 0 for sorting purposes means that a binary search for null may return the element with 0 as the key and vice versa. We may as well not allow optionals to be keys in this case.

@jamescourtney I agree with you on the first part, but in the grand picture IMHO it does not matter if we collapse multiple keys into one value with null because we are doing it already with default and generally in case of duplicated keys. 0 and null should be distinguished values, but if you define something as nullable there will be many values behind this one key and users will get unexpected results. So I think it is important to give a user a possibility to easily iterate over all values behind one key. And if we have it, then if 0 and null are the same key, users can filter out values they do not want.

aardappel commented 3 years ago

Seems to me if you're trying to sort on a scalar key, you do not want that value to be optional. I'd suggest to make the combination an error in flatc rather than trying to give null a place in the sorting order.

vglavnyy commented 3 years ago

@CasperN Can you look at this problem with optional? https://github.com/google/flatbuffers/blob/96d5e359773987314c25fa5181adb77c064ac625/tests/optional_scalars.fbs#L51-L52

CasperN commented 3 years ago

@vglavnyy so the problem with optional enums also happens with other code generators. I didn't implement support for Rust early on and so no one else did either. I think it makes sense to support them, so I'll try and fix what I can.

vglavnyy commented 3 years ago

@CasperN I've just fixed the C++ code generator: #6155. Now it should work with Optional<Enum>.

google / flatbuffers