Feature RFC: Flatbuffers support for Optional types.

CasperN commented 4 years ago

Note: This top comment is heavily edited. It was kept up with the current state until this issue closed.

Motivation

Flatbuffers should allow users to choose to use optional types. There has been some interest in distinguishing between default values (which are not stored in the binary) and some notion of None which the user controls.

Here are some links to previous interest in this idea:

Currently, a user can control a field's presence in the binary by specifying "force_defaults" and checking "IsFieldPresent" which is a bit of a hack. This proposal should define proper Flatbuffers optional types, which should be a better way of doing this. Use of this feature is only advisable for new fields, since changing default values is in general backwards-incompatible.

How do we represent this in the schema file?

We will specify it like so

table Monster { mana: int = null; }

This visually implies that optional types are at odds with default values and is "consistent" since the value to the right of the equals sign is what we interpret non-presence to mean.

Change to Schema Grammar:

field_decl = ident : type [ = (scalar | null) ] metadata ;

~We can add a field tag, e.g. "optional" or "no_default", that triggers this behavior. Hopefully no one is using those tags. Maybe we can make it specifiable to flatc, an "--optional-field-keyword-tag" flag, just in case people are using it and can't stop.~

How do we represent this in the binary?

We are going with option (A).

(A) Non-Presence means None

Instead of omitting zero-like values, the generated code must store them. Non-presence for optional fields no longer means "whatever the default is," now it means None. You can interpret it as "the default value is None". This also means we cannot specify both specify a non-null default and mark the field as optional.

Pros:

This seems more intuitive.
It aligns with the "force_defaults" + "IsFieldPresent" hacky manual approximation of this feature.
If Nones are more common than zero-likes then this will have smaller binaries.

Cons:

@aardappel thinks this is harder to implement

"making presence an indicator would require we pass this special field status down to the field construction code to override the current val == default check, which means slowdown, codegen and runtime changes in all languages.. whereas my "least likely to be used default" trick requires no changes"

~(B) Some Sentinel value means None~

In this scenario, zero-like values are still not stored. Instead we choose some "sentinel" value which we interpret to be None (e.g. int can use int_min and float can use some kind of Nan).

Pros:

@aardappel thinks this is easier to implement

"it requires the schema parser to set default values for you, and no changes anywhere else"
If zero-likes are more common than None then this will have smaller binaries

Cons:

Someone might want to use the sentinel value (Hyrum's law).
- This can be mitigated by publishing the sentinels and letting users decide whether they need the sentinels.
This probably won't work for fields representing raw bits.

How do we represent this in every language API?

We'll need to change the type signature of all generated code (building/reading/mutating/object/etc) around the optional type to signal its optional-ness. I think we should use the language's local standard for optional types. Suggestions:

Python: Optional[T].
Rust: Option<T>.
C++17 has std::optional<T> but its not obvious what to use for earlier versions. T* would work.
Java: Optional shows up in Java 1.8 and triggers autoboxing, so idk :/

The exact generated-API for a language should be discussed in the PR implementing this feature in that language.

Out of scope

(I'll add links if you make issues for these feature requests)

Syntactic types
Default values for strings and tables

TODO

task	owner	done
Change flatc to support schemas with optional types and cause an error if they're used in unsupported languages	@CasperN	#6026 ✅
Implement optional type API in C++	@vglavnyy	#6155 ✅
Implement optional type API in Java	@paulovap	#6212 ✅
Implement optional type API in Rust	@CasperN	#6034 ✅
Implement optional type API in Swift	@mustiikhalil	#6038 ✅
Implement optional type API in lobster	@aardappel	✅
Implement optional type API in Kotlin	@paulovap	#6115 ✅
Implement optional type API in Python	@rw?
Implement optional type API in Go	@rw?
Implement optional type API in C	@mikkelfj	✅
Implement optional type API in C#	@dbaileychess	#6217 ✅
Implement optional type API in Typescript/javacscript	@krojew	#6215 ✅
Php, Dart, etc ...	?
Update documentation to advertise this feature	@cneo	#6270 ✅

[edits]

added todo list
added points from the discussion to each section.
added out of scope section
Decision: go with (A) and use = null syntax in schema file (cross out alternatives)
Updated TODO list, finished parser, Rust in progress
Change to schema grammar, link to swift PR, Note at top
Added more languages to the TODO
Lobster support 🦞
Kotlin and C support
Java, C#, TS/JS support and docs, issue closed, no longer editing.

aardappel commented 4 years ago

I mentioned no_default since all fields are already optional, in a way. But it's not a great name either :)

A:

The slowdown can be avoided if a language implementation has a second set of runtime functions for each kind of scalar that does not take a default at all, and the codegen calls this function instead for fields that have this attribute set. I'd say this would have to be the implementation technique if we decide on this approach, since passing another bool to test down to the existing functions would penalize code that doesn't use this feature, which is currently all of it.
Generally the bigger downside of this approach is that the feature won't work correctly until all languages implement it. I'd feel better about it if a PR implementing this feature would attempt to cover most languages at once.
In C++, an int field with this attribute set would return an int * which may be nullptr, which is equivalent to the current workaround, which is to make the field of type struct Integer { i:int; }. We could then decide to make this generate std::optional when generating code explicitly for C++17 and up, though frankly that is not actually an improvement in ergonomics. The fun thing about the int * is that it can actually point into the buffer, so doesn't actually touch the int in question.

B:

Attractiveness of this approach is that it can work reasonably even if languages don't support it explicitly. It is however not very ergonomic having to deal with these special values manually, though helper functions can make this better bool FieldPresent(int i) { return i != 0x80000000; }
This trick is already in use by some users. @lu-wang-g

aardappel commented 4 years ago

@vglavnyy @rw @paulovap @mustiikhalil @mzaks @dbaileychess @dnfield @mikkelfj @krojew ..

paulovap commented 4 years ago

A: On Java, Optional shows up on API 1.8, so we would have to set this API as minimum requirements. Also, the use of Optional would trigger autoboxing for scalars which is a performance penalty.

B: I have a question here. Is the plan to set up default sentinels for each type or allowing users to define their own sentinel? Like:

table Test {
    myInt: Int32 (sentinel=-1);
}

That way the user would have control of what sentinel to use.

mikkelfj commented 4 years ago

I agree this is an issue. I mentioned this in the FB2 discussion as analyze SQL NULL values.

I have been thinking about this problem,

A) using is present: this doesn't work well with tables and strings etc. because these have not default value. A null value would have to be an offset to a special null object that would have to be checked on access, or the current null value would have to change interpretation. On a related note I have been considering offering a dummy object instead of NULL when the C api discovers a null because that is a safer and simpler approach. You can ask if it is a dummy object explicitly, and otherwise access it safely but getting only dummy results.

B) sentinel values are not good, and if they are, they are better done by the end user for specific use cases, i.e. FlatBuffers need not change.

C) My proposal. Add an additional bit-vector field at vtable offset 0 for table that have optional values. The bit-vector is stored inline in the table, but could also be an external offset if is believed that a) the are large enough to make this worthwhile, and b) data sharing is likely.

Now you can access data just as before, without overhead, but you can explicitly ask if the field is marked as NULL, if you care. This can be very efficient, and the builder can easily add this information as it already has vtable vector in memory that it writes out, and can add the null vector at the same time.

mikkelfj commented 4 years ago

The reason why B) is not good is because you cannot roundtrip SQL data transparently. This is an absolute requirement for this feature.

CasperN commented 4 years ago

One thing we can do for (A) and the "simultaneous implementation problem," which I agree is a huge problem, instead of implementing it in a huge PR, is to not generate code for optional fields (and warn) if the language isn't ready yet. I think this will at least make code review easier.

Perhaps we can also give users the option to use existing codegen and zero-defaults if they really want it and promise they'll always check for field presence manually... but such an option is definitely a foot-gun.

I agree with the use of nullable pointers over optional for performance reasons. Rust should prefer Option<&mut T> (if it had a mutation API) since raw pointers are frowned upon in that language and it'll compile down to a nullable pointer anyway.

mikkelfj commented 4 years ago

Oh, regarding C) it might be strange if a NULL field returns 4 because 4 happens to be the default value. For this reason it makes sense to support a default null value also. That is effectively option B), but without the hack since you have precise NULL information.

EDIT: this would add runtime overhead because it would need to check the null table to see if default or null default should be returned. But at least it only happens for absent values.

On Wouters suggestion to have a parallel set of accessors. That would add a lot of loud on compilers to strip out even more templates in C++ or macros generated inlines in C, and complicate code generation quite a bit in C.

mikkelfj commented 4 years ago

We could also require the default values only apply for non-optional fields, and always store values in optional fields, unless they are NULL, and have the bit vector. The default value then becomes the return value if NULL, and NULL can be checked separately.

aardappel commented 4 years ago

@paulovap the sentinal IS the default, so if you wanted to use -1, you could already write myInt:int = -1 already :)

@mikkelfj certainly do not want to add extra encoding to the binary to support this feature, I think that is too high a cost for such a simple feature, even compared to A).

@CasperN we'd have to actively error-out if someone compiles with an unsupported language, rather than ignoring the field, I think, like we currently do with other unsupported features. Doing that in multiple PRs would be ok, as long there is an effort to support most languages.

mustiikhalil commented 4 years ago

As @aardappel already said, most of the languages can return a null value (from the ones I’ve worked with) so why not introduce an optional flag ‘age: int (optional)’ and instead of returning a default value we just return null. Most of languages has some type of optional, if they don’t we can mark the name of the object as optional when we generate the code. This won’t require us to play with the binary data, and would only require small refactoring for each of the language generators. As as example swift already can return optional structs, tables, strings (we are just missing scaler optional)

mikkelfj commented 4 years ago

Well, if the cost is too high, I think we should not do it. I'd rather not have a half-baked solution on top of an already half-baked solution.

aardappel commented 4 years ago

@mikkelfj what is half baked about A) ? only scalars require this feature, all other types can already be null.

CasperN commented 4 years ago

@mikkelfj

Can you elaborate on

A) using is present: this doesn't work well with tables and strings etc. because these have not default value. A null value would have to be an offset to a special null object that would have to be checked on access, or the current null value would have to change interpretation

I'm not sure what you mean by special null object, can you give an example?

vglavnyy commented 4 years ago

With C++ the simplest way is to embed an optional scalar into a struct.

table Foo{
 bar : int (boxed);
}
// implicitly translated to:
table Foo{
  struct _ { bar : int = 0 }; // anonymous transparent proxy
}
//, or translated to
table Foo{
  struct bar { _ : int = 0 }; // named proxy
}

Struct might be present or not. With C++17 it is possible to generate Optional or Expected for any pointer-like (non-scalar) fields. In a json it can be like this:

"Foo" : {
 "bar" : [-12]
}
// or 
"Foo" : {
 "bar" : { "_" :-12 }
}

krojew commented 4 years ago

I'm for option A, since option B is what we have now in practice. On how to implement A - well, that's another problem. As far as I know, all our supported languages (maybe except Lobster, whatever that is 😄 ) have some notion of optional data, so we're fine in terms of compatibility. Of course, this would require making changes everywhere, but I personally, am fine with such cost if we finally get proper optional data support for scalars.

mikkelfj commented 4 years ago

@aardappel

what is half baked about A) ? only scalars require this feature, all other types can already be null.

First, the current use of null and default values is half-baked because strings and other non-inline data do not have default values, different from scalars. We could add default values for strings.

Second, hijacking presence for NULL is half-baked because it hijacks the current use of default values for a differen purposes. Granted, it can work if you change schema semantics for optional values, but now you have yet another layer of whatever presence and null offsets means, depending on a schema flag.

I'd rather have null as a clearly stated non-negotiable data point for all imaginable data types. If not a bit vector, it could also be a special vtable offset value like 0xffff - it takes up more space, but only affects verifiers since non-supporting implementations would just see a future version of the format and choose the default value of the field, while supporting implementations would only take a performance hit if detecting the field is out of range, which it has to check anyway.

@CasperN

I'm not sure what you mean by special null object, can you give an example?

It was actually Wouter's suggestion originally when I once mentioned the lack of Null transparency for SQL data.

The idea is that to avoid interfering with the current null interpretation, you store a special object in the buffer. The uoffset (not voffset) poins to this object for NULL entries. It can be added to the buffer as an object with just an empty vtable (+/- handling of required fields for the type, but ideally just one object for all types, but NULLable shouldn't have required fields?). Any user level code can do this today, except it is hard to tell if an object is the NULL object or some other object. One could argue that you could just store and actuall NULL instead, but that changes the semantics of the current system and would prevent adding default values for strings and tables etc.

mikkelfj commented 4 years ago

To extract a key point from my above comment:

D) store NULL values as virtual table voffset 0xffff because it is transparent to current implementations and acts the same as a missing entry with voffset 0x0000 except it looks like a future version instead of entirely missing. in both cases a default value is returned. It is now possible to add a separate support function: IsNull() similar to IsPresent(), but it avoids overloading current semantics, and it does not introduce a new bitmap data structure as my proposal C). The benefit of C) is that you can have short vtables if all the last fields are NULL, but I think D) is probably preferable.

A larger benefit of D is that you entirely avoid changing the schema semantics with an 'optional' setting because you can just store as value as NULL by adding an API function: SetNull() in the builder. Of course you could add attributes to block, or allow NULL, but it is not required.

mikkelfj commented 4 years ago

Sorry, I'm not thinking straight. You cannot transparently store 0xffff as a vtable offset. Future versioning requires a field in the vtable outside the currently expected max vtable size, but you cannot store invalid offsets within the table because common reader logic will not check the validity of this offset, only verifiers will. The approach could still work if all readers learned to check this similar to the 0 check they perform now, but that is not practical. That is why the bitmap approach is better.

CasperN commented 4 years ago

@mikkelfj

Second, hijacking presence for NULL is half-baked because it hijacks the current use of default values for a differen purposes. Granted, it can work if you change schema semantics for optional values, but now you have yet another layer of whatever presence and null offsets means, depending on a schema flag.

I see it as as having the same semantics, just that the default value is a value outside of the type of the scalar itself, NULL.

I might be misinterpreting you but I think you're looking for a default vs null system for vectors, tables, and strings (in addition to scalars). The current state of the world is that scalars are default-y and the more complex objects are null-y. This proposal allows users to choose default-y or null-y semantics for scalars but you're worried that it may complicate a future feature where users can choose to make the complex stuff default-y. Therefore this proposal should also design a solution so tables, strings, and vectors can be default-y.

On the bitmap thing,

I'm not sure of the value of introducing bits to track "default value" vs "null value". It seems to me like a premature optimization of a case where some default and null values are very common, that has a cost on every other use case, though I'm not sure what cases you have in mind.

aardappel commented 4 years ago

@mikkelfj

First, the current use of null and default values is half-baked because strings and other non-inline data do not have default values, different from scalars. We could add default values for strings.

I believe that is a separate issue. Not having defaults is maybe not great, but it doesn't affect the string's ability to be optional in the sense of this issue. In fact, non-inline data (and structs) are ahead of scalars in this sense.

There are "issues" with supporting default values. For strings it is reasonably easy, but some form of default table is not exactly a lightweight feature.

We're not hijacking anything, we're merely extending the existing semantics. Under proposal A), an int (optional) would have the same semantics and representation (and almost the same API) as struct Int { i:int }. It is not that new.. for me the big deal is getting all languages to support it.

And I am not at all familiar with SQL, so I really don't follow your concerns around null in general. I am mostly familiar with the nulls in C/C++/Java.. and I believe Flatbuffers follows a similar model.

mikkelfj commented 4 years ago

And I am not at all familiar with SQL, so I really don't follow your concerns around null in general.

NULL in SQL is controversial because the original designers (see book by C.J. Date on SQL) considered it impure in an otherwise elegant and consistent model based on algebra. However, this leads to requiring a defined 0-value value (like an empty string, or a numeric -9999, or whatever) in the domain of any given type, such as integers or strings. Pragmatics found it better to explicitly have a NULL value outside the value domain when they had lots of tables with data in which only few contained actual data. You can also select records where fields are not NULL or vice versa without having to define what a 0-value is in a given context.

In actual SQL use all of this can be condensed down to: If you have data extracted from a database that you want to process and reinsert into the database, you can expect to see NULL values, and you cannot consistently assume that you can map NULL to some other value (a 0-value) because that might overlap with a value meaning something else. So if you need to reinsert data without loss, you must keep track of the NULLs. Of course this is a mess, but a practical necessity.

Now, for this to work with FlatBuffers, you need to know exactly what is a NULL and not just what is NULL by convention wether it be a NULL string, or a default scalar value or an absent scalar that is different from a present default value.

If you add optional types to the schema, you can impose exact semantics for NULL simply by defining what NULL means for the given type and make sure that the equivalent type for SQL usage can represent all values in addition to NULL. This can actually work, but it is somewhat messy and hard to explain, but can be helped with API support.

If you do not modify the schema but just decides what constitutes a NULL, you are open to all sorts of interpretations in present use which will not work.

Optional types corresponds to SQL schema types where a field can be allowed to be NULLable, or not.

You also need to consider how the data can be represented in JSON without loosing NULL, but this should be workable.

I'd prefer a bitmap because it takes all guesswork away for any current and future types. It can be directly exported to other data formats such as Arrow that also uses bitmaps for NULLs, and it is fast to process because you can just scan the bitmap rather than special casing each kind of value that can be NULL.

Bitmaps do not interfere with current implementations because they are just an extra field in a table that you are not forced to look at if you do not care, but of course you cannot write NULLs if your API does not support it, unless you opt to write the NULL bitmap yourself, which is possible because it would just appear as a ubyte array.

So, if you do not want bitmaps, it can still be made to work if you introduce the optional attribute in the schema, but if you do not, it will be a mess.

mzaks commented 4 years ago

What about introducing the optional scalar types:

Byte Ubyte Bool
16 bit: Short Ushort
32 bit: Int Uint Float
64 bit: Long Ulong Double

where compiler when sees those per default creates structs:

struct Int {value:int};
struct UInt {value:int};
etc...

And languages which support actual optionals can replace this default behaviour by generating an API with optionals. This way there is an option to have a "small" non breaking change which will work for all languages and opens up gradual migration path for every language specific code generator.

PS: If by any chance people already have a Int type in their schemas, we can introduce a flag to suppress the Boxed scalar struct creation.

CasperN commented 4 years ago

@mikkelfj

[null] leads to requiring a defined 0-value value (like an empty string, or a numeric -9999, or whatever) in the domain of any given type, such as integers or strings. Pragmatics found it better to explicitly have a NULL value outside the value domain

This seems like options (B) and (A) respectively.

you can expect to see NULL values, and you cannot consistently assume that you can map NULL to some other value (a 0-value) because that might overlap with a value meaning something else. So if you need to reinsert data without loss, you must keep track of the NULLs ... Now, for this to work with FlatBuffers, you need to know exactly what is a NULL and not just what is NULL by convention whether it be a NULL string, or a default scalar value or an absent scalar that is different from a present default value.

I think we know exactly what is a NULL under (A) for optional scalars in a table. They are null if and only if IsPresent == False.

If you do not modify the schema but just decides what constitutes a NULL, you are open to all sorts of interpretations in present use which will not work.

I'm focusing on the words "in present use": I think you mean fields that are currently scalars cannot safely be redefined into optional scalars because we cannot distinguish between NULL and default. This is correct and intended. For existing fields, changing the optional-ness is a bad idea (and users will need to be educated about that).

Adding bits, as you suggest in option (c), will allow for retrofitting null onto existing scalars. Are you claiming retrofit-ability should be a requirement?

Fwiw, my opinion is that allowing existing scalars to convert into being optional is not worth the extra bits of option (c) and educating users that "optional (or lack thereof) is permanent" is sufficient for this feature.

You also need to consider how the data can be represented in JSON without loosing NULL, but this should be workable.

Good point. At first glance, json's null keyword/literal seems like a good fit to me.

@mzaks

I like your idea because it lets all languages generate a default API for optionals, though it is essentially syntactic sugar in the schema. What would happen to the Int/Uint after the language in question adds optional support? Would that be a code-breaking change?

Regardless, I think we should start with the "error-out if someone compiles [optional scalars] with an unsupported language" approach, since its a lot easier. Your idea should be revisited in the scenario where we go ahead with option (A) but haven't implemented optional scalars in all languages and really need to. However, I don't think we'll end up in that situation since a prerequisite of accepting this feature is confirming we'll actually do the work in enough languages.

@aardappel, what would you like to see before deciding between (a) and (b) and giving a go/no-go on this feature? What is the minimum set of languages that must support this feature for it to be worth it?

I can volunteer for the following tasks (unless someone else wants them):

Changes to flatc (adding the tbd optional field attribute and error-ing out)
C++ generated API changes
Rust generated API changes

mikkelfj commented 4 years ago

Adding bits, as you suggest in option (c), will allow for retrofitting null onto existing scalars. Are you claiming retrofit-ability should be a requirement?

One the one hand it would be useful, on the other hand it still requires marking the table as having NULLable fields, as opposed to marking individual fields. Bitmaps also provide fast processing of data which can be rather helpful, and it takes up less space than forcing default writes on some cases.

But mostly it rubs me the wrong way that you can declare -4 to be the default value, but it really means nothing, since -4 will be stored if not NULL, and if NULL, well, it doesn't really matter what the default value is.

mzaks commented 4 years ago

Here is how I imagined my proposal in detail:

Make following changes to parser and the AST model. Make sure that parser accepts the new optional types (e.g. Int) and marks which of them were used in schema. Add a new bool property to AST model representing structs (e.g. syntaticType: bool). Generate AST models for marked optional types with syntaticType = true.
Start changing the code generators to not generate types for syntaticType = true, but use native types like Option<i32> in Rust.

Step one can be done by one person and will work even without step two, the "old" code generators will just generate code for the Int struct and use this type for table fields. The new generators which know about syntaticType will not generate the Int struct, but type the fields in the table appropriately (e.g. Option<i32> in Rust).

Potentially the syntaticType concept could be improved upon to support more native types for a language e.g.:

struct UUID (syntaticType) {
  bytes: [u8];
}

Where in all languages which have a native UUID type (e.g. Swift), the native type is used and for other languages a struct is generated.

aardappel commented 4 years ago

@mikkelfj from what you say, you'd be happy with A) also, it encodes a null value out of the domain explicitly. Adding bits wouldn't make this any different, they are redundant.

@mzaks that could work, but I wouldn't want to clutter generated code with all these definitions for these auto generated structs. Option A) is what you get if you instead access the scalar directly as if it was a struct. I'd prefer that to adding new types.

@CasperN option B) is trivial, it is merely a convenience feature on the status quo. It is just if (attr == "optional") default_val = INT_MIN. I am "warming up" to option A), but I'd like whoever person(s) commit to implementing this also commit to a "majority" of languages over not too long a time. They don't have to do it all themselves, but maybe help in tracking and coordinating with language authors (quite a few have been active on this thread already).

CasperN commented 4 years ago

@aardappel, I can do tracking/coordination too.

@mzaks I'm a little wary of making that part of making your SyntacticTypes feature part of optional scalars since it seems like a bit of additional work. I think it should be a separate feature request.

Can everyone participating vote on A vs B vs do-nothing and maybe volunteer a language to implement?

I support (A). I'll track work by editing the top comment

aardappel commented 4 years ago

Ok, @CasperN is in charge of this feature :) I'll vote (A), then (B), then (do-nothing) in that order. I can help with C++ if no-one else takes it on.. @vglavnyy @yaoshengzhe ?

vglavnyy commented 4 years ago

What is the plan for implementing of (A)? In fact, all scalars in binary already are optional values. A scalar value is present or not. We only need a special attribute:

to enable the generation of an additional getter with MayBe return value;
to modify implementation of setter (local force_defaults).

That’s all. These changes are compatible with already existing binaries and user programs. The real problem is generated Object API. This attribute will change a type of a scalar inside T-generated tables.

Example:

table Monster { mana : int = -1 (maybe); }

class Monster {
  int16_t mana() const { return value or default; }
  flatbuffres::optional<int16_t>  get_mana() const;
  void add_mana(int16_t mana) {
    fbb_.AddElement<int16_t>(Monster::VT_MANA, mana); // WITHOUT default value!
  }
};
// Optional for default-constructable and copyable T.
// Can have a special implementation if T is a pointer.
template <class T > class optional {
  optional(); // none with T{};
  optional(bool none, const T& val); // only with explicit default value
  T& value(); // return ref to value or ref to default value (never throw)
  T& operator*(); // return value()
  operator bool() const;
  bool has_value() const;
  std::optional<T> get() const; // down-cast to std::optional<T>
  operator std::optional<T>() const; // implicit down-cast to std::optional<T>
};

CasperN commented 4 years ago

@vglavnyy,

In fact, all scalars in binary already are optional values. A scalar value is present or not.

Currently they're optional in the binary, but non-presence is interpreted as "return the default value according to the schema" so from the perspective of user's code, "all values are there". Proposal (A) allows non-presence to be interpreted to mean None (ie "Nothing" in the words of haskell's optional type, "maybe"). That is semantically at odds with "non-presence means default".

I think your table:

table Monster { mana : int = -1 (maybe); }

should be an error because I think there shouldn't be a non-null default for optional types. The following is fine.

table Monster { mana : int (maybe); }

Minor bike-shedding idea, we could specify optional types like so:

table Monster { mana : int = null; }

This visually implies that optional types are at odds with default values and is "consistent" since the value to the right of the equals sign is what we interpret non-presence to mean.

The real problem is generated Object API. This attribute will change a type of a scalar inside T-generated tables.

As discussed above, changing a field from default-y to optional is a binary-semantic breaking change. It is also intended to be a code breaking change (in all languages except maybe python) since we want to show, in code, that the type is optional. And not just in the object API, even the normal accessor should change: e.g. optional<int16_t> mana(); or int16_t* mana(); (the exact form should be debated in the C++ PR).

What is the plan for implementing of (A)?

First, flatc should recognize whatever syntax we choose to represent optional values and error-out when optionals are detected.
For each language, implement support for optional types (with whatever API is best for the native language). In that PR, remove the flatc error when generating code in that language.
When we have enough languages supporting this, then start changing the documentation and advertise the feature to users.

mikkelfj commented 4 years ago

I agree with @CasperN on why an optional type is needed. But rather than using an attribute, Casper alternatively suggests:

table Monster { mana : int = null; }

That is not bikeshedding, that is a great proposal because it removes the ugliness of having to explain why a default value is not allow for optionals, or alternatively figure out how to handle them in the context of optional. If you dislike the null keyword, it could also be none.

One potential issue is when converting existing data to a nullable type. In this case absent values gets converted from, say -1, to null. You might also want be able to return a default value when a value is null, rather than having to check for it. Essentially, you want the appearance of B by default, even if you use A. For this reason, I'm not sure it is such as good idea after all.

If we do this, I am almost sold on A without bitmaps.

Maybe we should add support for default values of strings such that one does not have to get null exceptions when reding null values by default. Here a strings default could be null, "", or "killroy wansn't here". Note that A) actually eats into the existing value domain because null is effectively part of the string domain as it is - unlike the default values of scalars, and the same for tables and unions.

As for tables I'd like a default value for those as well. Just the empty table with an empty voffset table.

Tables with required fields shouldn't be nullable. This prevents having a default table value.

In FlatCC I'm inclined to always return a value for optional types. The default value can still be null for strings and tables, but it has to be stated explicitly or the default will be an empty string or an empty table.

CasperN commented 4 years ago

One potential issue is when converting existing data to a nullable type...

Let's avoid the issue by saying "please don't do that." The official documentation says "You generally do not want to change default values after they're initially defined." That applies here too, doubly, because it changes the binary semantics and also breaks code.

Maybe we should add support for default values of strings ... As for tables I'd like a default value for those as well

I agree, but please open a new issue for that.

vglavnyy commented 4 years ago

For example:

a server uses an old schema without an optional field 'mana'.
a client has updated schema with the mana field.

All client programs should know the default value if mana() isn't present in messages from the server. Generate or use any addition sounded methods to get this default will be extremely terrible.

 auto opt = m.get_mana();
 // in another routine:
 opt.value_or(m.mana_default()); // here I don't want to know about monster `m` and about `mana` name.

This is isn't a big problem, I can write an ugly macro to generate my own optional value by a field name. But I sure that optional field should have declared default value.

CasperN commented 4 years ago

@vglavnyy I think you're offering an interpretation for non-null default values for optional types: That value will be available as a constant somewhere (e.g. we generate mana_default()) - though I'm not sure what would it return if the user didn't specify a default. The obvious options are null, 0, and 'not generated'.

This interpretation allows the user to track more information in the schema at the cost of a little code bloat and usage complexity. Note that, optional-defaults will be stored in the binary, unlike normal defaults, which might be surprising. What does everyone think of the complexity trade offs? I could go either way.

mikkelfj commented 4 years ago

Based on my previous comment, I now find it quite natural that the default value of an optional field is used as the return value for normal field access such that the field accessor function can remain within its value domain, and such that applications that do not care, do not have to care. Applications that need to check for null simply call field.IsNull, or field.IsPresent().

I have been given this some more thought: if a safe value is always returned, including empty tables, then the only unsafe access is when a union type not inspected, an offset to structs because it is not easy to provide a default struct if one isn't specified and structs could be large. Fortunately such structs only happen as union members, so it falls into the first category: check the union type before access.

Additionally, what happens if a union value is not None, but holds a value of a known type which is Null? This is already a problem today: FlatCC does some interpretation on these border cases, and I think it is documented in the flatcc binary format document, but I don't recall the details up front.

In any case, I think Null values should always map to the union None type even though this results in a loss of type information.

CasperN commented 4 years ago

@mikkelfj

I now find it quite natural that the default value of an optional field is used as the return value for normal field access such that the field accessor function can remain within its value domain, and such that applications that do not care, do not have to care. Applications that need to check for null simply call field.IsNull, or field.IsPresent()

It sounds like you want optional fields to be implemented as a field-specific force-default, without changes to the generated API. While it makes existing default-y code compatible with optionals, I think that's a bad idea from a semantics standpoint. If a user provides a schema where the nullability of data is important, the type system should force them to confront that. Programmers will forget to check IsPresent, turning a compile error into a runtime logic error. Also, this system is effectively the current state of the world and is inconsistent with how nullability works with strings, vectors, unions, and structs. Those use nullable pointers.

I have been given this some more thought: if a safe value is always returned...

I think you want a "safe" default-like value to always be returned (like how it is in Flexbuffers). It is just a coincidence that optional types are sometimes "unsafe" in c because they're represented with nullable pointers that can fail to be dereferenced... but we could use optional<T>, that's safe. Imo, nullable pointers are common and therefore "safe enough" for C users, but this is a language specific implementation discussion.

The problem with always returning a default value is that there absolutely are use cases for optional types. You mentioned working with SQL, which is a great example. The schema-writer decides whether they want optionals or defaults. We must not choose for them. This is literally the most common complaint with proto3.

union type

I think that's out of scope. At risk of expanding the scope of this conversation, I think a consistent nullable-vs-default story would look like:

type	if schema doesn't say	specifiable defaults	optional
scalars	zero like default	✔️ users can specify any value	this issue
tables	optional	empty table only	✔️
strings	optional	users can specify a static string	✔️
vectors	optional	empty vector only	✔️
structs	optional	needs new syntax so users can specify a default	✔️
union	optional	needs more new syntax so users can specify a default	✔️

The last 5 rows are design and implementation work is required to users to opt into safe defaults and avoid nullable pointers. I'll be happy to work with you on defaults-for-all after this issue, but please let it be out of scope for now.

Aside, my flatbuffers 2 request is consistency in the "if schema doesn't say" column

mikkelfj commented 4 years ago

https://www.lucidchart.com/techblog/2015/08/31/the-worst-mistake-of-computer-science/

CasperN commented 4 years ago

@mikkelfj

https://www.lucidchart.com/techblog/2015/08/31/the-worst-mistake-of-computer-science/ The problem with NULL is that it is a non-value value, a sentinel, a special case that was lumped in with everything else.

I think this is an argument against (B) and option (A) is winning 2-0 as of this comment. The author of that article is in support of optional types that can't be implicitly downcast to the thing they contain, like Haskell's Maybe a or Rust's Option<T>. I fully agree with them, intend to use Option<T> in Rust, and suggest using std::optional<T> in C++.

Are you voting for "do nothing" then?

mikkelfj commented 4 years ago

No, since there is no traction for C, it must be A, because we need NULL in the real world.

But that doesn't mean that NULL should be forced in the access API for the reason of ugliness, safety, ease of use, and elegance. That is why I discussed the use of default values to present NULL when attempting to read NULL within the value domain. When that isn't sufficient, you can check for NULL. You can't check for NULL on scalars anyway, but just reading the scalars, but some languages can wrap them into an option type. (In C this happens with unions because a union is actually returned as a two field struct, but that is not reasonable for scalar values). There is also a performance aspect.

Most often you have:

if value not null then x = value else x = 42; proceed with algorithm.

You can avoid this by taking advantage of the default values. You also often find dict.get(key, defaultvalue) in APIs which are highly useful.

mikkelfj commented 4 years ago

Note that this has the convenience of B) without actually hijacking the value domain. You still have exact NULL when needed, but when you don't need it for a particular computation, you can pretend to have B.

CasperN commented 4 years ago

@mikkelfj

Most often you have:

if value not null then x = value else x = 42; proceed with algorithm.

You can avoid this by taking advantage of the default values.

I think @vglavnyy 's interpretation of default values for optional types might be the way to go then.

table Monster { int mana = 42 (maybe); }

class Monster {
  std::optional<int> mana();  // std::nullopt for non-present values
  int mana_or_default();  // 42 for non-present values
  inline static int kManaDefault = 42;
}

What do you think of that?

Another alternative I thought of is introducing a custom type flatbuffers_optional<int,default=42>, which I'm a bit less comfortable with since not all languages can parameterize by values. It could lead to namespace pollution.

it must be A, because we need NULL in the real world.

Awesome, would you be willing to do the implementation for some language?

aardappel commented 4 years ago

I would agree that this new kind of optional scalar field should NOT allow a default, either by disallowing it (if its an attribute), or replacing it (by using = null, which I also quite like).

I don't see the need for there to be an accessor that still accesses these fields "blindly" by additionally providing a default. Though I am general in favor of such APIs, if you explicitly mark this field as optional, you are indicating you want to be precise about its presence. If you cared for an easy API you should be using a regular field.

And the whole point of such optional fields is maybe also that a default is not easy to find, i.e. all values are legit. Besides, it would give us 3 kinds of fields: standard, optional+default, optional+no_default, and that's frankly more complexity than its worth (remember, we still have to support this is 10+ languages).

So if you receive old data using a new schema that has such a field, then yes, the default value of that field will be null. You must check for it. I don't think this will be a problem.

@vglavnyy by default, I think the value of an optional int field should be const int *, as we return const * types for all fields that may be null currently. We can look into supporting std::optional if we're generating for C++17, but I don't think we should generate our own optional type.

mikkelfj commented 4 years ago

Awesome, would you be willing to do the implementation for some language?

Would someone be willing to help me out on the C language - I'm doing almost everything on my own there?

vglavnyy commented 4 years ago

We can look into supporting std::optional if we're generating for C++17, but I don't think we should generate our own optional type.

A flatbuffers optional isn't necessary if the default value of optional fields is null. I just found another way to solve the default problem, I can use the native_default attribute.

CasperN commented 4 years ago

if you explicitly mark this field as optional, you are indicating you want to be precise about its presence. If you cared for an easy API you should be using a regular field... it would give us 3 kinds of fields: standard, optional+default, optional+no_default, and that's frankly more complexity than its worth (remember, we still have to support this is 10+ languages).

I find that pretty convincing. Let's go with option (A) and the = null syntax then.

@vglavnyy can you do C++? @mzaks can you do Swift? @paulovap can you do Java? @aardappel can you do the flatc changes?

vglavnyy commented 4 years ago

The =null syntax will complicate the parser. The simplest way is using an explicit named attribute: optional or maybe. If this is =null, then the parser must be changed first. It should maps =null into a hidden attribute maybe and adds workarounds for standard scalars. Only after the tests passed we can modify the code generators.

mustiikhalil commented 4 years ago

@CasperN

  std::optional<int> mana();  // std::nullopt for non-present values

This is the change that would happen for the languages, swift doesn't need anything other than the code gen changes. since it's simply going to be translated to:

var mana: Int32? { let o = ... return o == 0 : nil ? table.getValue }

Plus if we want to add the default values for these optional fields

 int mana_or_default();  // 42 for non-present values
 inline static int kManaDefault = 42;

then the code gen should handle all of this

paulovap commented 4 years ago

One question for @CasperN

Java: Optional shows up in Java 1.8 and triggers autoboxing, so nullable pointers might be preferred too.

How nullable pointers will prevent autoboxing?

CasperN commented 4 years ago

@paulovap, I don't understand Java so maybe I should just cross that out. When I wrote that, I interpreted the "box" of "autoboxing" to mean what Rust means by "box" which is heap allocation (so a pointer into the buffer would be better), but now that I think about it, the term probably means something different in Java.

paulovap commented 4 years ago

This change would move a primitive int on Java into an Integer object, which might incur a big performance penalty.

google / flatbuffers