Closed vbarua closed 1 month ago
If the count is not specified (i.e. omitted in the wire proto) then the semantics you are looking for are available. Is the problem that most consumers are performing a getCount() (or equivalent) instead of checking if the field exists first?
Consumers should also support count without offset which should require them to use the protos in the same way. Is this documented differently in the markdown?
The related documentation link is here. I prefer to keep the required/optional semantics in the file (i.e. close the definition) to make this easier to keep track of for this reason. I still fondly remember proto2 which allowed one to specify optional and required directly on the fields for this reason.
So the documentation is the problem. While we're looking at issues with the documentation is there any reason that count has to be positive? Zero seems quite reasonable from a syntactical perspective (although not generally useful in an optimized plan).
Is the problem that most consumers are performing a getCount() (or equivalent) instead of checking if the field exists first?
I don't think optional works that way for scalar fields:
Message fields can be one of the following:
optional: An optional field is in one of two possible states:
the field is set, and contains a value that was explicitly set or parsed from the wire. It will be serialized to the wire.
the field is unset, and will return the default value. It will not be serialized to the wire.`
In other words, if the field is not set, it will take on its default value. The consumer cannot distinguish between the two cases.
In protobuf (at least v3 but I'm pretty sure v2 as well) there is no way to distinguish between a 0 and a non-emitted value. See this:
# NOTE: As of proto3, HasField() only works for message fields, not for
# singular (non-message) fields. First try to use HasField and
# if it fails (with a ValueError) we manually consult the fields.
Then finally, this is also in the docs:
Note that for scalar message fields, once a message is parsed there’s no way of telling whether a field was explicitly set to the default value (for example whether a boolean was set to false) or just not set at all
However, we could choose to define count=0
as "do not use count" (maybe this is what you were suggesting).
The behavior for scalar fields changed in v3.15.0 to add back optional as it was in proto2:
https://github.com/protocolbuffers/protobuf/releases/tag/v3.15.0
If we don't want to require that version (from 2021) we could utilize the "zero is not present approach".
Interesting. Looks like it uses a oneof under the hood so I don't think this is a backwards compatible change. Still, pretty cool it can be done.
In other words, if the field is not set, it will take on its default value. The consumer cannot distinguish between the two cases.
That's exactly the issue. We can't distinguish between unset count and 0 count. Using zero as not present / no limit is one approach to this, but then it means we can't distinguish between LIMIT 0
and LIMIT ALL
.
Why someone would want LIMIT 0
, I'm not sure, but it is a valid limit.
The default behavior for fields since that version is optional. It's probably been optional for the entire existence of Substrait but we haven't treated it that way. (Also one of is implemented over optional,.)
All we need to is update the documentation to say it is optional and then update all the usages to do a has check. Technically affecting everyone so probably still worth calling breaking.
The default behavior for fields since that version is optional. It's probably been optional for the entire existence of Substrait but we haven't treated it that way. (Also one of is implemented over optional,.)
Are you referring to this release note from 3.15?
Optional fields for proto3 are enabled by default, and no longer require the --experimental_allow_proto3_optional flag.
I interpret this as "you can now use the optional keyword without any special flag" and not "all fields are now marked optional".
This is reinforced by the fact that there is no has_count
function defined in the generated C++ code (note that common
, advanced_extension
, and input
are all non-scalar message fields):
The generated Java code doesn't have a method to check nullability either.
It does look that way. So messages default to optional and fields default to required. That's going to take some getting used to.
So do we just go the sentinel value approach? i.e. if the value is -1 it means all?
Why someone would want
LIMIT 0
, I'm not sure, but it is a valid limit.
This is apparently a valid case if you simply want to get columns (select * from table limit 0
), or simply want to check whether a query is valid or not [source]
There are arguably better ways to do it, but it is good enough that you don't need to get into database specifics.
The generated Java code doesn't have a method to check nullability either.
If the field was to be made optional, if memory serves me correctly, it should generate a hasCount
method?
If the field was to be made optional, if memory serves me correctly, it should generate a hasCount method?
Yes, but the trouble is that making a field optional is a backwards incompatible change.
So do we just go the sentinel value approach? i.e. if the value is -1 it means all?
Yes. I'm in favor of formally defining -1 as "no limit" and declaring this backwards compatible.
Technically affecting everyone so probably still worth calling breaking.
Previously valid plans will continue to work exactly as they had before. Therefore this is not a breaking change. It is a "minor change" (not a patch change) as it introduces new functionality but we don't currently distinguish between minor and patch.
PR to address this here using -1 as a sentinel value: https://github.com/substrait-io/substrait/pull/627
One question that I think is worth asking. If we were able to make a breaking change, would we still want to use the "-1 as sentinel strategy" or would we want to switch to using optional? Is this something for tracking as a change to make for 1.0?
Addressed in #627.
The current FetchRel https://github.com/substrait-io/substrait/blob/b663a4aee64e6237566a1b930549d90b909e322e/proto/substrait/algebra.proto#L190-L199 as describe in the spec does not allow us to represent queries like
which have an offset but no (or an unlimited) count.