substrait-io / substrait

A cross platform way to express data transformation, relational algebra, standardized record expression and plans.
https://substrait.io
Apache License 2.0
1.11k stars 147 forks source link

Cannot represent queries containing offsets but not limits #622

Closed vbarua closed 1 month ago

vbarua commented 3 months ago

The current FetchRel https://github.com/substrait-io/substrait/blob/b663a4aee64e6237566a1b930549d90b909e322e/proto/substrait/algebra.proto#L190-L199 as describe in the spec does not allow us to represent queries like

SELECT * FROM foo LIMIT ALL OFFSET 10;
SELECT * FROM bar OFFSET 50;

which have an offset but no (or an unlimited) count.

EpsilonPrime commented 3 months ago

If the count is not specified (i.e. omitted in the wire proto) then the semantics you are looking for are available. Is the problem that most consumers are performing a getCount() (or equivalent) instead of checking if the field exists first?

Consumers should also support count without offset which should require them to use the protos in the same way. Is this documented differently in the markdown?

EpsilonPrime commented 3 months ago

The related documentation link is here. I prefer to keep the required/optional semantics in the file (i.e. close the definition) to make this easier to keep track of for this reason. I still fondly remember proto2 which allowed one to specify optional and required directly on the fields for this reason.

EpsilonPrime commented 3 months ago

So the documentation is the problem. While we're looking at issues with the documentation is there any reason that count has to be positive? Zero seems quite reasonable from a syntactical perspective (although not generally useful in an optimized plan).

westonpace commented 3 months ago

Is the problem that most consumers are performing a getCount() (or equivalent) instead of checking if the field exists first?

I don't think optional works that way for scalar fields:

Message fields can be one of the following:

    optional: An optional field is in one of two possible states:
        the field is set, and contains a value that was explicitly set or parsed from the wire. It will be serialized to the wire.
        the field is unset, and will return the default value. It will not be serialized to the wire.`

In other words, if the field is not set, it will take on its default value. The consumer cannot distinguish between the two cases.

In protobuf (at least v3 but I'm pretty sure v2 as well) there is no way to distinguish between a 0 and a non-emitted value. See this:

# NOTE: As of proto3, HasField() only works for message fields, not for
#       singular (non-message) fields. First try to use HasField and
#       if it fails (with a ValueError) we manually consult the fields.

Then finally, this is also in the docs:

Note that for scalar message fields, once a message is parsed there’s no way of telling whether a field was explicitly set to the default value (for example whether a boolean was set to false) or just not set at all

However, we could choose to define count=0 as "do not use count" (maybe this is what you were suggesting).

EpsilonPrime commented 3 months ago

The behavior for scalar fields changed in v3.15.0 to add back optional as it was in proto2:

https://github.com/protocolbuffers/protobuf/releases/tag/v3.15.0

If we don't want to require that version (from 2021) we could utilize the "zero is not present approach".

westonpace commented 3 months ago

Interesting. Looks like it uses a oneof under the hood so I don't think this is a backwards compatible change. Still, pretty cool it can be done.

vbarua commented 3 months ago

In other words, if the field is not set, it will take on its default value. The consumer cannot distinguish between the two cases.

That's exactly the issue. We can't distinguish between unset count and 0 count. Using zero as not present / no limit is one approach to this, but then it means we can't distinguish between LIMIT 0 and LIMIT ALL.

Why someone would want LIMIT 0, I'm not sure, but it is a valid limit.

EpsilonPrime commented 3 months ago

The default behavior for fields since that version is optional. It's probably been optional for the entire existence of Substrait but we haven't treated it that way. (Also one of is implemented over optional,.)

All we need to is update the documentation to say it is optional and then update all the usages to do a has check. Technically affecting everyone so probably still worth calling breaking.

westonpace commented 3 months ago

The default behavior for fields since that version is optional. It's probably been optional for the entire existence of Substrait but we haven't treated it that way. (Also one of is implemented over optional,.)

Are you referring to this release note from 3.15?

Optional fields for proto3 are enabled by default, and no longer require the --experimental_allow_proto3_optional flag.

I interpret this as "you can now use the optional keyword without any special flag" and not "all fields are now marked optional".

This is reinforced by the fact that there is no has_count function defined in the generated C++ code (note that common, advanced_extension, and input are all non-scalar message fields):

image

vbarua commented 3 months ago

The generated Java code doesn't have a method to check nullability either.

Screenshot 2024-04-11 at 10 25 13
EpsilonPrime commented 3 months ago

It does look that way. So messages default to optional and fields default to required. That's going to take some getting used to.

So do we just go the sentinel value approach? i.e. if the value is -1 it means all?

bvolpato commented 3 months ago

Why someone would want LIMIT 0, I'm not sure, but it is a valid limit.

This is apparently a valid case if you simply want to get columns (select * from table limit 0), or simply want to check whether a query is valid or not [source]

There are arguably better ways to do it, but it is good enough that you don't need to get into database specifics.

The generated Java code doesn't have a method to check nullability either.

If the field was to be made optional, if memory serves me correctly, it should generate a hasCount method?

westonpace commented 3 months ago

If the field was to be made optional, if memory serves me correctly, it should generate a hasCount method?

Yes, but the trouble is that making a field optional is a backwards incompatible change.

So do we just go the sentinel value approach? i.e. if the value is -1 it means all?

Yes. I'm in favor of formally defining -1 as "no limit" and declaring this backwards compatible.

Technically affecting everyone so probably still worth calling breaking.

Previously valid plans will continue to work exactly as they had before. Therefore this is not a breaking change. It is a "minor change" (not a patch change) as it introduces new functionality but we don't currently distinguish between minor and patch.

vbarua commented 3 months ago

PR to address this here using -1 as a sentinel value: https://github.com/substrait-io/substrait/pull/627

One question that I think is worth asking. If we were able to make a breaking change, would we still want to use the "-1 as sentinel strategy" or would we want to switch to using optional? Is this something for tracking as a change to make for 1.0?

EpsilonPrime commented 1 month ago

Addressed in #627.