Open ingomueller-net opened 2 months ago
One solution is to come up with a more precise definition of expressions equality. That may be the best trade-off eventually.
However, I have the feeling that that is relatively brittle and can be difficult to implement (at least for consumers, see below). I thus proposed an alternative design to the aggregate relation on Slack that I think we should consider. The idea is to avoid repeating equal expressions altogether and express the grouping sets as references into a list of grouping expressions.
The relation would then look something like this:
message AggregateRel {
// ...
// A list of one or more grouping expression sets that the aggregation measures should be calculated for.
// Required if there are no measures.
repeated Grouping groupings = 3;
// The grouping expressions that the grouping sets refer to.
repeated Expression grouping_expressions = 5; // <--- this is new
// ...
message Grouping {
// References into `AggregateRel.grouping_expressions`.
repeated int32 grouping_expressions = 1; // <--- this is now a reference instead of an `Expression`
}
// ...
}
This avoids the definition of equality and, as a minor added benefit, compresses the plan in case of multiple grouping sets by avoiding redundant copies of grouping expressions.
However, there are also a couple of downsides:
It may be cumbersome for systems to implement this representation.
This was probably from a comment I made on Slack. I made the comment largely because most implementers start with a single grouping set (the most common case). Additionally, a lot of newer implementers might not even be aware of the concept of grouping sets. Having them normalize expressions in this way would be surprising for them. I don't think that is a showstopper but I thought it worth noting.
I agree we need to be more specific on this. I don't think we need to have the most generous definition of equality. So to concretely answer your questions:
They are considered equal if their field values are recursively equal. This is what [MessageDifferencer]...
Are the advanced_extensions fields of relations relevant for equality? yes. must be equal. Are deprecated fields part of equality if the newer fields are present? yes. must be equal Are type variations relevant for equality? yes. must be equal. Are function references equal if they have different anchors but the two referred to extensions are equivalent? no. must be same anchor.
Maybe this "simple" definition of equality may be enough for the next several years?...
However, there are also a couple of downsides: [...]
- It may be cumbersome for systems to implement this representation. Personally, I am not really convinced of this argument yet. [...]
One argument that came up on Slack (which seems to go into the direction of @jacques-n's comment) is that Substraits high-level philosophy is to "allow for dumb consumers" (see this slide deck). In the context of the current issue, one could argue that sparing consumers to deduplicate expressions would allow to make them simpler, and that, if we have to decide whether producers or consumers need to do that, we'd rather give the producers this responsibility.
I agree we need to be more specific on this. I don't think we need to have the most generous definition of equality. So to concretely answer your questions: [...]
Are the advanced_extensions fields of relations relevant for equality? yes. must be equal.
This may be more complex than it appears on a first look. A first (solvable) issue is the fact that part of AdvancedExtension
is explicitly optional. I think it would be surprising that a performance optimization that can otherwise be ignored influences the number of output column of a relation.
message AdvancedExtension {
// An optimization is helpful information that don't influence semantics. May
// be ignored by a consumer.
repeated google.protobuf.Any optimization = 1;
// ...
}
The more complex issue appears here and in quite a number of other places I hadn't previously thought of: how do we compare two protobuf.Any
messages? These are not part of the spec so it seems like we have to resort to one variant or the other of "message equality" -- with all the ambiguities mentioned above.
I looked a bit into MessageDifferencer
handles this case. I believe it can't do what we are looking for. It is somehow able to compare messages of unknown message types but it either (1) considers all unknown fields different (if in EQUIVALENT
mode), in which case even two byte-wise equal messages would be considered different, or (2) (if not using Equivalent
, considers two messages different if one uses the implicit default value and the other one sets the same value explicitly, which we almost certainly also don't want.
I hate to be a pain in the neck here but I hope it is in everybody's interest to bring up this issue now rather than running into incompatibilities later on, when the spec may be widely adopted and considered final.
I thought about this a bit more. The gravity of this problem is somewhat lower than I thought because, in many cases where Protobuf.Any
messages are used, the consumer cannot understand the plan anyways. Those include.
ReadRel.ExtensionTable.detail
ReadRel.LocalFiles.FileOrFiles.file_format.extension
ExtensionSingleRel.detail
ExtensionLeafRel.detail
ExtensionMultiRel.detail
ExtensionObject.detail
Literal.UserDefined.value
However, there are a few instances where it would actually be nice if consumers not understanding the Any
messages could just move them around as blackboxes. Those include:
ExchangeRel.ExchangeTarget.target_type.extended
AdvancedExtension.optimization
TL;DR: We can probably classify each of these into "can be ignored" and "cannot be ignored" but I maintain my argument that this makes consumers more complex than necessary.
"allow for dumb consumers"...
Yeah, good point.
One of the painful things here is that I don't see any clean way to support backwards compatibility if we want to move to the new representation. We might as well just introduce AggregatelRel2 or something as repeated fields don't work well with oneofs. (It begs the question of whether moving forward we might want to always wrap repeated fields in a message to avoid this kind of compatibility pain.) I think a lot of systems have implemented AggregatelRel already so I don't think we can do a hard breaking changes here like we might be able to do on edge cases that are rarely implemented.
We could repurpose groupings to be the ordered list of grouping expressions that we refer to and add a GroupingDetail? message that (if present) is the new grouping information with indexes to the expressions. It would be ugly to require that the groupings only contain one expression though.
We could repurpose groupings to be the ordered list of grouping expressions that we refer to and add a GroupingDetail? message that (if present) is the new grouping information with indexes to the expressions. It would be ugly to require that the groupings only contain one expression though.
It'd be awkward since groupings is a two dimensional array of expressions and we need a one dimensional array.
There seems to be an general openess to following my proposal -- if there are still arguments against it please let me know. I thus went ahead and made a concrete proposal: #706. I think it avoids the problem that @EpsilonPrime and @jacques-n have been discussing.
This issue summarizes two threads in the Slack channel, which will be behind a paywall at some point.
The aggregate relation uses a definition of expression equality that affects its semantics, including its output schema. The current definition is ambiguous and based on protobuf message equality. The first is a problem because it makes the precise semantics of the relation ambiguous; the second is problematic because the protobuf definition should be derived from the spec and not the other way around.
This is the relevant part of the spec (emphasis mine):
To elaborate the problem, let's consider an example. Let's say we have the following three grouping sets, where
a
andb
are two arbitrary but different expressions:Then the set of grouping expressions will be
(a, b)
and the output schema will have a column for each each ofa
andb
.The challenge arises during the deserialization of the protobuf messages. At that point, we do not have
a
andb
yet, but four different messages:In order to get to the representation above, we need to establish that
m1 == m3
,m2 == m4
, and !(m1 == m2)
. For different definitions of==
, we may either come to the interpretation above, i.e., that the grouping set expressions is(m1, m2)
, or any other subset of(m1, m2, m3, m4)
, most of which have different semantics and, thus, break the common understanding between systems that don't agree on==
.The current formulation of the spec allows for at least the following interpretations:
MessageDifferencer
implements. There are still multiple questions with this approach:Note that expressions may contain sub-queries so pretty much all of the spec is affected and we have to come up with a definition that works unambiguously for all of that.
For reference, #92/#100 introduced the support for arbitrary grouping expressions instead of just field references. This simplified producers because it allowed them to store the grouping expressions directly in the aggregate relation rather than having to emit a project relation with the expressions and refer to those. The spec was then updated in #258/#260 to reflect the grouping expressions and introduced the current definition of equality.