substrait-io / substrait

A cross platform way to express data transformation, relational algebra, standardized record expression and plans.
https://substrait.io
Apache License 2.0
1.21k stars 160 forks source link

Precise definition of equality of grouping columns in the aggregate relation #700

Open ingomueller-net opened 2 months ago

ingomueller-net commented 2 months ago

This issue summarizes two threads in the Slack channel, which will be behind a paywall at some point.

The aggregate relation uses a definition of expression equality that affects its semantics, including its output schema. The current definition is ambiguous and based on protobuf message equality. The first is a problem because it makes the precise semantics of the relation ambiguous; the second is problematic because the protobuf definition should be derived from the spec and not the other way around.

This is the relevant part of the spec (emphasis mine):

It’s possible to specify multiple grouping sets in a single aggregate operation. The grouping sets behave more or less independently, with each returned record belonging to one of the grouping sets. The values for the grouping expression columns that are not part of the grouping set for a particular record will be set to null. Two grouping expressions will be returned using the same column if the protobuf messages describing the expressions are equal.

To elaborate the problem, let's consider an example. Let's say we have the following three grouping sets, where a and b are two arbitrary but different expressions:

(a, b)
(a)
(b)

Then the set of grouping expressions will be (a, b) and the output schema will have a column for each each of a and b.

The challenge arises during the deserialization of the protobuf messages. At that point, we do not have a and b yet, but four different messages:

(m1, m2)
(m3)
(m4)

In order to get to the representation above, we need to establish that m1 == m3, m2 == m4, and !(m1 == m2). For different definitions of ==, we may either come to the interpretation above, i.e., that the grouping set expressions is (m1, m2), or any other subset of (m1, m2, m3, m4), most of which have different semantics and, thus, break the common understanding between systems that don't agree on ==.

The current formulation of the spec allows for at least the following interpretations:

Note that expressions may contain sub-queries so pretty much all of the spec is affected and we have to come up with a definition that works unambiguously for all of that.

For reference, #92/#100 introduced the support for arbitrary grouping expressions instead of just field references. This simplified producers because it allowed them to store the grouping expressions directly in the aggregate relation rather than having to emit a project relation with the expressions and refer to those. The spec was then updated in #258/#260 to reflect the grouping expressions and introduced the current definition of equality.

ingomueller-net commented 2 months ago

One solution is to come up with a more precise definition of expressions equality. That may be the best trade-off eventually.

However, I have the feeling that that is relatively brittle and can be difficult to implement (at least for consumers, see below). I thus proposed an alternative design to the aggregate relation on Slack that I think we should consider. The idea is to avoid repeating equal expressions altogether and express the grouping sets as references into a list of grouping expressions.

The relation would then look something like this:

message AggregateRel {
  // ...

  // A list of one or more grouping expression sets that the aggregation measures should be calculated for.
  // Required if there are no measures.
  repeated Grouping groupings = 3;

  // The grouping expressions that the grouping sets refer to.
  repeated Expression grouping_expressions = 5;   // <--- this is new

  // ...

  message Grouping {
    // References into `AggregateRel.grouping_expressions`.
    repeated int32 grouping_expressions = 1;  // <--- this is now a reference instead of an `Expression`
  }

  // ...
}

This avoids the definition of equality and, as a minor added benefit, compresses the plan in case of multiple grouping sets by avoiding redundant copies of grouping expressions.

However, there are also a couple of downsides:

jacques-n commented 2 months ago

It may be cumbersome for systems to implement this representation.

This was probably from a comment I made on Slack. I made the comment largely because most implementers start with a single grouping set (the most common case). Additionally, a lot of newer implementers might not even be aware of the concept of grouping sets. Having them normalize expressions in this way would be surprising for them. I don't think that is a showstopper but I thought it worth noting.

jacques-n commented 2 months ago

I agree we need to be more specific on this. I don't think we need to have the most generous definition of equality. So to concretely answer your questions:

They are considered equal if their field values are recursively equal. This is what [MessageDifferencer]...

Are the advanced_extensions fields of relations relevant for equality? yes. must be equal. Are deprecated fields part of equality if the newer fields are present? yes. must be equal Are type variations relevant for equality? yes. must be equal. Are function references equal if they have different anchors but the two referred to extensions are equivalent? no. must be same anchor.

Maybe this "simple" definition of equality may be enough for the next several years?...

ingomueller-net commented 2 months ago

However, there are also a couple of downsides: [...]

  • It may be cumbersome for systems to implement this representation. Personally, I am not really convinced of this argument yet. [...]

One argument that came up on Slack (which seems to go into the direction of @jacques-n's comment) is that Substraits high-level philosophy is to "allow for dumb consumers" (see this slide deck). In the context of the current issue, one could argue that sparing consumers to deduplicate expressions would allow to make them simpler, and that, if we have to decide whether producers or consumers need to do that, we'd rather give the producers this responsibility.

ingomueller-net commented 2 months ago

I agree we need to be more specific on this. I don't think we need to have the most generous definition of equality. So to concretely answer your questions: [...]

Are the advanced_extensions fields of relations relevant for equality? yes. must be equal.

This may be more complex than it appears on a first look. A first (solvable) issue is the fact that part of AdvancedExtension is explicitly optional. I think it would be surprising that a performance optimization that can otherwise be ignored influences the number of output column of a relation.

message AdvancedExtension {
  // An optimization is helpful information that don't influence semantics. May
  // be ignored by a consumer.
  repeated google.protobuf.Any optimization = 1;
  // ...
}

The more complex issue appears here and in quite a number of other places I hadn't previously thought of: how do we compare two protobuf.Any messages? These are not part of the spec so it seems like we have to resort to one variant or the other of "message equality" -- with all the ambiguities mentioned above.

I looked a bit into MessageDifferencer handles this case. I believe it can't do what we are looking for. It is somehow able to compare messages of unknown message types but it either (1) considers all unknown fields different (if in EQUIVALENT mode), in which case even two byte-wise equal messages would be considered different, or (2) (if not using Equivalent, considers two messages different if one uses the implicit default value and the other one sets the same value explicitly, which we almost certainly also don't want.

I hate to be a pain in the neck here but I hope it is in everybody's interest to bring up this issue now rather than running into incompatibilities later on, when the spec may be widely adopted and considered final.

ingomueller-net commented 2 months ago

I thought about this a bit more. The gravity of this problem is somewhat lower than I thought because, in many cases where Protobuf.Any messages are used, the consumer cannot understand the plan anyways. Those include.

However, there are a few instances where it would actually be nice if consumers not understanding the Any messages could just move them around as blackboxes. Those include:

TL;DR: We can probably classify each of these into "can be ignored" and "cannot be ignored" but I maintain my argument that this makes consumers more complex than necessary.

jacques-n commented 2 months ago

"allow for dumb consumers"...

Yeah, good point.

One of the painful things here is that I don't see any clean way to support backwards compatibility if we want to move to the new representation. We might as well just introduce AggregatelRel2 or something as repeated fields don't work well with oneofs. (It begs the question of whether moving forward we might want to always wrap repeated fields in a message to avoid this kind of compatibility pain.) I think a lot of systems have implemented AggregatelRel already so I don't think we can do a hard breaking changes here like we might be able to do on edge cases that are rarely implemented.

EpsilonPrime commented 2 months ago

We could repurpose groupings to be the ordered list of grouping expressions that we refer to and add a GroupingDetail? message that (if present) is the new grouping information with indexes to the expressions. It would be ugly to require that the groupings only contain one expression though.

jacques-n commented 2 months ago

We could repurpose groupings to be the ordered list of grouping expressions that we refer to and add a GroupingDetail? message that (if present) is the new grouping information with indexes to the expressions. It would be ugly to require that the groupings only contain one expression though.

It'd be awkward since groupings is a two dimensional array of expressions and we need a one dimensional array.

ingomueller-net commented 2 months ago

There seems to be an general openess to following my proposal -- if there are still arguments against it please let me know. I thus went ahead and made a concrete proposal: #706. I think it avoids the problem that @EpsilonPrime and @jacques-n have been discussing.