substrait-io / substrait

A cross platform way to express data transformation, relational algebra, standardized record expression and plans.
https://substrait.io
Apache License 2.0

Order of output columns of the `Aggregate` operation with multiple grouping sets. #693

Closed ingomueller-net closed 3 weeks ago

ingomueller-net commented 3 weeks ago

I believe that the Aggregate operation is underspecified. In particular, in the case with multiple grouping sets the specification does not seem to say in which order the output columns should be. The relevant part currently reads as follows:

It’s possible to specify multiple grouping sets in a single aggregate operation. The grouping sets behave more or less independently, with each returned record belonging to one of the grouping sets. The values for the grouping expression columns that are not part of the grouping set for a particular record will be set to null. Two grouping expressions will be returned using the same column if the protobuf messages describing the expressions are equal. The columns for grouping expressions that do not appear in all grouping sets will be nullable (regardless of the nullability of the type returned by the grouping expression) to accommodate the null insertion.

As an example, the grouping sets could be:

(a)
(b)
(b, a)

Then the possible output column orders would be (a, b) or (b, a), and I think the specification should say which one.
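
To make the nullability rule from the quoted paragraph concrete, here is a minimal Python sketch for these grouping sets. It is only an illustration: plain strings stand in for the grouping expressions (the specification compares protobuf messages for equality), and the helper name `nullable_grouping_columns` is made up for this example.

```python
def nullable_grouping_columns(grouping_sets):
    """Return the grouping expressions that are missing from at least one grouping set.

    Assumption: expressions are modelled as strings, so string equality stands in
    for the protobuf message equality the specification refers to.
    """
    # Collect the distinct expressions across all grouping sets.
    distinct = []
    for grouping_set in grouping_sets:
        for expr in grouping_set:
            if expr not in distinct:
                distinct.append(expr)

    # An expression needs a nullable column if some grouping set omits it,
    # because records of that grouping set get a null in its place.
    return [
        expr
        for expr in distinct
        if any(expr not in grouping_set for grouping_set in grouping_sets)
    ]


grouping_sets = [("a",), ("b",), ("b", "a")]
nullable = nullable_grouping_columns(grouping_sets)
print(nullable)
```

Under these assumptions both `a` and `b` come out as nullable, since each is missing from one of the three grouping sets.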

EpsilonPrime commented 3 weeks ago

The direct output order section says:

The list of distinct columns from each grouping set (ordered by their first appearance) followed by the list of measures in declaration order, followed by an i32 describing the associated particular grouping set the value is derived from (if applicable).
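
Applied to the grouping sets from the example above, that rule can be sketched in Python as follows. This is an illustration rather than code from any Substrait implementation: strings stand in for expressions, the measure name `sum_x`, the column label `grouping_set_index`, and the function name are hypothetical, and "if applicable" is read here as "when there is more than one grouping set", which is an assumption.

```python
def direct_output_order(grouping_sets, measures):
    """Sketch of the direct output order quoted above.

    Assumptions: expressions are strings, and the trailing grouping set
    index column is emitted only when there are multiple grouping sets.
    """
    columns = []
    # Distinct grouping columns, ordered by their first appearance.
    for grouping_set in grouping_sets:
        for expr in grouping_set:
            if expr not in columns:
                columns.append(expr)

    # Measures follow in declaration order.
    columns.extend(measures)

    # Trailing i32 identifying the grouping set each record belongs to.
    if len(grouping_sets) > 1:
        columns.append("grouping_set_index")
    return columns


order = direct_output_order([("a",), ("b",), ("b", "a")], ["sum_x"])
print(order)
```

With this reading, `a` precedes `b` because `(a)` is the first grouping set, so the order is (a, b) and not (b, a).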

ingomueller-net commented 3 weeks ago

Oh, indeed, it does. Thanks for pointing me to that part! 🙈