Open wesm opened 9 years ago
GROUPING SETS
is a useful tool for data scientists and engineers. It allows computing aggregations for different combinations of grouping keys, all in the same query.
PostgreSQL has great docs on the feature: https://www.postgresql.org/docs/current/queries-table-expressions.html#QUERIES-GROUPING-SETS.
I'm not entirely sure what the API should look like, but here are a few requirements and considerations for anyone who wants to work on this:
group_by
method. It is both legitimate and useful to combine grouping sets
, cube
, rollup
, and group by
in a single query.
For example (from the above postgres docs):GROUP BY a, CUBE (b, c), GROUPING SETS ((d), (e))
which expands to
GROUP BY GROUPING SETS (
(a, b, c, d), (a, b, c, e),
(a, b, d), (a, b, e),
(a, c, d), (a, c, e),
(a, d), (a, e)
)
In my mind this rules out a method that exists only on GroupedTableExpr
(in an effort to prevent people from accessing the API without calling group_by
) because
t.group_by(...).cube()
doesn't make a whole lot of sense unless we allow group_by
to take zero arguments, itself a change that I think will lead to unmaintainable hacks.
There are two viable approaches I see.
group_by
grows the ability to accept special (new) objects ibis.grouping_sets(sequence_of_tuples)
, ibis.cube(tuple)
, ibis.rollup(tuple)
. The above postgres example would be written as:import ibis
t = ibis.table([
("a", "string"),
("b", "string"),
("c", "string"),
("d", "string"),
("money", "float64"),
])
t.group_by([
"a",
ibis.cube(("b", "c")),
ibis.grouping_sets([("d",), ("e",)])
])
I prefer this approach because it requires the least number of invasive changes and I believe will not require any hacks.
TableExpr
grows some new methods (grouping_sets
, cube
, rollup
).This is pretty tantalizing, and superficially seems to compose well until you start trying to ... compose these methods:
# using t from the previous example, seems nice but see below
t.group_by("a").cube(("b", "c")).grouping_sets([("d",), ("e",)])
# equivalent except for column ordering
t.cube(("b", "c")).group_by("a").grouping_sets([("d",), ("e",)])
# t.cube would need to have a single object that tracks all the grouping set "things"
# to make sure it doesn't generate `GROUP BY` multiple times.
Since cube
et al are limited to grouping in SQL, we should not introduce a new API that allows grouping set-like things to be use in an apparently-outside-of-group-by-but-not-actually context.
Reopening this, as we'll probably want it for full TPC-DS support (#9447).
Definite nice to have - was just looking for a way to do this today. :D
This could also capture generic
GROUPING SETS
idioms, not justROLLUP
.