ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0
5.16k stars 591 forks source link

feat: grouping sets functionality #550

Open wesm opened 9 years ago

cpcloud commented 2 years ago

This could also capture generic GROUPING SETS idioms, not just ROLLUP.

cpcloud commented 2 years ago

GROUPING SETS is a useful tool for data scientists and engineers. It allows computing aggregations for different combinations of grouping keys, all in the same query.

PostgreSQL has great docs on the feature: https://www.postgresql.org/docs/current/queries-table-expressions.html#QUERIES-GROUPING-SETS.

I'm not entirely sure what the API should look like, but here are a few requirements and considerations for anyone who wants to work on this:

GROUP BY a, CUBE (b, c), GROUPING SETS ((d), (e))

which expands to

GROUP BY GROUPING SETS (
    (a, b, c, d), (a, b, c, e),
    (a, b, d),    (a, b, e),
    (a, c, d),    (a, c, e),
    (a, d),       (a, e)
)

In my mind this rules out a method that exists only on GroupedTableExpr (in an effort to prevent people from accessing the API without calling group_by) because

t.group_by(...).cube()

doesn't make a whole lot of sense unless we allow group_by to take zero arguments, itself a change that I think will lead to unmaintainable hacks.

There are two viable approaches I see.

  1. group_by grows the ability to accept special (new) objects ibis.grouping_sets(sequence_of_tuples), ibis.cube(tuple), ibis.rollup(tuple). The above postgres example would be written as:
import ibis

t = ibis.table([
    ("a", "string"),
    ("b", "string"),
    ("c", "string"),
    ("d", "string"),
    ("money", "float64"),
])

t.group_by([
    "a",
    ibis.cube(("b", "c")),
    ibis.grouping_sets([("d",), ("e",)])
])

I prefer this approach because it requires the least number of invasive changes and I believe will not require any hacks.

  1. TableExpr grows some new methods (grouping_sets, cube, rollup).

This is pretty tantalizing, and superficially seems to compose well until you start trying to ... compose these methods:

# using t from the previous example, seems nice but see below
t.group_by("a").cube(("b", "c")).grouping_sets([("d",), ("e",)])

# equivalent except for column ordering
t.cube(("b", "c")).group_by("a").grouping_sets([("d",), ("e",)])
# t.cube would need to have a single object that tracks all the grouping set "things"
# to make sure it doesn't generate `GROUP BY` multiple times.

Since cube et al are limited to grouping in SQL, we should not introduce a new API that allows grouping set-like things to be use in an apparently-outside-of-group-by-but-not-actually context.

jcrist commented 3 months ago

Reopening this, as we'll probably want it for full TPC-DS support (#9447).

landonAcuityMD commented 3 months ago

Definite nice to have - was just looking for a way to do this today. :D