trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
https://trino.io
Apache License 2.0
10.37k stars 2.98k forks source link

Filter out grouping sets basing on grouping() constraints before execution #23389

Open thermo911 opened 1 month ago

thermo911 commented 1 month ago

Problem

Trino performs an aggregation for each grouping set even if particular grouping sets can be removed basing on predicates on grouping(...).

Example

Query

SELECT a, b, sum(c) FROM t GROUP BY CUBE (a, b) HAVING grouping(a, b) != 3

is effectively a

SELECT a, b, sum(c) FROM t GROUP BY GROUPING SETS ((a), (b), (a, b))

(global aggregation is removed).

Currently, Trino produces following plan for such query. Aggregation is performed for each grouping set from GroupIdNode.

Output[columnNames = [a, b, _col2]]
│   Layout: [a$gid:bigint, b$gid:bigint, sum:bigint]
│   a := a$gid
│   b := b$gid
│   _col2 := sum
└─ FilterProject[filterPredicate = ("$literal$"(from_base64('CQAAAElOVF9BUlJBWQQAAAAAAwAAAAEAAAACAAAAAAAAAA=='))[(groupid + bigint '1')] <> 3)]
   │   Layout: [a$gid:bigint, b$gid:bigint, sum:bigint]
   └─ Aggregate[type = FINAL, keys = [a$gid, b$gid, groupid]]
      │   Layout: [a$gid:bigint, b$gid:bigint, groupid:bigint, sum:bigint]
      │   sum := sum(sum_0)
      └─ LocalExchange[partitioning = HASH, arguments = [a$gid, b$gid, groupid]]
         │   Layout: [a$gid:bigint, b$gid:bigint, groupid:bigint, sum_0:row(bigint, bigint)]
         └─ RemoteExchange[type = REPARTITION]
            │   Layout: [a$gid:bigint, b$gid:bigint, groupid:bigint, sum_0:row(bigint, bigint)]
            └─ Aggregate[type = PARTIAL, keys = [a$gid, b$gid, groupid]]
               │   Layout: [a$gid:bigint, b$gid:bigint, groupid:bigint, sum_0:row(bigint, bigint)]
               │   sum_0 := sum(c)
               └─ GroupId[symbols = [[], [a], [b], [a, b]]]
                  │   Layout: [a$gid:bigint, b$gid:bigint, c:bigint, groupid:bigint]
                  │   b$gid := b
                  │   a$gid := a
                  └─ TableScan[table = iceberg:default.test$data@2456677682933822434]
                         Layout: [a:bigint, b:bigint, c:bigint]
                         a := 1:a:bigint
                         b := 2:b:bigint
                         c := 3:c:bigint

Expected behavior

In example above Trino figures out what grouping sets are actually used and removes other ones from the query plan.

wendigo commented 1 month ago

@raunaqmorarka can you take a look? :)