Closed kaka11chen closed 3 months ago
@kaka11chen, Thanks for creating the issue! For the record, i understand you're going to migrate your previous PR in this area (https://github.com/prestodb/presto/pull/12183) to prestosql.
Hi @kaka11chen, do you plan to work on this PR in near future: https://github.com/trinodb/trino/pull/624 ?
@lukasz-stec has this been addressed by recent distinct aggregation improvements ?
@lukasz-stec has this been addressed by recent distinct aggregation improvements ?
Yes, I believe https://github.com/trinodb/trino/pull/21907 (rework and extension of https://github.com/trinodb/trino/pull/624) addresses this.
Reference https://github.com/prestodb/presto/issues/12024
When query use distinct aggregation on multi columns.
select count(distinct ss_item_sk), count(distinct ss_store_sk) from tpcds_bin_partitioned_orc_1000.store_sales; Result: It is very slow, cost 60 seconds in our perf-test env, regardless of use-mark-distinct.
If I change it to
select count(case when grouping_id=1 and ss_item_sk is not null then 1 else null end) as c0, count(case when grouping_id=2 and ss_store_sk is not null then 1 else null end) as c1 from (select grouping(ss_item_sk,ss_store_sk) AS grouping_id, ss_item_sk, ss_store_sk from tpcds_bin_partitioned_orc_1000.store_sales group by grouping sets (ss_item_sk, ss_store_sk)) Result: It only cost 20 seconds in our perf-test env.
I have read source code of presto, and found a rule optimization class SingleDistinctAggregationToGroupBy to handle distinct aggregation on single column case, but I didn't find the rule handle the case about multi columns.
There are similar things on Hive and Spark, such similar optimization has been implemented on these platforms. https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveExpandDistinctAggregatesRule.java https://issues.apache.org/jira/browse/HIVE-10901 https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala