trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
https://trino.io
Apache License 2.0
10.47k stars 3.01k forks source link

Optimize distinct aggregation on multi column #613

Closed kaka11chen closed 3 months ago

kaka11chen commented 5 years ago

Reference https://github.com/prestodb/presto/issues/12024

When query use distinct aggregation on multi columns.

select count(distinct ss_item_sk), count(distinct ss_store_sk) from tpcds_bin_partitioned_orc_1000.store_sales; Result: It is very slow, cost 60 seconds in our perf-test env, regardless of use-mark-distinct.

If I change it to

select count(case when grouping_id=1 and ss_item_sk is not null then 1 else null end) as c0, count(case when grouping_id=2 and ss_store_sk is not null then 1 else null end) as c1 from (select grouping(ss_item_sk,ss_store_sk) AS grouping_id, ss_item_sk, ss_store_sk from tpcds_bin_partitioned_orc_1000.store_sales group by grouping sets (ss_item_sk, ss_store_sk)) Result: It only cost 20 seconds in our perf-test env.

I have read source code of presto, and found a rule optimization class SingleDistinctAggregationToGroupBy to handle distinct aggregation on single column case, but I didn't find the rule handle the case about multi columns.

There are similar things on Hive and Spark, such similar optimization has been implemented on these platforms. https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveExpandDistinctAggregatesRule.java https://issues.apache.org/jira/browse/HIVE-10901 https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala

findepi commented 5 years ago

@kaka11chen, Thanks for creating the issue! For the record, i understand you're going to migrate your previous PR in this area (https://github.com/prestodb/presto/pull/12183) to prestosql.

amoghmargoor commented 1 year ago

Hi @kaka11chen, do you plan to work on this PR in near future: https://github.com/trinodb/trino/pull/624 ?

raunaqmorarka commented 3 months ago

@lukasz-stec has this been addressed by recent distinct aggregation improvements ?

lukasz-stec commented 3 months ago

@lukasz-stec has this been addressed by recent distinct aggregation improvements ?

Yes, I believe https://github.com/trinodb/trino/pull/21907 (rework and extension of https://github.com/trinodb/trino/pull/624) addresses this.