Optimize distinct aggregation on multi column

kaka11chen commented 5 years ago

Reference https://github.com/prestodb/presto/issues/12024

When query use distinct aggregation on multi columns.

select count(distinct ss_item_sk), count(distinct ss_store_sk) from tpcds_bin_partitioned_orc_1000.store_sales; Result: It is very slow, cost 60 seconds in our perf-test env, regardless of use-mark-distinct.

If I change it to

select count(case when grouping_id=1 and ss_item_sk is not null then 1 else null end) as c0, count(case when grouping_id=2 and ss_store_sk is not null then 1 else null end) as c1 from (select grouping(ss_item_sk,ss_store_sk) AS grouping_id, ss_item_sk, ss_store_sk from tpcds_bin_partitioned_orc_1000.store_sales group by grouping sets (ss_item_sk, ss_store_sk)) Result: It only cost 20 seconds in our perf-test env.

I have read source code of presto, and found a rule optimization class SingleDistinctAggregationToGroupBy to handle distinct aggregation on single column case, but I didn't find the rule handle the case about multi columns.

findepi commented 5 years ago

@kaka11chen, Thanks for creating the issue! For the record, i understand you're going to migrate your previous PR in this area (https://github.com/prestodb/presto/pull/12183) to prestosql.

amoghmargoor commented 1 year ago

Hi @kaka11chen, do you plan to work on this PR in near future: https://github.com/trinodb/trino/pull/624 ?

raunaqmorarka commented 3 months ago

@lukasz-stec has this been addressed by recent distinct aggregation improvements ?

lukasz-stec commented 3 months ago

@lukasz-stec has this been addressed by recent distinct aggregation improvements ?

Yes, I believe https://github.com/trinodb/trino/pull/21907 (rework and extension of https://github.com/trinodb/trino/pull/624) addresses this.

trinodb / trino

Optimize distinct aggregation on multi column #613