prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0
15.79k stars 5.3k forks source link

reorder join rule does't supply the best join order for some tpc-ds sql tests #17843

Open borderlayout opened 2 years ago

borderlayout commented 2 years ago

I run some tpc-ds sqls and find the reorderJoins rule does not supply the best join order. env: presto version 0.273, it has default configuraion and config.properies adds these :

JOIN opt

optimizer.join-reordering-strategy=AUTOMATIC join-distribution-type=AUTOMATIC Test tables are on Hive (tpcds17 and tpcds25). Before run tests, all tables have statistical information(for CBO).

For example, tpcds 17: SELECT i_item_id , i_item_desc , s_state , count(ss_quantity) store_sales_quantitycount , avg(ss_quantity) store_sales_quantityave , stddev_samp(ss_quantity) store_sales_quantitystdev , (stddev_samp(ss_quantity) / avg(ss_quantity)) store_sales_quantitycov , count(sr_return_quantity) store_returns_quantitycount , avg(sr_return_quantity) store_returns_quantityave , stddev_samp(sr_return_quantity) store_returns_quantitystdev , (stddev_samp(sr_return_quantity) / avg(sr_return_quantity)) store_returns_quantitycov , count(cs_quantity) catalog_sales_quantitycount , avg(cs_quantity) catalog_sales_quantityave , stddev_samp(cs_quantity) catalog_sales_quantitystdev , (stddev_samp(cs_quantity) / avg(cs_quantity)) catalog_sales_quantitycov FROM store_sales , store_returns , catalog_sales , date_dim d1 , date_dim d2 , date_dim d3 , store , item WHERE (d1.d_quarter_name = '2001Q1') AND (d1.d_date_sk = ss_sold_date_sk) AND (i_item_sk = ss_item_sk) AND (s_store_sk = ss_store_sk) AND (ss_customer_sk = sr_customer_sk) AND (ss_item_sk = sr_item_sk) AND (ss_ticket_number = sr_ticket_number) AND (sr_returned_date_sk = d2.d_date_sk) AND (d2.d_quarter_name IN ('2001Q1', '2001Q2', '2001Q3')) AND (sr_customer_sk = cs_bill_customer_sk) AND (sr_item_sk = cs_item_sk) AND (cs_sold_date_sk = d3.d_date_sk) AND (d3.d_quarter_name IN ('2001Q1', '2001Q2', '2001Q3')) GROUP BY i_item_id, i_item_desc, s_state ORDER BY i_item_id ASC, i_item_desc ASC, s_state ASC LIMIT 100;

The store_sales, store_returns and catalog_sales are all fact tables, As we known, the row size of fact table is larger than dimension table. But presto runs the sql as follow: the Live Plan: image The store_sales join store_returns firstly. So it causes the input rows and input data are very large. image Is it because my Presto is lack of configuration or is it a bug? Thank you.

borderlayout commented 2 years ago

I also run this test on trino 383, the same sql and data. The live plan is image

image

borderlayout commented 2 years ago

presto analyze: presto.txt

trino analzye: trino.txt

borderlayout commented 2 years ago

https://github.com/prestodb/presto/pull/17849 there is a pull request for this,but after merged this pr,hive connector still has this issue。 trino384_hive_plan.txt presto273_hive_plan.txt

rohanpednekar commented 2 years ago

CC @rschlussel @fgwang7w

aaneja commented 4 months ago

@borderlayout Can you retest ? I think this has been fixed for a while. E.g see Q17 plan for the (upartitioned) TPCDS schema - https://github.com/prestodb/presto/blob/f12e3d1cd05852fdb723c2a9f0537d013cf1bcbf/presto-benchto-benchmarks/src/test/resources/sql/presto/tpcds/q17.plan.txt#L12-L24