prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0
16.06k stars 5.38k forks source link

Incorrect results produced by mixed distinct aggregations #8894

Open raghavsethi opened 7 years ago

raghavsethi commented 7 years ago

COUNT(*) with COUNT(DISTINCT x) produces NULL for COUNT(*) when all input data is filtered out.

presto:tiny> SELECT COUNT(DISTINCT linenumber), COUNT(*) from lineitem where linenumber < 0;
 _col0 | _col1
-------+-------
     0 | NULL
(1 row)
kokosing commented 7 years ago

I have just stared Presto (21953a4) in Intellij on default settings (presto-main/etc) and it works for me. I tried multiple times with different schemas and it worked all the time. Can you please share more details, because I am not able to reproduce it:

presto:tiny> select count(distinct linenumber), count(*) from lineitem where linenumber < 0;
 _col0 | _col1 
-------+-------
     0 |     0 
(1 row)

Query 20170831_045938_00007_u4g6v, FINISHED, 1 node
Splits: 53 total, 53 done (100,00%)
0:00 [60,2K rows, 0B] [395K rows/s, 0B/s]

presto:tiny> explain select count(distinct linenumber), count(*) from lineitem where linenumber < 0;
                                                                                                       Query Plan                                                            
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 - Output[_col0, _col1] => [count:bigint, count_1:bigint]                                                                                                                    
         _col0 := count                                                                                                                                                      
         _col1 := count_1                                                                                                                                                    
     - Aggregate(FINAL) => [count_1:bigint, count:bigint]                                                                                                                    
             count_1 := "count"("count_10")                                                                                                                                  
             count := "count"("count_9")                                                                                                                                     
         - LocalExchange[SINGLE] () => count_10:bigint, count_9:bigint                                                                                                       
             - RemoteExchange[GATHER] => count_10:bigint, count_9:bigint                                                                                                     
                 - Aggregate(PARTIAL) => [count_10:bigint, count_9:bigint]                                                                                                   
                         count_10 := "count"(*)                                                                                                                              
                         count_9 := "count"(DISTINCT "linenumber") (mask = linenumber$distinct)                                                                              
                     - MarkDistinct[distinct=linenumber:integer marker=linenumber$distinct] => [linenumber:integer, linenumber$distinct:boolean]                             
                         - Project[] => [linenumber:integer]                                                                                                                 
                                 Cost: {rows: 30087, bytes: ?}                                                                                                               
                             - LocalExchange[HASH][$hashvalue] ("linenumber") => linenumber:integer, $hashvalue:bigint                                                       
                                     Cost: {rows: 30087, bytes: ?}                                                                                                           
                                 - RemoteExchange[REPARTITION][$hashvalue_11] => linenumber:integer, $hashvalue_11:bigint                                                    
                                         Cost: {rows: 30087, bytes: ?}                                                                                                       
                                     - ScanFilterProject[table = tpch:tpch:lineitem:sf0.01, originalConstraint = ("linenumber" < 0), filterPredicate = ("linenumber" < 0)] =>
                                             Cost: {rows: 60175, bytes: ?}/{rows: 30087, bytes: ?}/{rows: 30087, bytes: ?}                                                   
                                             $hashvalue_12 := "combine_hash"(BIGINT '0', COALESCE("$operator$hash_code"("linenumber"), 0))                                   
                                             linenumber := tpch:linenumber                                                                                                   

(1 row)

Query 20170831_045951_00008_u4g6v, FINISHED, 1 node
Splits: 1 total, 1 done (100,00%)
0:09 [0 rows, 0B] [0 rows/s, 0B/s]
raghavsethi commented 7 years ago

You need to set optimize mixed distinct aggregations to true. On Wed, Aug 30, 2017 at 10:05 PM Grzegorz Kokosiński < notifications@github.com> wrote:

I have just stared Presto (21953a4 https://github.com/prestodb/presto/commit/21953a4c8436d918481af443d88e9321189112dd) in Intellij on default settings (presto-main/etc) and it works for me. I tried multiple times with different schemas and it worked all the time. Can you please share more details, because I am not able to reproduce it:

presto:tiny> select count(distinct linenumber), count(*) from lineitem where linenumber < 0; _col0 | _col1 -------+------- 0 | 0 (1 row)

Query 20170831_045938_00007_u4g6v, FINISHED, 1 node Splits: 53 total, 53 done (100,00%) 0:00 [60,2K rows, 0B] [395K rows/s, 0B/s]

presto:tiny> explain select count(distinct linenumber), count(*) from lineitem where linenumber < 0; Query Plan

  • Output[_col0, _col1] => [count:bigint, count_1:bigint] _col0 := count _col1 := count_1
    • Aggregate(FINAL) => [count_1:bigint, count:bigint] count_1 := "count"("count_10") count := "count"("count_9")
      • LocalExchange[SINGLE] () => count_10:bigint, count_9:bigint
        • RemoteExchange[GATHER] => count_10:bigint, count_9:bigint
          • Aggregate(PARTIAL) => [count_10:bigint, count_9:bigint] count_10 := "count"(*) count_9 := "count"(DISTINCT "linenumber") (mask = linenumber$distinct)
            • MarkDistinct[distinct=linenumber:integer marker=linenumber$distinct] => [linenumber:integer, linenumber$distinct:boolean]
              • Project[] => [linenumber:integer] Cost: {rows: 30087, bytes: ?}
                • LocalExchange[HASH][$hashvalue] ("linenumber") => linenumber:integer, $hashvalue:bigint Cost: {rows: 30087, bytes: ?}
                  • RemoteExchange[REPARTITION][$hashvalue_11] => linenumber:integer, $hashvalue_11:bigint Cost: {rows: 30087, bytes: ?}
                    • ScanFilterProject[table = tpch:tpch:lineitem:sf0.01, originalConstraint = ("linenumber" < 0), filterPredicate = ("linenumber" < 0)] => Cost: {rows: 60175, bytes: ?}/{rows: 30087, bytes: ?}/{rows: 30087, bytes: ?} $hashvalue_12 := "combine_hash"(BIGINT '0', COALESCE("$operator$hash_code"("linenumber"), 0)) linenumber := tpch:linenumber

(1 row)

Query 20170831_045951_00008_u4g6v, FINISHED, 1 node Splits: 1 total, 1 done (100,00%) 0:09 [0 rows, 0B] [0 rows/s, 0B/s]

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/prestodb/presto/issues/8894#issuecomment-326190714, or mute the thread https://github.com/notifications/unsubscribe-auth/AArsDELsHDbUdFQfPMp7DOaF-DUy4iJJks5sdj8bgaJpZM4PIFQX .