pingcap / tispark

TiSpark is built for running Apache Spark on top of TiDB/TiKV
Apache License 2.0

Eliminate redundant aggregation function pushed down #170

Open: Novemser opened this issue 6 years ago

Novemser commented 6 years ago
scala> spark.sql("select count(tp_int),avg(tp_int) from full_data_type_table").explain
17/12/28 14:17:13 INFO SparkSqlParser: Parsing command: select count(tp_int),avg(tp_int) from full_data_type_table
== Physical Plan ==
*HashAggregate(keys=[], functions=[sum(count(tp_int#100L)#386L), sum(sum(tp_int#100L)#387L), sum(count(tp_int#100L)#386L)])
+- Exchange SinglePartition
   +- *HashAggregate(keys=[], functions=[partial_sum(count(tp_int#100L)#386L), partial_sum(sum(tp_int#100L)#387L), partial_sum(count(tp_int#100L)#386L)])
      +- TiDB CoprocessorRDD{[table: full_data_type_table] , Ranges: Start:[-9223372036854775808], End: [9223372036854775807], Columns: [tp_int], Aggregates: Count([tp_int]), Sum([tp_int]), Count([tp_int])}

Actually, Aggregates: Count([tp_int]), Sum([tp_int]), Count([tp_int]) could be optimized to Aggregates: Count([tp_int]), Sum([tp_int]). Spark rewrites avg(tp_int) into sum(tp_int) and count(tp_int), so the count it derives duplicates the explicit count(tp_int) in the query, and the coprocessor ends up computing the same Count aggregate twice; see the sketch below.
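One possible fix is to deduplicate semantically equal aggregate expressions before building the coprocessor request, keeping an index mapping so each original consumer (here both the explicit count and the count derived from avg) still reads the right result column. A minimal Scala sketch of that idea; PushedAggregate and dedupAggregates are hypothetical names for illustration, not TiSpark's actual planner API:

// Hypothetical sketch: collapse duplicate pushed-down aggregates and
// remember, for each original aggregate position, which deduplicated
// result column it should read from.
case class PushedAggregate(func: String, column: String)

def dedupAggregates(aggs: Seq[PushedAggregate]): (Seq[PushedAggregate], Map[Int, Int]) = {
  val seen = scala.collection.mutable.LinkedHashMap.empty[PushedAggregate, Int]
  val mapping = aggs.zipWithIndex.map { case (agg, i) =>
    // First occurrence gets a fresh index; duplicates reuse the existing one.
    i -> seen.getOrElseUpdate(agg, seen.size)
  }.toMap
  (seen.keys.toSeq, mapping)
}

// The plan above pushes down Count, Sum, Count on tp_int; after
// deduplication only Count and Sum remain, and both Count consumers
// map to result column 0.
val aggs = Seq(
  PushedAggregate("Count", "tp_int"),
  PushedAggregate("Sum", "tp_int"),
  PushedAggregate("Count", "tp_int"))
val (deduped, mapping) = dedupAggregates(aggs)
// deduped: Seq(PushedAggregate(Count,tp_int), PushedAggregate(Sum,tp_int))
// mapping: Map(0 -> 0, 1 -> 1, 2 -> 0)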

ilovesoup commented 6 years ago

How does this happen? Didn't we have a PR that optimized this?

ilovesoup commented 6 years ago

Seems this is still a problem. Leaving it open for now.