When executing a query of a hive partition table (big one) inner join a tidb table(small one), the hive partition table is auto broadcasted, which leads an error.
The query is somelike
select hive_table.col1,tidb_table.col2 from hive_table inner join tidb_table on hive_table.col2=tidb_table.col3 where ...
Here I got some log info maybe helpful.
The plan.stats.sizeInBytes of the LogicalPlan of the hive table is too small and the plan.stats.sizeInBytes of LogicalPlan of the tidb table is too big.
The stats of the LogicalPlans of the two seems reversed.
Spark and TiSpark version info
Spark 3.2.3
TiSpark 3.1.2(with a profile of spark-3.2)
Additional context
Describe the bug
When executing a query of a hive partition table (big one) inner join a tidb table(small one), the hive partition table is auto broadcasted, which leads an error. The query is somelike
== Physical Plan == == Physical Plan == AdaptiveSparkPlan isFinalPlan=false +- Project [... 109 more fields] +- Generate HiveGenericUDTF#udf.json.JsonExtractValueUDTF(xxx), [, ... 101 more fields], false, [...] +- Project [, ... 102 more fields] +- BroadcastHashJoin [xxx#94], [xxxx#475], Inner, BuildRight, false :- TiKV CoprocessorRDD{[table: xxx] TableReader, Columns: xxxx(): { TableRangeScan: { RangeFilter: [], Range: [([t\200\000\000\000\000\000\004\253_r\200\000\000\000\000\000\000\000], [t\200\000\000\000\000\000\004\253_s\000\000\000\000\000\000\000\000])([t\200\000\000\000\000\000\004\253_r\000\000\000\000\000\000\000\000], [t\200\000\000\000\000\000\004\253_r\200\000\000\000\000\000\000\000])] } }, startTs: 440854942292115639} EstimatedCount:20837 +- BroadcastExchange HashedRelationBroadcastMode(List(input[107, string, false]),false), [plan_id=32] +- Filter isnotnull(xxx#475) +- Scan hive xx.xxxxxxxx [, ... 100 more fields], HiveTableRelation [
xx
.xxx
, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, Data Cols: [..., Partition Cols: [#520, #521, #522, #523], Pruned Partitions: [(, , , )]], [isnotnull(), (), (xx = xx)]Here I got some log info maybe helpful. The
plan.stats.sizeInBytes
of the LogicalPlan of the hive table is too small and theplan.stats.sizeInBytes
of LogicalPlan of the tidb table is too big. The stats of the LogicalPlans of the two seems reversed.Spark and TiSpark version info Spark 3.2.3 TiSpark 3.1.2(with a profile of spark-3.2) Additional context