pingcap / tispark

TiSpark is built for running Apache Spark on top of TiDB/TiKV
Apache License 2.0
880 stars 244 forks source link

[BUG] An exception occurred while hive table join tidb table #2670

Closed shanhai3000 closed 1 year ago

shanhai3000 commented 1 year ago

Describe the bug

When executing a query of a hive partition table (big one) inner join a tidb table(small one), the hive partition table is auto broadcasted, which leads an error. The query is somelike

select hive_table.col1,tidb_table.col2 from hive_table inner join tidb_table on hive_table.col2=tidb_table.col3 where ...

== Physical Plan == == Physical Plan == AdaptiveSparkPlan isFinalPlan=false +- Project [... 109 more fields] +- Generate HiveGenericUDTF#udf.json.JsonExtractValueUDTF(xxx), [, ... 101 more fields], false, [...] +- Project [, ... 102 more fields] +- BroadcastHashJoin [xxx#94], [xxxx#475], Inner, BuildRight, false :- TiKV CoprocessorRDD{[table: xxx] TableReader, Columns: xxxx(): { TableRangeScan: { RangeFilter: [], Range: [([t\200\000\000\000\000\000\004\253_r\200\000\000\000\000\000\000\000], [t\200\000\000\000\000\000\004\253_s\000\000\000\000\000\000\000\000])([t\200\000\000\000\000\000\004\253_r\000\000\000\000\000\000\000\000], [t\200\000\000\000\000\000\004\253_r\200\000\000\000\000\000\000\000])] } }, startTs: 440854942292115639} EstimatedCount:20837 +- BroadcastExchange HashedRelationBroadcastMode(List(input[107, string, false]),false), [plan_id=32] +- Filter isnotnull(xxx#475) +- Scan hive xx.xxxxxxxx [, ... 100 more fields], HiveTableRelation [xx.xxx, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, Data Cols: [..., Partition Cols: [#520, #521, #522, #523], Pruned Partitions: [(, , , )]], [isnotnull(), (), (xx = xx)]

Here I got some log info maybe helpful. The plan.stats.sizeInBytes of the LogicalPlan of the hive table is too small and the plan.stats.sizeInBytes of LogicalPlan of the tidb table is too big. The stats of the LogicalPlans of the two seems reversed.

Spark and TiSpark version info Spark 3.2.3 TiSpark 3.1.2(with a profile of spark-3.2) Additional context