oap-project / gazelle_plugin

Native SQL Engine plugin for Spark SQL with vectorized SIMD optimizations.
Apache License 2.0
256 stars 77 forks source link

Incorrect execution result caused by join operation #1168

Open ziyangRen opened 1 year ago

ziyangRen commented 1 year ago

Describe the bug When running the following sql, the number of result data will increase. When you perform a full join on a table with 10,000 rows of data and two identical tables, the result is 13,200 entries instead of 10,000 entries.In addition, if inner join is used, the amount of data will be reduced. If left join is used, coredump error will be generated.We think that the condition for this error to recur is to use the max aggregate function on the string field before SortMergeJoin. This is because the above error is not repeated when aggregating non string fields or using sum and other aggregation functions for strings.

To Reproduce Here's the sql: select t1.value as value1, t2.value as value2, t1.data1a as data1a from( select value, MAX(data1) as data1a from gy_orc.test_smj3 t1 group by value) t1 FULL JOIN (select value, MAX(data1) as data1b from gy_orc.test_smj4 t2 group by value) t2 on t1.value=t2.value Notes:The data type of data1 is string

ziyangRen commented 1 year ago

image The execution result is as above. Rows containing null values are extra data

ziyangRen commented 1 year ago

@zhouyuan We have found the cause of this problem: we have configured spark. oap. sql. columnar. hashagg. support String=true, when aggregating String type fields, when aggregating String type fields, it is converted to the ColumnarHashAggregate operator, which results in the deletion of Sort subtree, which results in incorrect SortMergeJoin results. How can we correctly insert a Sort operator before SortMergeJoin? image

zhouyuan commented 1 year ago

Hi @ziyangRen,

Thanks for the detailed and clear log, this is a bug on the "hashagg for string" - the impl does not fit for "sortagg + SMJ" case. A quick fix is to do not allow to use hashagg in "sortagg + SMJ" case, however in this way gazelle would need to fallback to Vanilla Spark as Gazelle does not have "SortAgg" impl. I'll generate a quick patch for you.

Thanks, -yuan