kwai / blaze

Blazing-fast query execution engine speaks Apache Spark language and has Arrow-DataFusion at its core.
Apache License 2.0
1.3k stars 121 forks source link

Consider inequality joins #576

Open richox opened 2 months ago

richox commented 2 months ago

Is your feature request related to a problem? Please describe. currently we do not support inequality joins (hash join and sort-merge join). it is hard to implement such feature because datafusion has no direct supports to row-based evaluation.

Describe the solution you'd like Describe alternatives you've considered

  1. simulate row-based evaluation with one-row columnar evaluation, which has super low performance in practice. in cases where the equality pred has filtered away most records, this method may work. but if the equality pred takes no effects (like tpcds q72). the query will hang.
  2. supports limited row-based filter in datafusion. currently datafusion already has some supports like make_comparator to build a row-based evaluator. we can extend it to support more row-based evaluations, like make_binary_op etc.
  3. fallback the post-filter evaluation to spark, and use codegen to speedup the evaluation. but we also have to consider the fallback overheads.

Additional context