substrait-io / substrait

A cross platform way to express data transformation, relational algebra, standardized record expression and plans.
https://substrait.io
Apache License 2.0
1.14k stars 148 forks source link

feat: add custom equality behavior to the hash/merge join #585

Closed westonpace closed 7 months ago

westonpace commented 7 months ago

This PR (hopefully) concludes various discussions around flags such as null_equals_null (Datafusion) and null_aware (Velox). The goal of these flags is to slightly tweak the definition of "equality" in an equijoin relation.

This PR introduces a new EquiJoinKey message that can be used by physical join relations to define how keys should be compared.

These custom equality functions are needed in a variety of scenarios:

Optimizing set operations

Set operations (e.g. set difference) can sometimes be satisfied by an equi-join. When this happens the user typically wants the equality comparison to be "is not distinct from"

Flattening correlated subqueries

Some kinds of correlated subqueries can be removed during optimization and replaced with an anti-join. Depending on the original query ("not in" vs "where not exists") there may be slightly different behaviors with respect to null an we may want to use "might equals" as the comparison.

String collations

Collations define the ordering and equality of a column. Different columns can have different collations. The equi-join must use the comparison function defined by the collation.

vibhatha commented 7 months ago

@westonpace this would be a breaking change for existing APIs?

westonpace commented 7 months ago

@westonpace this would be a breaking change for existing APIs?

It would not be. If a plan contains the old left_keys / right_keys field then the "equal" equality method should be used. I believe this is documented.

jacques-n commented 7 months ago

I'm happy with this so +1 for me. @EpsilonPrime items feel like minor changes so I'm fine with or without them before merge.

westonpace commented 7 months ago

For the questions I had about the meaning of consistency, it could be worth adding the PR as well as it wasn't immediately clear what consistent meant.

I'll open a follow-up PR.