sparkutils / quality

A Quality Spark DQ Library
https://sparkutils.github.io/quality/
Apache License 2.0

expressions using subqueries should be re-written #49

Closed chris-twiner closed 11 months ago

chris-twiner commented 1 year ago

Two very annoying gotchas with subqueries are driven by their re-write to "joins", which is done once over the whole dataset plan: if you have 20 exists, correlated or scalar subqueries you get 20 top-level joins in which all the fields are shared and aliases are ignored.

This means:

  1. each usage must be unique in all fields compared (they become top filters in the actual 'join' subquery) and
  2. the joining outer field must have a different name than any of those in the used subqueries

Solving 2. may be difficult but would be great. 1. could be solved in RuleLogicUtils by re-writing the plan to use unique names in a subquery, so that when the filter is constructed on analysis they are all distinct.

It should be possible to disable any rewrite via a comment / hint in the expression: /* DISABLE_QUALITY_SUBQUERY_REWRITE */
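The opt-out described above could be sketched roughly as follows. This is only an illustration in Python, not the library's code: `apply_rewrite` and its `rewrite` callback are hypothetical names, and it assumes the hint is spelled as a well-formed SQL block comment and is detected by a plain substring check.

```python
# Hypothetical sketch: skip a subquery rewrite when the expression carries
# the opt-out hint. Assumes the hint is a well-formed SQL block comment.
DISABLE_HINT = "/* DISABLE_QUALITY_SUBQUERY_REWRITE */"


def apply_rewrite(expression, rewrite):
    """Return the expression untouched if it contains the hint,
    otherwise return the result of the supplied rewrite function."""
    if DISABLE_HINT in expression:
        return expression
    return rewrite(expression)
```

A real implementation would presumably work on the parsed expression or plan rather than on raw text; the substring check is just the simplest way to show the intended behaviour.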

chris-twiner commented 11 months ago

The expression to be rewritten is:

(select first(struct('a', a, 'b', b)) from tablex x where x.id = outer.id)

the rewrite

(select first(struct('a', x.a, 'b', x.b)) from tablex x where x.id = outer.id)

does work, so this is closed; the original trigger for this issue was malformed.
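For illustration only, the qualification shown in the two expressions above can be reproduced with a small text-level sketch in Python. `qualify_struct_fields` is a hypothetical helper, not part of Quality, and it assumes the `struct(...)` arguments are simple comma-separated names or quoted literals with no nested parentheses.

```python
import re


def qualify_struct_fields(subquery, alias):
    """Prefix bare field names inside struct(...) with the given alias.

    Quoted literals and already-qualified names are left unchanged.
    """
    def fix_args(match):
        args = [arg.strip() for arg in match.group(1).split(",")]
        fixed = [
            arg if arg.startswith("'") or "." in arg else f"{alias}.{arg}"
            for arg in args
        ]
        return "struct(" + ", ".join(fixed) + ")"

    # Only matches struct(...) calls whose arguments contain no parentheses.
    return re.sub(r"struct\(([^()]*)\)", fix_args, subquery)


original = "(select first(struct('a', a, 'b', b)) from tablex x where x.id = outer.id)"
rewritten = qualify_struct_fields(original, "x")
```

Here `rewritten` has the same shape as the second expression above, and applying the helper a second time leaves it unchanged because the names are already qualified.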