prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0
16.04k stars 5.37k forks source link

Join on expression without columns produces wrong results #7520

Closed maciejgrzybek closed 7 years ago

maciejgrzybek commented 7 years ago
select count(*) > 0 from customer join part on (cast ('a' as char(1)) = cast ('a' as char(2)));

returns true despite the fact

select cast('a' as char(1)) = cast('a' as char(2));

returns false.

This happens because:

  1. Analyzer picks the equality operator is picked in ExpressionAnalyzer (to be $operator$EQUAL(varchar(2),varchar(2)):boolean), registering necessary coercions (cast ('a' as char(1)) -> VARCHAR(2), cast ('a' as char(2)) -> VARCHAR(2))
  2. expressionOptimizer(canonicalized, metadata, session, analyzer.getExpressionTypes()).optimize(NoOpSymbolResolver.INSTANCE) fires aforementioned equal operator to compare two char fields (stored as Slice Java type).
  3. Given CHAR does not store pad-spaces, both Slices contain the exact same value - one char 'a' only. expressionOptimizer notes that, therefore
    join on (cast ('a' as char(1)) = cast ('a' as char(2))

    becomes rewritten into:

    join on true

That is because since expressionOptimize works on plan, without taking into consideration already existing coercions.

Similar situation would occur for other coercible types which share common Java type e.g. short decimal and real:

select (real '1.2' = cast(1.2 as decimal(2,1)))

returns true but:

select count(*) > 0 from customer join part on (real '1.2' = cast(1.2 as decimal(2,1)))

returns false. Mechanism is the same: equal operator is picked to be $operator$EQUAL(real,real):boolean but rewrites further on (in Analyzer) look only at the plan, not considering coercions, therefore Real.equal(long, long) is called with one long staying for Real type while another representing Short Decimal (also represented as long Java type underneath).

maciejgrzybek commented 7 years ago

I believe we could rewrite the plan to include CAST nodes whenever we decide to coerce. This would be consistent with the visitors we have in the code. I was not convinced by Analyzer rewriting the plan but it turns out we do it anyway, e.g. StatementAnalyzer.visitJoin does it as an optimization:

sopel39 commented 7 years ago

@fiedukow I think you stumbled at this issue in the past.

@maciejgrzybek We had a discussion with @martint about whether rewrites in analyzer are fine, see: https://github.com/prestodb/presto/issues/6883. Analyzer should provide proper feedback to the user. If you rewrite the nodes during analysis, then for instance error messages might not be valid anymore (e.g: point to non-exiting AST nodes). Additionally, Analysis is based on AST nodes identity and rewritten expressions won't map in Analysis. For complex ORDER BY expressions rewrites in analyzer didn't work (because of coercions and invalid error messages referencing to wrong nodes). This is tricky subject at least. @martint might provide more input.

Maybe optimization code should be removed from statementAnalyzer#visitJoin (there is the only call of this signature of com.facebook.presto.sql.planner.ExpressionInterpreter#expressionOptimizer) and moved to a separate optimizer after initial planning? The coercions should be present by then.

maciejgrzybek commented 7 years ago

The argument for user-feedback is not quite correct with existing optimization (rewrite expressions into 0 = 0 or 0 = 1).

I will move the optimization to later phase (planner or optimizer) and post a PR to discuss more around the code, if needed.

sopel39 commented 7 years ago

The argument for user-feedback is not quite correct with existing optimization (rewrite expressions into 0 = 0 or 0 = 1).

Yeah, that rewrite probably shouldn't be in StatementAnalyzer too

martint commented 7 years ago

The analyzer's job is to validate the query and annotate the AST with any information that the planner is going to need to go from AST->IR (AST is untyped. IR should, ideally, be fully typed, with all coercions resolved). We still do some rewrites in the analyzer due to legacy reasons, but those should move to proper optimizers.

maciejgrzybek commented 7 years ago

7529 should to the trick.