trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
https://trino.io
Apache License 2.0
10.14k stars 2.92k forks source link

Syntax error when referencing a column name from a table by identical name in ORDER BY #18726

Open lukaseder opened 1 year ago

lukaseder commented 1 year ago

I'm using trino 424 from here: https://hub.docker.com/r/trinodb/trino.

This query should work, in my opinion:

select a.a
from (values (1)) as a (a)
order by a.a

But it throws:

SQL Error [58]: Query failed (#20230818_085125_57701_286wh): line 3:10: Expression a is not of type ROW

Avoiding the qualification in ORDER BY works around the issue:

select a.a
from (values (1)) as a (a)
order by a

The problem also doesn't appear when table and column names are different:

select a.b
from (values (1)) as a (b)
order by a.b
findepi commented 1 year ago

The following query is valid per SQL specification

SELECT ... AS foo
FROM ...
ORDER BY foo

that's because ORDER BY should be analyzed in scope that "sees" SELECT expressions.

Thus, if table has column a and SELECT has a select item aliased to a, the ORDER BY refers to the latter.

This intentional, is per-spec behavior and tested e.g. here https://github.com/trinodb/trino/blob/b1362b1c968554fb36a2484108df2ed7df6ff626/testing/trino-testing/src/main/java/io/trino/testing/AbstractTestOrderByQueries.java#L74

the remaining part which i am less sure about is what should be the alias established for SELECT a.a when no explicit alias is provide (no AS a).

lukaseder commented 1 year ago

I guess this problem originates from syntax usage like this?

select a.a.a
from (select (cast(row(1, 2) as row(a int, b int)))) as a (a)

As far as I know, other ORDBMS (e.g. Oracle, PostgreSQL) don't allow for dereferencing ROW attributes like this, they'd require (a.a).a for syntactic disambiguation.

Most SQL implementations aren't ORDBMS, so they don't run into that edge case where a column expression and a row attribute reference expression create an ambiguity, so when coming from other dialects, this particular edge case seems very unexpected.

So, I'd say, if there's no formal ambiguity (an actual ROW type is involved), shouldn't Trino then do "the right thing" and resolve for the qualified column identifier, which is defined everywhere else in the query?

lukaseder commented 1 year ago

Here's a fun workaround, btw:

select (select a.a)
from (values (1)) as a (a)
order by a.a;

Because the correlated subquery produces an alias called _col0, the ambiguity doesn't appear, and a.a is now clearly resolved as I expected.

findepi commented 1 year ago

nice trick!

is it possible to assign explicit aliases directly in the original query?

lukaseder commented 1 year ago

is it possible to assign explicit aliases directly in the original query?

You probably mean this?

select a.a as b
from (values (1)) as a (a)
order by a.a

Yes, that works and produces:

|b  |
|---|
|1  |

So, indeed, as you suspected

the remaining part which i am less sure about is what should be the alias established for SELECT a.a when no explicit alias is provide (no AS a).

This implicit alias is what's causing the confusion here.

martint commented 1 year ago

This implicit alias is what's causing the confusion here.

Indeed, that's exactly what's happening. The SELECT a.a clause is equivalent to SELECT a.a AS a. Then the a.a in the ORDER BY binds to the name of the output column of the SELECT clause (i.e., a) as dictated by the SQL standard, so the dereference is invalid.

I need to look at the spec more closely regarding the rules around implicit aliases. But if I recall correctly, they are "implementation-dependent".

lukaseder commented 1 year ago

What's confusing me in particular is that I'm not aware of a single other RDBMS that makes SELECT a.a AS a hide the qualified identifier a.a for good. In most RDBMS, both a.a and a are now defined in ORDER BY. I can even define both, because after all, top level selects don't have this requirement of producing unambiguous identifiers.

All of these work:

Db2, Derby, DuckDB, H2, HSQLDB, Oracle, PostgreSQL, SQL Server:

SELECT a.a, a.a AS a
FROM (VALUES (2),(1)) AS a (a)
ORDER BY a.a

MySQL:

SELECT a.a, a.a AS a
FROM (VALUES ROW (2), ROW (1)) AS a (a)
ORDER BY a.a

MariaDB, SQLite:

SELECT a.a, a.a AS a
FROM (SELECT 2 AS a UNION SELECT 1 AS a) AS a
ORDER BY a.a

Firebird:

SELECT a.a, a.a AS a
FROM (SELECT 2 AS a FROM RDB$DATABASE UNION SELECT 1 AS a FROM RDB$DATABASE) AS a
ORDER BY a.a

Whereas TRINO (probably a related error that shouldn't be there, IMO):

SQL Error [62]: Query failed (#20230821_064924_00005_kn4k4): line 3:10: Column 'a' is ambiguous

I understand that there can be some ambiguity when there's a nested ROW expression present as I've illustrated here: https://github.com/trinodb/trino/issues/18726#issuecomment-1683820762. Again, I think that other RDBMS avoid the ambiguity by not allowing the ordinary path syntax for row attribute dereferencing (requiring parentheses, e.g. (a.a).a).

In any case, if there is no such ROW expression, then there's no ambiguity in my opinion.