tobymao / sqlglot

Python SQL Parser and Transpiler
https://sqlglot.com/
MIT License
6.69k stars 701 forks source link

Optimizer Misinterprets Union Identifiers as Columns in DuckDB Queries #4122

Closed rustyconover closed 1 month ago

rustyconover commented 1 month ago

When testing the creation of a union with DuckDB, it appears that sqlglot mistakenly treats the union identifier as a column rather than recognizing it as a general identifier when the statement is qualified in the optimizer.

DuckDB uses identifiers to tag unions, as described in official documentation

Here is how to reproduce the exception:

import sqlglot
import sqlglot.optimizer

statement = sqlglot.parse_one("select union_value(k1 := 1)", dialect="duckdb")
sqlglot.optimizer.optimize(statement).sql(dialect='duckdb')

In this case, k1 is simply an identifier—just like a key in a struct. It would be helpful if it could be handled in a case-insensitive manner, similar to other struct keys.

The exception raised is as follows:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/rusty/Development/flight-cloud/.venv/lib/python3.12/site-packages/sqlglot/optimizer/optimizer.py", line 92, in optimize
    optimized = rule(optimized, **rule_kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/rusty/Development/flight-cloud/.venv/lib/python3.12/site-packages/sqlglot/optimizer/qualify.py", line 99, in qualify
    validate_qualify_columns_func(expression)
  File "/Users/rusty/Development/flight-cloud/.venv/lib/python3.12/site-packages/sqlglot/optimizer/qualify_columns.py", line 106, in validate_qualify_columns
    raise OptimizeError(f"Column '{column}' could not be resolved{for_table}")
sqlglot.errors.OptimizeError: Column 'k1' could not be resolved

In contrast, DuckDB handles this case correctly, as shown below:

v1.1.0 fa5c2fe15f
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D select union_value(k1 := 1);
┌──────────────────────┐
│ union_value(k1 := 1) │
│  union(k1 integer)   │
├──────────────────────┤
│ 1                    │
└──────────────────────┘

Thank you for looking into this.

This behavior in DuckDB demonstrates that identifiers like k1 should be recognized as keys in a struct rather than treated as columns. Handling this more gracefully in sqlglot would align with DuckDB’s behavior.

georgesittas commented 1 month ago

@tobymao I wonder if the left-hand side of := should always be an Identifier instead of a Column. Thoughts? We could do this, or just fix union_value for DuckDB.

tobymao commented 1 month ago

it should be a var