tobymao / sqlglot

Python SQL Parser and Transpiler
https://sqlglot.com/
MIT License
6.38k stars 655 forks source link

Incorrect Parsing Between Spark SQL and DuckDB #3972

Closed dor-bernstein closed 2 weeks ago

dor-bernstein commented 2 weeks ago

There are a couple of mistakes when translating from Spark to DuckDB

Fully reproducible code snippet

from sqlglot import parse_one
 query = """SELECT SHA2(str, 256), ANY_VALUE(str2, true), FIRST(str3, true), LAST(str4, true), TRANSFORM(array_col)
FROM table 
GROUP BY group_col"""
 parse_one(query, dialect="spark").sql(dialect="duckdb")

The output is:

SELECT SHA2(str, 256), ANY_VALUE(str2 IGNORE NULLS), FIRST(str3 IGNORE NULLS), LAST(str4 IGNORE NULLS), TRANSFORM(array_col) FROM "table" GROUP BY group_col

Which fails for three different reasons:

  1. SHA2 is not a function in duckdb - the relevant function is sha256
  2. TRANSFORM should be LIST_TRANSFORM
  3. When using non-window function, duckdb doesn't support RESPECT/IGNORE NULLS. The default behavior is without NULLs The expected output Official Documentation
  4. https://duckdb.org/docs/sql/functions/utility.html - SHA256
  5. https://duckdb.org/docs/sql/functions/list.html - LIST_TRANSFORM
  6. https://duckdb.org/docs/sql/functions/aggregates.html - aggregate functions
VaggelisD commented 2 weeks ago

Hey @dor-bernstein,

Thanks for reporting these issues. Could you provide more context regarding TRANSFORM? From what I see, in both Spark 3 and DuckDB the signature is transform(expr, func):

spark-sql (default)> select transform(array(1, 2, 3), x -> x + 1); [2,3,4] Time taken: 3.092 seconds, Fetched 1 row(s)


- DuckDB:

D select list_transform([1, 2, 3]); Error: Binder Error: No function matches the given name and argument types 'list_transform(INTEGER[])'. You might need to add explicit type casts. Candidate functions: list_transform(ANY[], LAMBDA) -> ANY[]

LINE 1: select list_transform([1, 2, 3]); ^ D select list_transform([1, 2, 3], x -> x + 1); ┌──────────────────────────────────────────────────────────┐ │ list_transform(main.list_value(1, 2, 3), (x -> (x + 1))) │ │ int32[] │ ├──────────────────────────────────────────────────────────┤ │ [2, 3, 4] │ └──────────────────────────────────────────────────────────┘

dor-bernstein commented 2 weeks ago

@VaggelisD The signature for transforming lists is LIST_TRANSFORM (https://duckdb.org/docs/sql/functions/lambda), while parse_one returns just TRANSFORM.

Many Thanks!

VaggelisD commented 2 weeks ago

Sorry for the confusion, I meant that your example is showcasing an invalid use of Spark's TRANSFORM(expr, func) as it doesn't have a lambda defined; Is this a typo?

dor-bernstein commented 2 weeks ago

Yes, sorry. I'm using TRANSFORM(expr, func) and parse_one returns TRANFORM(expr, translated_func), while it should be LIST_TRANSFORM(expr, translated_func)