voltrondata / spark-substrait-gateway

Implements a gateway that speaks the SparkConnect protocol and drives a backend using Substrait (over ADBC Flight SQL).
Apache License 2.0
15 stars 8 forks source link

Support "WindowFunction" with DuckDB backend #70

Open pthatte1-bb opened 1 month ago

pthatte1-bb commented 1 month ago

Spark queries with Window functions fail in the gateway with error - "window expression type not supported"

Snippet to recreate error:

get_customer_database(spark_session)
.withColumn(
    "rank",
    rank().over(Window.partitionBy(col("c_mktsegment")).orderBy(col("c_custkey"))))
.collect()

AFAIK, DuckDB DOES support Window functions. Snippet with runnable DuckDB-SQL version of above query -

SELECT 
    row_number() OVER (PARTITION BY c_mktsegment ORDER BY c_custkey), 
    * 
from 
    read_parquet('{parquet_path}')

And the Substrait spec also DOES support Window functions - https://substrait.io/expressions/window_functions/

But it looks like DuckDB-Substrait extension doesn't yet support substrait production for Window functions either. The get_substrait_json(<window-fn-query>) call using above query fails with error "duckdb.duckdb.InternalException: INTERNAL Error: WINDOW"

EpsilonPrime commented 2 weeks ago

gateway support for row_number has been added in #82. This works for Datafusion at the moment. Should automatically start working once DuckDB has window function support (and then we'll update the tests). Will leave this open until then to make sure.