voltrondata / spark-substrait-gateway

Implements a gateway that speaks the SparkConnect protocol and drives a backend using Substrait (over ADBC Flight SQL).
Apache License 2.0

Support Spark's "struct" function #67

Open pthatte1-bb opened 1 month ago

pthatte1-bb commented 1 month ago

Calls to the Spark SQL function def struct(cols: Column*): Column fail with the error message: "Exception iterating responses: Function struct not found in the Spark to Substrait mapping table."

EpsilonPrime commented 1 month ago

Engines that support Substrait have only weak support for structured data types. We will need to improve the backend support before we can make further progress here. In addition to supporting the type itself, there are myriad operations (including dictionary-style access to a struct's fields using square brackets) that should also be implemented.

pat70 commented 1 month ago

The linked PR uses ExtensionFunctions to map unresolved_function when a backend supports a specific function (in this case - the struct function).
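
A minimal sketch of that fallback idea, in plain Python rather than the gateway's actual code: a static table resolves the known functions, and an otherwise unresolved function is mapped to an extension function only when the selected backend advertises it. Every name here (SPARK_TO_SUBSTRAIT, BACKEND_EXTENSION_FUNCTIONS, resolve_function, and the placeholder reference strings) is hypothetical and is not taken from the gateway or the linked PR.

```python
# Hypothetical static table: Spark function name -> Substrait function reference.
# The reference strings are placeholders, not real Substrait URIs.
SPARK_TO_SUBSTRAIT = {
    "min": "substrait:min",
    "max": "substrait:max",
}

# Hypothetical per-backend table of extra functions a backend can evaluate.
BACKEND_EXTENSION_FUNCTIONS = {
    "duckdb": {"struct": "extension:struct"},
}


def resolve_function(name: str, backend: str) -> str:
    """Resolve a Spark function name, falling back to backend extensions."""
    if name in SPARK_TO_SUBSTRAIT:
        return SPARK_TO_SUBSTRAIT[name]
    extensions = BACKEND_EXTENSION_FUNCTIONS.get(backend, {})
    if name in extensions:
        return extensions[name]
    # Same wording as the error reported in this issue.
    raise LookupError(
        f"Function {name} not found in the Spark to Substrait mapping table."
    )
```

With these tables, resolve_function("struct", "duckdb") resolves through the extension table, while the same call for a backend with no struct entry raises the lookup error.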

I've set the PR to draft for now and I'm trying to grok the comment about "supporting the type itself". Does this refer to Compound Types on this page: https://substrait.io/types/type_classes/#compound-types ?

EpsilonPrime commented 1 month ago

The struct type is defined (and documented in the compound types section). Implementing struct() on its own is fine -- it will need to be done eventually. The problem is what comes after you have the struct, as we don't have anything that operates on structs.

pthatte1-bb commented 1 month ago

Re: what comes after you have the struct

The requested functionality unblocks some struct usages in our existing code for ColumnGroup-style handling. This snippet shows an oversimplified example of what is requested, and it runs locally with the changes from the linked draft PR:

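# Assumed imports; note that this min shadows Python's builtin min.
from pyspark.sql.functions import col, min, struct

# get_customer_database is not defined in this thread; it is assumed to return a DataFrame.
(
    get_customer_database(spark_session)
    .select(struct(col('c_custkey'), col('c_name')).alias('test_struct'))
    .agg(min(col('test_struct').getField('c_custkey')))
    .show()
)

EpsilonPrime commented 1 month ago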

DuckDB did add more struct support to Substrait this week, but I believe we need nested expressions to handle this properly. It turns out the substrait-validator doesn't have support for nested expressions either, so support will need to be added to both the validator and DuckDB. I've filed a request for DuckDB to look into nested expressions.
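
To illustrate what a nested expression looks like, here is a rough sketch that builds a struct from two column references and then selects a field back out of it, as plain Python dicts. The key names (nested, struct, fields, selection, directReference, structField, rootReference, expression) follow my reading of the Substrait protobuf definitions in their JSON form and should be checked against the spec. The helper names are hypothetical, and the column positions 0 and 1 are assumptions about the input relation.

```python
import json


def field_ref(index: int) -> dict:
    """Reference column `index` of the input relation."""
    return {
        "selection": {
            "directReference": {"structField": {"field": index}},
            "rootReference": {},
        }
    }


def nested_struct(*fields: dict) -> dict:
    """Build a struct-valued expression from child expressions."""
    return {"nested": {"struct": {"fields": list(fields)}}}


def get_field(struct_expr: dict, index: int) -> dict:
    """Select field `index` from a struct-valued expression."""
    return {
        "selection": {
            "directReference": {"structField": {"field": index}},
            "expression": struct_expr,
        }
    }


# struct(c_custkey, c_name), assuming those are columns 0 and 1 of the input.
test_struct = nested_struct(field_ref(0), field_ref(1))
# getField('c_custkey') becomes a positional lookup of field 0 in the struct.
custkey = get_field(test_struct, 0)
print(json.dumps(custkey, indent=2))
```

The point of the sketch is that the struct is an expression whose children are themselves expressions, which is why the producer, the validator, and the consuming engine all need to understand nesting.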

EpsilonPrime commented 1 month ago

Nested support was added to DuckDB's Substrait implementation today. It will be in their next release. So now just the validator and the gateway need updating.

pat70 commented 1 month ago

I see the commits. I'm taking a crack at another PR for the gateway.

pat70 commented 1 month ago

Re: Nested support has been added to DuckDB's Substrait implementation today.

FYI, I tested locally, and this is the generated Substrait that DuckDB accepts: [image not preserved]

EpsilonPrime commented 1 week ago

When DuckDB's release lands (in the next week or two) I'll take a crack at this.