voltrondata / spark-substrait-gateway

Implements a gateway that speaks the SparkConnect protocol and drives a backend using Substrait (over ADBC Flight SQL).
Apache License 2.0

Support "join" operation using column-name strings #60

Closed pthatte1-bb closed 3 weeks ago

pthatte1-bb commented 1 month ago

In-memory DataFrames can be created successfully, but cannot be used in DataFrame joins that name the join column as a string.

Snippet of supported feature:

from pyspark.sql.functions import col  # needed for col() in the join condition

df_customer = get_customer_database(spark_session)
df_temp = spark_session.createDataFrame([(131074, 'Alice'), (131075, 'Bob')], ['join_custkey', 'name'])
df_temp.join(df_customer, on=col("c_custkey").eqNullSafe(col("join_custkey"))).drop("join_custkey").show()

Snippet of requested feature:

df_customer = get_customer_database(spark_session)
df_temp = spark_session.createDataFrame([(131074, 'Alice'), (131075, 'Bob')], ['c_custkey', 'name'])
df_temp.join(df_customer, on="c_custkey").show()

EpsilonPrime commented 3 weeks ago

It turns out the missing feature has nothing to do with virtual tables: joining with a column name passed as a string to "on" was not implemented.
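
For readers unfamiliar with the string form of "on", the following plain-Python sketch illustrates the semantics PySpark documents for it: an inner equi-join on a column that exists on both sides, with the key column appearing only once in the output. This is not the gateway's implementation. The helper name join_using and the c_nationkey values are hypothetical and exist only for illustration.

```python
def join_using(left_rows, right_rows, key):
    """Inner equi-join of two lists of dicts on a shared column name."""
    # Index the right-hand rows by key value for fast lookup.
    index = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)

    joined = []
    for left in left_rows:
        if left[key] is None:
            # Plain equality never matches NULL (unlike eqNullSafe).
            continue
        for right in index.get(left[key], []):
            # Emit the key once, then the remaining left and right columns.
            out = {key: left[key]}
            out.update({k: v for k, v in left.items() if k != key})
            out.update({k: v for k, v in right.items() if k != key})
            joined.append(out)
    return joined


# Hypothetical sample rows standing in for the two DataFrames.
temp_rows = [
    {"c_custkey": 131074, "name": "Alice"},
    {"c_custkey": 131075, "name": "Bob"},
]
customer_rows = [
    {"c_custkey": 131074, "c_nationkey": 16},
    {"c_custkey": 999, "c_nationkey": 1},
]

result = join_using(temp_rows, customer_rows, "c_custkey")
print(result)
```

This is why the requested snippet needs no drop() call, whereas the supported snippet drops "join_custkey" by hand. The sketch does not handle non-key columns that share a name on both sides; here the right-hand value would silently overwrite the left-hand one.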