ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0

feat: support pyarrow UDFs for pyspark backend #9074

Open jstammers opened 2 weeks ago

jstammers commented 2 weeks ago

Is your feature request related to a problem?

PySpark now supports Arrow-optimized Python UDFs, which use Arrow for serialization to make row-by-row execution more efficient, e.g.

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

spark = SparkSession.builder.getOrCreate()

@udf(returnType="int", useArrow=True)
def add_one(x: int) -> int:
    return x + 1

# create a column using the Arrow-optimized UDF
df = pd.DataFrame({"a": [1, 2, 3]})
dfs = spark.createDataFrame(df)
dfs.withColumn("b", add_one("a")).show()

However, the equivalent UDF defined through ibis raises a NotImplementedError, because the PySpark backend currently only supports pandas-based vectorized UDFs:

import ibis
import pyarrow.compute as pc
from ibis import _

@ibis.udf.scalar.pyarrow
def add_one_pyarrow(x: int) -> int:
    # pyarrow UDFs receive pyarrow arrays, so use pyarrow.compute
    return pc.add(x, 1)

@ibis.udf.scalar.pandas
def add_one_pandas(x: int) -> int:
    return x + 1

# `spark` and `df` are the SparkSession and pandas DataFrame defined above
con = ibis.pyspark.connect(spark)
con.create_table("df", df, format="delta", overwrite=True)

table = con.table("df")
table.mutate(b=add_one_pandas(_.a)).execute()
table.mutate(b=add_one_pyarrow(_.a)).execute()  # raises NotImplementedError

What is the motivation behind your request?

Pandas-based UDFs are not supported by the DuckDB backend, but Arrow-based ones are. For my use case, I would like as much parity as possible between the two backends, so being able to use Arrow-based UDFs on a PySpark table would be very useful; see the sketch below.
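
For reference, a minimal sketch of this kind of pyarrow UDF running on the DuckDB backend (the in-memory connection and the table name are illustrative):

import pandas as pd
import pyarrow.compute as pc
import ibis

@ibis.udf.scalar.pyarrow
def add_one_pyarrow(x: int) -> int:
    # the DuckDB backend passes pyarrow arrays, so use pyarrow.compute
    return pc.add(x, 1)

con = ibis.duckdb.connect()  # in-memory database
t = con.create_table("df", pd.DataFrame({"a": [1, 2, 3]}))
t.mutate(b=add_one_pyarrow(t.a)).execute()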

Describe the solution you'd like

I'd like a solution that would allow me to use an Arrow-based UDF on a PySpark table.
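
In the meantime, a possible user-level workaround (just a sketch, assuming the existing pandas UDF path on the PySpark backend; the bridged function name is illustrative) is to wrap the Arrow logic in a pandas UDF and convert between pandas Series and pyarrow arrays at the boundary:

import pyarrow as pa
import pyarrow.compute as pc
import ibis
from ibis import _

@ibis.udf.scalar.pandas
def add_one_bridged(x: int) -> int:
    # convert the incoming pandas Series to a pyarrow array, apply the
    # Arrow implementation, then convert back for the pandas UDF machinery
    return pc.add(pa.Array.from_pandas(x), 1).to_pandas()

# usable with the PySpark table from above, e.g.
# table.mutate(b=add_one_bridged(_.a)).execute()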

What version of ibis are you running?

9.0.0.dev686

What backend(s) are you using, if any?

PySpark

Code of Conduct

I agree to follow this project's Code of Conduct